Learning from Societal Data: Theory and Practice

Machine learning algorithms for policy and decision making are becoming ubiquitous. In many societal applications, the inferences we can draw are severely limited not by the number of subjects in the data but by the limited number of observations available for each subject. My research tackles these limitations from both theoretical and practical perspectives. In this talk, I will focus on two instances.

(i) Many scientific domains, such as the social sciences and epidemiology, study heterogeneous populations (with varying demographics) in which only a few observations are available for each individual (sparse data). We illustrate using real-world data that failing to account for heterogeneity and sparsity can lead to false conclusions. We then investigate a fundamental and practically relevant problem: learning from a heterogeneous population with sparse data. While the maximum likelihood estimator (MLE) is widely used, its optimality and sample complexity under sparsity were not well understood. We prove that the MLE is optimal even in the sparse regime, resolving a problem that had been open since the 1960s. We then use these results to construct new, optimal estimators for learning the “change” before and after a policy is introduced.

(ii) Data available in abundance is often unlabeled. Labels required for supervised learning tasks are obtained from humans who may not be domain experts, which limits the type and accuracy of the labels that can be obtained. To overcome this limitation, we propose an approach to query design that leverages tasks that do not require expertise. Specifically, we cluster the answers to easier “comparison queries”. Exploiting insights from graph clustering, we show in practice that, for a fixed cost, our query design can double the amount of data collected while simultaneously reducing errors by 20%.
March 3 @ 12:30 pm (1h)

Discovery Building, Orchard View Room

Ramya Vinayak