Learning from Societal Data: Theory and Practice

Machine learning algorithms for policy and decision making are becoming ubiquitous. In many societal applications, the inferences we can draw are severely limited not by the number of subjects in the data but by the small number of observations available for each subject. My research tackles these limitations from both theoretical and practical perspectives. In this talk, I will focus on two instances:
(i) Many scientific domains, such as the social sciences and epidemiology, study heterogeneous populations (varying demographics) with only a few observations available at the level of individuals. We illustrate using real-world data that not accounting for heterogeneity and sparsity can lead to false conclusions. We then investigate a fundamental and practically relevant problem: learning from a heterogeneous population with sparse data. While the maximum likelihood estimator (MLE) is widely used, its optimality and sample complexity under sparsity were not well understood. We prove that the MLE is optimal even in the sparse regime, resolving a problem that has been open since the 1960s. We then use these results to construct new, optimal estimators for learning the “change” before and after a policy is introduced.
(ii) Data available in abundance is often unlabeled. Labels required for supervised learning tasks are obtained from humans who may not be domain experts, which limits the type and accuracy of labels that can be obtained. To overcome this limitation, we propose an approach to query design that leverages tasks that do not require expertise, specifically by clustering answers to easier “comparison queries”. Exploiting insights from graph clustering, we show in practice that, for a fixed cost, our query design can double the amount of data collected while simultaneously reducing errors by 20%.

Ramya Korlakai Vinayak is an assistant professor in the Dept. of ECE at UW-Madison. Her research interests span the areas of machine learning, statistical inference, and crowdsourcing. Her work focuses on addressing theoretical and practical challenges that arise when learning from societal data. Prior to joining UW-Madison, Ramya was a postdoctoral researcher in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. She received her Ph.D. in Electrical Engineering from Caltech. She is a recipient of the Schlumberger Foundation Faculty for the Future fellowship for 2013-15 and was an invited participant at the Rising Stars in EECS workshop in 2019. She obtained her Master's from Caltech and her Bachelor's from IIT Madras.

November 4 @ 12:30 pm (1h)


Ramya Vinayak