Abstract
LLM alignment methods typically learn a single reward model (either implicitly or explicitly) from pairwise comparison data. This approach implicitly assumes homogeneous preferences across human labelers, an assumption that is violated in practice. As a result, the learned reward model is generally misspecified: prior work shows that it is inconsistent with the population-average utility, incurring large distortion, and that recovering the average utility is provably impossible in the worst case. In this work, we show that the average utility is recoverable under a relatively mild assumption. Our accompanying estimator is a potentially surprising repurposing of Manski’s Maximum Score Estimator: we simply replace the standard cross-entropy loss function in reward learning pipelines with a notion of binary classification loss. We show that doing so: (1) recovers a reward model that is ordinally consistent with the population-average utility, (2) recovers Nash Learning from Human Feedback, and (3) enjoys dimension-dependent finite-sample convergence rates that are the first of their kind in the context of this estimator.
Joint work with Ali Aouad and Aymane El Gadarri
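To make the loss swap described in the abstract concrete, here is a minimal Python sketch contrasting the two objectives on pairwise comparison data. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the exact form of the classification loss, so a plain 0-1 loss on the sign of the reward margin is assumed here, and the function names, the margin values, and the convention of counting ties as errors are all hypothetical.

```python
import math

def cross_entropy_loss(margins, labels):
    """Standard logistic (Bradley-Terry style) loss: mean of log(1 + exp(-y * margin))."""
    return sum(math.log1p(math.exp(-y * m)) for m, y in zip(margins, labels)) / len(margins)

def zero_one_loss(margins, labels):
    """Assumed binary classification loss: fraction of comparisons whose
    margin sign disagrees with the label (ties counted as errors)."""
    return sum(1.0 for m, y in zip(margins, labels) if y * m <= 0) / len(margins)

# Hypothetical reward margins r(a) - r(b) for four comparisons, with
# labels +1 if a labeler preferred a and -1 if they preferred b.
margins = [2.0, 0.5, -1.0, 3.0]
labels = [1, -1, -1, 1]

print("cross-entropy:", cross_entropy_loss(margins, labels))
print("0-1 loss:", zero_one_loss(margins, labels))
```

The 0-1 loss depends only on whether the reward model ranks each pair the same way the labeler did, not on how large the margin is, whereas the cross-entropy loss also penalizes margin magnitude.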
Bio
Vivek is interested in the development of new methodologies and applications for large-scale dynamic optimization. He received his Ph.D. in Electrical Engineering from Stanford University in 2007 and is the Patrick J. McGovern (1959) Professor at MIT. Vivek is a recipient of an INFORMS MSOM Student Paper Prize (2006), an INFORMS JFIG Paper Prize (2009, 2011), the NSF CAREER Award (2011), MIT Sloan’s Outstanding Teacher Award (2013), the INFORMS Simulation Society Best Publication Award (2014), the INFORMS Pricing and Revenue Management Best Publication Award (2015), the INFORMS MSOM Best Publication Award in Management Science (2016), the MSOM Young Scholar Prize (2020), the Wagner Prize (2022), and the Pierskalla Award (2024), and is an INFORMS Fellow (2025). Vivek’s doctoral advisees have on various occasions won the Nicholson, MSOM, APS and RMP student paper prizes. Outside of academia, Vivek was co-founder/CTO at Celect (2014-19; acquired by Nike) and was a corresponding author of the technology at Seer (2018-2020; IPO). Most recently, he was co-founder/CTO at Cimulate (2023-26; acquired by Salesforce) and is currently head of science for Salesforce Commerce Cloud.
Orchard View Room
MIT, Vivek Farias