Vivek Farias

SILO: Preference Modeling for LLM Alignment under Heterogeneity

Abstract. LLM alignment methods typically learn a single reward model (either implicitly or explicitly) from pairwise comparison data. This approach implicitly assumes homogeneous preferences across human labelers, an assumption that is violated in practice. As a result, the learned reward model is generally misspecified: prior work shows that it …