Systems | Information | Learning | Optimization
 

SILO: Theory and practice of LLM quantization

Abstract

Modern LLMs process information by repeatedly applying one basic primitive: matrix multiplication. Estimates suggest that roughly 60-84% of the energy consumed by LLM inference goes into memory load/store operations. How can we reduce this power consumption? An LLM converts text into a sequence of tokens (which can be thought of as 16-18 bit integers), which are subsequently mapped to float vectors with thousands of dimensions, suggesting a very low information density per dimension (for example, a ~17-bit token spread over a 4096-dimensional embedding carries only about 0.004 bits of token identity per coordinate). Thus, unsurprisingly, there has been much success in reducing the precision of both weights and activations without much loss in LLM performance. In this talk we will present an information-theoretic analysis of quantized representations and discuss algorithmic improvements (NestQuant, ICML 2025) that originated from it.
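As a rough illustration of the precision reduction the abstract refers to (and emphatically not the NestQuant algorithm presented in the talk), the sketch below quantizes a weight matrix to 4-bit integers using a generic per-row uniform "absmax" scheme; all names, shapes, and parameters here are illustrative assumptions.

```python
# Minimal sketch of per-row uniform (absmax) 4-bit weight quantization.
# Illustrates generic precision reduction, NOT the NestQuant method from
# the talk; function names, shapes, and parameters are assumptions.
import numpy as np

def quantize_int4(W: np.ndarray):
    """Quantize each row of W to integers in [-7, 7] plus a float scale."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0  # per-row absmax scale
    q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map the 4-bit integers back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)  # toy weight matrix
q, s = quantize_int4(W)
rel_err = np.linalg.norm(W - dequantize(q, s)) / np.linalg.norm(W)
print(f"relative reconstruction error at 4 bits/weight: {rel_err:.3f}")
```

Even this naive scheme keeps the relative reconstruction error modest at 4 bits per weight; the information-theoretic analysis in the talk addresses how much better one can do.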

Bio

Yury Polyanskiy is a Cutten Professor of Electrical Engineering and Computer Science, a member of IDSS and LIDS at MIT, and an IEEE Fellow. Yury received his Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, in 2010. His research interests span information theory, machine learning, and statistics. Dr. Polyanskiy won the 2020 IEEE Information Theory Society James Massey Award, the 2013 NSF CAREER award, and the 2011 IEEE Information Theory Society Paper Award.

January 21, 2026
12:30 pm (1h)

Orchard View Room

MIT, Yury Polyanskiy