SILO: Theory and practice of LLM quantization
Abstract

Modern LLMs process information by repeatedly applying a basic primitive: matrix multiplication. Estimates show that roughly 60-84% of the energy consumed by LLMs goes into memory load/store operations. How can we reduce this power consumption? An LLM converts text into a sequence of tokens (which can be thought of as …