Abstract
Given the massive scale of modern ML models, we now get only a single shot to train them effectively, which restricts our ability to test multiple architectures and hyper-parameter configurations. Instead, we need to understand how these models scale so that we can experiment with smaller problems and then transfer those insights to larger-scale models. In this talk, I will present a framework for analyzing scaling laws in stochastic learning algorithms using a power-law random features model (PLRF), leveraging tools from high-dimensional probability and random matrix theory. I will then use this scaling law to address the compute-optimal question: How should we choose model size and hyper-parameters to achieve the best possible performance in the most compute-efficient manner? Then, using the PLRF model, I will devise a new momentum-based algorithm that provably improves the scaling-law exponent. Finally, I will present numerical experiments on LSTMs that show how this new stochastic algorithm can be applied to real data to improve the compute-optimal exponent.
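As background for the compute-optimal question, here is a minimal illustrative sketch, not taken from the talk itself: it assumes a generic scaling law with hypothetical exponents a, b > 0, model size d, number of training iterations t, and a compute budget C proportional to d times t.

\[
  L(d, t) \;\approx\; c_1\, d^{-a} + c_2\, t^{-b},
  \qquad C \propto d\, t .
\]

Substituting t = C/d and minimizing over d gives the compute-optimal choices

\[
  d^\star \propto C^{\frac{b}{a+b}}, \qquad
  t^\star \propto C^{\frac{a}{a+b}}, \qquad
  L^\star \propto C^{-\frac{ab}{a+b}} ,
\]

so improving the underlying scaling-law exponents directly improves the compute-optimal exponent ab/(a+b), which is the kind of quantity the algorithmic improvements described in the abstract target.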
Bio
Dr. Courtney Paquette is an Assistant Professor of Mathematics and Statistics at McGill University, Montreal, Quebec, Canada. Dr. Paquette received her PhD from the Mathematics Department at the University of Washington. Dr. Paquette’s research broadly focuses on designing and analyzing algorithms for large-scale optimization problems, motivated by applications in data science. The techniques Dr. Paquette uses draw from a variety of fields including probability, complexity theory, convex and non-smooth analysis as well as the study of scaling limits of stochastic algorithms.
Orchard View Room
Courtney Paquette, McGill University