The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. I will first discuss some general mathematical principles allowing for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. I will argue that optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or stochastic gradient descent (SGD).
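
For orientation, one standard form of the PL condition can be sketched as follows. The symbols here (loss $L$, parameters $w$, infimum $L^*$, constant $\mu > 0$, smoothness constant $\beta$) are notation assumed for this note rather than taken from the talk, which may use a variant of the condition.

$$
\tfrac{1}{2}\,\|\nabla L(w)\|^2 \;\ge\; \mu\,\bigl(L(w) - L^*\bigr)
$$

If $L$ is additionally $\beta$-smooth and the inequality holds along the iterates, gradient descent with step size $1/\beta$ contracts the loss gap geometrically, with no convexity assumption:

$$
L(w_{t+1}) - L^* \;\le\; \Bigl(1 - \frac{\mu}{\beta}\Bigr)\,\bigl(L(w_t) - L^*\bigr)
$$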

As a separate but related development, I will talk about the remarkable, recently discovered phenomenon of transition to linearity (constancy of the neural tangent kernel, NTK), in which networks become linear functions of their parameters as their width increases. In particular, I will discuss a quite general form of the transition to linearity for a broad class of feedforward networks corresponding to arbitrary directed graphs. It turns out that the width of such networks is characterized by the minimum in-degree of their graphs, excluding the input layer and the first layer.

Finally, I will mention a very interesting deviation from linearity, the so-called “catapult phase”, a recently identified non-linear and, furthermore, non-perturbative phenomenon that persists even as neural networks become increasingly linear in the limit of increasing width.

Orchard View Room, Virtual

Misha Belkin