Stochastic gradient descent (SGD) approximates the objective function’s gradient using a fixed and typically small number of examples, known as the batch size in mini-batch SGD. Small batch sizes can introduce a significant amount of noise into the gradient estimate near the optimum. This work presents a method that grows the batch size adaptively with model quality and requires no more computation than standard SGD. With this method, convergence is significantly improved, in terms of the number of model updates, for strongly convex, convex and non-convex functions. The method is easier to tune because its hyper-parameters have no time dependence, and it is more amenable to distributed systems. Simulations and experiments are performed to confirm and extend the theoretical results.
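As a rough illustration of growing the batch with model quality, the sketch below fits a one-parameter least-squares model and sets the batch size inversely proportional to the current loss. The rule `batch = ceil(c / loss)`, the function name `adaptive_batch_sgd`, and the parameters `lr` and `c` are assumptions made for this example and are not taken from the talk. For simplicity the sketch evaluates the loss on all of the data at every step, which costs more than standard SGD; the method described above is stated to avoid that extra cost.

```python
import math
import random


def adaptive_batch_sgd(xs, ys, lr=0.1, c=1.0, n_updates=200, seed=0):
    """Fit y ~ w * x by mini-batch SGD with a loss-dependent batch size.

    Hypothetical rule (an assumption for illustration only):
    batch = ceil(c / loss), clipped to the range [1, len(xs)].
    """
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    batch_sizes = []
    for _ in range(n_updates):
        # Use the full-data loss as the measure of "model quality".
        # This is a simplification: it costs one pass over the data per step.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)
        # A lower loss (better model) gives a larger batch.
        batch = min(n, max(1, math.ceil(c / max(loss, 1e-12))))
        # Sample the batch with replacement and average the gradients.
        idx = [rng.randrange(n) for _ in range(batch)]
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch
        w -= lr * grad
        batch_sizes.append(batch)
    return w, batch_sizes


# Synthetic data: y = 3x plus a small amount of Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(500)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w, batch_sizes = adaptive_batch_sgd(xs, ys)
print("estimated w:", w)
print("first and last batch sizes:", batch_sizes[0], batch_sizes[-1])
```

In this sketch the early updates use very small batches, where a noisy gradient is still informative, and the batch grows as the loss approaches its noise floor. The hyper-parameters `lr` and `c` stay fixed for the whole run, in the spirit of the "no time dependence" point in the abstract.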
July 25, 2019
4:00 pm (1h)
Memorial Library, Room 126
Scott Sievert