Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed stochastic gradient descent (SGD), when run synchronously, suffers from delays while waiting for the slowest workers (stragglers). Asynchronous methods can alleviate the straggler problem, but they introduce gradient staleness that can adversely affect convergence. In the first part of the talk, I will present the first theoretical characterization of the speed-up offered by asynchronous methods, obtained by analyzing the trade-off between the error in the trained model and the actual training runtime (wall-clock time). In the second part of the talk, I will discuss a unified convergence analysis of communication-efficient distributed SGD algorithms, which include federated, elastic, and decentralized averaging. The novelty of our work is that the runtime analysis accounts for random gradient computation and communication delays, which helps us design and compare distributed SGD algorithms that achieve the fastest true convergence with respect to wall-clock time.

https://vimeo.com/304378557

November 28, 2018

12:30 pm (1h)

Discovery Building, Orchard View Room

Gauri Joshi