The large-scale nature of modern machine learning makes distributed computation indispensable. Gradient-based optimization methods have the potential for massive speed-ups in distributed setups. Unfortunately, distributed machine learning is often beset by computational bottlenecks. One such bottleneck is the presence of straggler nodes, worker nodes that run considerably slower than others. Recently, gradient coding, the use of coding-theoretic techniques for straggler mitigation, has been proposed. While prior work has mainly focused on on exact reconstruction of the desired output, slightly inexact computations can be acceptable in applications that are robust to noise, such as distributed model training. We will present a gradient coding scheme based on sparse random graphs that guarantees fast and approximately accurate distributed computations, especially for gradient-based algorithms. We show that by sacrificing a small amount of accuracy, we can greatly increase robustness to stragglers.
Discovery Building, Orchard View Room
Wooseok Ha, Zachary Charles