Distributed implementations of popular machine learning algorithms exhibit poor scaling when deployed on more than a few tens of compute nodes. The key sources of this poor performance are communication bottlenecks and straggler nodes in the system. In this talk, I will explain why these bottlenecks pose a real challenge for scaling up, and will provide insights on how to overcome them using simple algebraic ideas. I will show experiments in which simple theoretical insights lead to distributed training algorithms with significant speedup gains, and will conclude with several open problems that lie at the intersection of machine learning and distributed systems.
Video: https://vimeo.com/241040532
October 25, 2017
12:30 pm (1h)
Discovery Building, Orchard View Room
Dimitris Papailiopoulos