Overcoming the Challenges of Learning in Parallel
Distributed implementations of popular machine learning algorithms scale poorly when deployed on more than a few tens of compute nodes. The two main causes are communication bottlenecks and straggler nodes. In this talk, I will explain why these bottlenecks are a real challenge …