Large-scale Problems on a Tight Budget

For all problems we provide some sort of guarantee that is much better than worst-case. For the first part of this talk, I will present our work on streaming, memory-limited Principal Component Analysis. Therein, we give the first finite-sample, global-convergence guarantees for an algorithm that solves PCA in the single-pass streaming setting. The algorithm is shown to exhibit optimal sample complexity, within logarithmic factors, under the spiked covariance model. It uses optimal memory, when the true components are dense. Then, motivated by storage and network capacity limitations, I will propose a method for fast PageRank approximations on graph engines. Our system uses random walks as a quantized alternative to the power iteration, eliminates the need of global jumps for PageRank, and introduces randomized synchronization. This last contribution is a novel extension to the GraphLab engine that could be useful to other workloads as well, as it reduces network usage. We achieve a very good approximation and a serious speedup (7-10x) over the vanilla system. Our theory bounds the resulting approximation error. Finally, I will discuss a new method for finding dense subgraphs in very large graphs; a significant improvement over the previous state of the art. The problem of the densest k-subgraph is NP-hard and hard to approximate. Our algorithm furnishes a k-subgraph, along with a graph-dependent guarantee that tells us that, in most real graphs we test, we achieve at least 70% of the optimal density. It scales almost linearly in the size of the graph and parallelizes well. Our MapReduce implementation works on billion-edge graphs.

March 25, 2015

12:30 pm (1h)

Discovery Building, Orchard View Room

Ioannis MItliagkas