Connected Components in MapReduce and Beyond

Details
Mathieu Guay-Paquet will present Kiveris et al's Connected Components in MapReduce and Beyond (http://delivery.acm.org/10.1145/2680000/2670997/18-kiveris.pdf). As I understand it, there'll be a bit of a war story here, about improving the efficiency of a Spark job that computes connected components, originally with GraphX. Correspondingly, Mathieu will talk about Spark and RDDs, starting with what I think is the original RDD paper, then GraphX, and then some actual distributed connected components approaches.
I think anyone interested in how Spark works, or doing computations on large graphs, will find something of interest here.
He wrote me a nice list of the papers and comments which I will reproduce exactly (modulo meetup.com's strange formatting):
- This one is a very nice overview of the way Spark and its main data structure, RDDs, are meant to be used, with enough implementation details to understand what's going on. It's probably a very good roadmap if you decide to dive into the Spark code itself.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Schenker, Ion Stoica — NSDI 2012
web: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
pdf: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
- Similarly, this is a nice overview of the intent and architecture of the Spark GraphX library, including a few benchmarks comparing it to other frameworks on a graph connected components task and a PageRank task. It contains all the ingredients to see why its approach is poorly suited to the connected components task, but doesn't actually reach this conclusion.
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica — OSDI 2014
web: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/gonzalez
pdf: https://arxiv.org/pdf/1402.2394.pdf
- This is the paper that found that you can really reduce the number of rounds of map-reduce you can do to compute connected components if you are willing to update your graph structure at each iteration.
Finding Connected Components in Map-Reduce in Logarithmic Rounds
Vibhor Rastogi, Ashwin Machanavajjhala, Laukik Chitnis, Anish Das Sarma — ICDE 2013
web: http://ieeexplore.ieee.org/document/6544813/ (http://ieeexplore.ieee.org/document/6544813/pdf:)
pdf: (http://ieeexplore.ieee.org/document/6544813/pdf:) https://arxiv.org/pdf/1203.5387.pdf
- This paper fixes the problem with the previous paper where your number of edges can roughly double at each iteration if you're not careful, by choosing either the small-star operation or the large-star operation at each iteration.
Connected Components in MapReduce and Beyond
Raimondas Kiveris, Silvio Lattanzi, Vahab Mirrokni, Vibhor Rastogi, Sergei Vassilvitskii — SOCC 2014

Connected Components in MapReduce and Beyond