Connected Components in MapReduce and Beyond

This is a past event

36 people went

Location image of event venue


Mathieu Guay-Paquet will present Kiveris et al's Connected Components in MapReduce and Beyond ( As I understand it, there'll be a bit of a war story here, about improving the efficiency of a Spark job that computes connected components, originally with GraphX. Correspondingly, Mathieu will talk about Spark and RDDs, starting with what I think is the original RDD paper, then GraphX, and then some actual distributed connected components approaches.

I think anyone interested in how Spark works, or doing computations on large graphs, will find something of interest here.

He wrote me a nice list of the papers and comments which I will reproduce exactly (modulo's strange formatting):

1. This one is a very nice overview of the way Spark and its main data structure, RDDs, are meant to be used, with enough implementation details to understand what's going on. It's probably a very good roadmap if you decide to dive into the Spark code itself.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Schenker, Ion Stoica — NSDI 2012



2. Similarly, this is a nice overview of the intent and architecture of the Spark GraphX library, including a few benchmarks comparing it to other frameworks on a graph connected components task and a PageRank task. It contains all the ingredients to see why its approach is poorly suited to the connected components task, but doesn't actually reach this conclusion.

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica — OSDI 2014



3. This is the paper that found that you can really reduce the number of rounds of map-reduce you can do to compute connected components if you are willing to update your graph structure at each iteration.

Finding Connected Components in Map-Reduce in Logarithmic Rounds

Vibhor Rastogi, Ashwin Machanavajjhala, Laukik Chitnis, Anish Das Sarma — ICDE 2013

web: (

pdf: (

4. This paper fixes the problem with the previous paper where your number of edges can roughly double at each iteration if you're not careful, by choosing either the small-star operation or the large-star operation at each iteration.

Connected Components in MapReduce and Beyond

Raimondas Kiveris, Silvio Lattanzi, Vahab Mirrokni, Vibhor Rastogi, Sergei Vassilvitskii — SOCC 2014