Past Meetup

Connected Components in MapReduce and Beyond

This Meetup is past

36 people went

Location image of event venue

Details

Mathieu Guay-Paquet will present Kiveris et al's Connected Components in MapReduce and Beyond (http://delivery.acm.org/10.1145/2680000/2670997/18-kiveris.pdf). As I understand it, there'll be a bit of a war story here, about improving the efficiency of a Spark job that computes connected components, originally with GraphX. Correspondingly, Mathieu will talk about Spark and RDDs, starting with what I think is the original RDD paper, then GraphX, and then some actual distributed connected components approaches.

I think anyone interested in how Spark works, or doing computations on large graphs, will find something of interest here.

He wrote me a nice list of the papers and comments which I will reproduce exactly (modulo meetup.com's strange formatting):

1. This one is a very nice overview of the way Spark and its main data structure, RDDs, are meant to be used, with enough implementation details to understand what's going on. It's probably a very good roadmap if you decide to dive into the Spark code itself.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Schenker, Ion Stoica — NSDI 2012

web: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

pdf: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

2. Similarly, this is a nice overview of the intent and architecture of the Spark GraphX library, including a few benchmarks comparing it to other frameworks on a graph connected components task and a PageRank task. It contains all the ingredients to see why its approach is poorly suited to the connected components task, but doesn't actually reach this conclusion.

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica — OSDI 2014

web: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/gonzalez

pdf: https://arxiv.org/pdf/1402.2394.pdf

3. This is the paper that found that you can really reduce the number of rounds of map-reduce you can do to compute connected components if you are willing to update your graph structure at each iteration.

Finding Connected Components in Map-Reduce in Logarithmic Rounds

Vibhor Rastogi, Ashwin Machanavajjhala, Laukik Chitnis, Anish Das Sarma — ICDE 2013

web: http://ieeexplore.ieee.org/document/6544813/ (http://ieeexplore.ieee.org/document/6544813/pdf:)

pdf: (http://ieeexplore.ieee.org/document/6544813/pdf:) https://arxiv.org/pdf/1203.5387.pdf

4. This paper fixes the problem with the previous paper where your number of edges can roughly double at each iteration if you're not careful, by choosing either the small-star operation or the large-star operation at each iteration.

Connected Components in MapReduce and Beyond

Raimondas Kiveris, Silvio Lattanzi, Vahab Mirrokni, Vibhor Rastogi, Sergei Vassilvitskii — SOCC 2014

pdf: https://doi.org/10.1145/2670979.2670997