Connected Components in MapReduce and Beyond

Name: Connected Components in MapReduce and Beyond
Start: 2017-08-30T18:00:00-04:00
End: 2017-08-30T21:00:00-04:00
Location: Foulab

Hosted By Papers We Love, Montreal

public group

Details

Mathieu Guay-Paquet will present Kiveris et al's Connected Components in MapReduce and Beyond (http://delivery.acm.org/10.1145/2680000/2670997/18-kiveris.pdf). As I understand it, there'll be a bit of a war story here, about improving the efficiency of a Spark job that computes connected components, originally with GraphX. Correspondingly, Mathieu will talk about Spark and RDDs, starting with what I think is the original RDD paper, then GraphX, and then some actual distributed connected components approaches.

I think anyone interested in how Spark works, or doing computations on large graphs, will find something of interest here.

He wrote me a nice list of the papers and comments which I will reproduce exactly (modulo meetup.com's strange formatting):

This one is a very nice overview of the way Spark and its main data structure, RDDs, are meant to be used, with enough implementation details to understand what's going on. It's probably a very good roadmap if you decide to dive into the Spark code itself.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Schenker, Ion Stoica — NSDI 2012

web: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

pdf: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Similarly, this is a nice overview of the intent and architecture of the Spark GraphX library, including a few benchmarks comparing it to other frameworks on a graph connected components task and a PageRank task. It contains all the ingredients to see why its approach is poorly suited to the connected components task, but doesn't actually reach this conclusion.

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica — OSDI 2014

web: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/gonzalez

pdf: https://arxiv.org/pdf/1402.2394.pdf

This is the paper that found that you can really reduce the number of rounds of map-reduce you can do to compute connected components if you are willing to update your graph structure at each iteration.

Finding Connected Components in Map-Reduce in Logarithmic Rounds

Vibhor Rastogi, Ashwin Machanavajjhala, Laukik Chitnis, Anish Das Sarma — ICDE 2013

web: http://ieeexplore.ieee.org/document/6544813/ (http://ieeexplore.ieee.org/document/6544813/pdf:)

pdf: (http://ieeexplore.ieee.org/document/6544813/pdf:) https://arxiv.org/pdf/1203.5387.pdf

This paper fixes the problem with the previous paper where your number of edges can roughly double at each iteration if you're not careful, by choosing either the small-star operation or the large-star operation at each iteration.

Connected Components in MapReduce and Beyond

Raimondas Kiveris, Silvio Lattanzi, Vahab Mirrokni, Vibhor Rastogi, Sergei Vassilvitskii — SOCC 2014

pdf: https://doi.org/10.1145/2670979.2670997

Events in Montréal, QC