Marco Rietveld presents "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
About the Paper
In the early 2010s, a team at UC Berkeley (Zaharia et al.) developed Apache Spark to analyze large data sets, because Apache Hadoop and its map-reduce model left two challenges unmet: it was (1) hard to do complex multi-stage or iterative analysis of the data, and (2) hard to run interactive, ad-hoc queries on it. While working on these problems, the team hit upon RDDs, and Apache Spark was born!
This paper presents the Resilient Distributed Dataset (RDD) abstraction, the primary data abstraction in Apache Spark. RDDs allow Spark to perform much faster than Hadoop and introduce a model that lets programmers and data analysts tackle the two challenges mentioned above. In short, an RDD is a resilient, immutable, partitioned, and distributed collection of elements that can be operated on in parallel.
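To give a flavour of the programming model: Spark's RDD API lets you chain transformations like flatMap, map, and reduceByKey over a distributed collection. Here's a minimal plain-Python sketch of the classic word-count pattern those operations express (plain Python rather than the actual Spark API, so the example runs anywhere; the input lines are made up for illustration):

```python
from collections import defaultdict

# Sketch of what Spark expresses as:
#   textFile(...).flatMap(split).map(pair).reduceByKey(add)
lines = ["to be or not to be", "to do is to be"]  # stand-in for a distributed text file

# flatMap: split each line into a flat list of words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In Spark, each of these steps would run in parallel across the partitions of an RDD, and the lineage of transformations is what makes the dataset "resilient": a lost partition can be recomputed from its inputs rather than restored from a replica.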
If you know a little bit about map-reduce and are interested in big data processing, this talk is perfect for you! Among other things, I'll be quickly going over the PageRank algorithm.
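Since PageRank is the paper's flagship example of iterative computation, here is a minimal power-iteration sketch on a made-up four-page link graph (the graph and the iteration count are illustrative; the 0.15/0.85 damping constants are the ones used in the paper's PageRank example):

```python
# Hypothetical link graph: page -> pages it links to
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

# Start every page with rank 1.0
ranks = {page: 1.0 for page in links}

for _ in range(20):  # each loop is one iteration Spark would run over an RDD
    # Each page sends rank / (number of outlinks) to every page it links to
    contribs = {page: 0.0 for page in links}
    for page, outs in links.items():
        for out in outs:
            contribs[out] += ranks[page] / len(outs)
    # Damped update, as in the paper's PageRank example
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)
```

The point the paper makes is that Hadoop would write the ranks to disk after every iteration, whereas Spark keeps the link graph and the ranks in memory as RDDs across all twenty iterations.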
About the Speaker
Marco is a Java Software Engineer at Luminis and is very interested in Data Engineering, Data Science, and IoT integrations. He stays as far away as possible from anything having to do with the front-end, reads too much sci-fi, and can throw frisbees really well. He lives in Utrecht.
This talk is targeted at those who have at least a basic familiarity with a programming language and understand how the map-reduce algorithm works (it's pretty simple; just google it if you don't). Some understanding of the challenges of distributed systems or distributed data processing will be helpful, but isn't necessary.