Past Meetup

Resilient Distributed Datasets

29 people went

Marco Rietveld presents "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"

About the Paper

In the early 2010s, a team at UC Berkeley (Zaharia et al.) developed Apache Spark to analyze large data sets, because Apache Hadoop and map-reduce left 2 challenges unmet. It was (1) hard to do complex multi-stage or iterative analysis of the data and (2) hard to do interactive ad-hoc queries on the data. While working on this problem, the team hit upon RDDs, and Apache Spark was born!

This paper presents the Resilient Distributed Dataset (RDD) abstraction, the primary data abstraction in Apache Spark. Because RDDs can keep intermediate results in memory across the cluster, Spark runs much faster than Hadoop on iterative workloads, and the programming model lets programmers and data analysts do the 2 tasks mentioned above. In short, an RDD is a resilient, immutable, partitioned and distributed collection of elements that can be operated on in parallel.
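To make that one-line definition a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the `ToyRDD` class and all of its method names are made up for illustration. It only models three of the properties named above: the data is split into partitions, transformations return a new dataset instead of changing the old one, and each dataset remembers its parent (its lineage) so that any single partition can be recomputed if it is lost. Nothing here actually runs in parallel or on a cluster.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # tuple-based partitions; only set on a root dataset
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # function applied to one parent partition

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Deal the elements round-robin into immutable (tuple) partitions.
        data = list(data)
        parts = [tuple(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(source=parts)

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        # Lazy: nothing is computed yet, we only record how to compute it.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute_partition(self, index):
        # "Resilient": one partition can always be rebuilt from its lineage,
        # without touching the other partitions.
        if self._parent is None:
            return list(self._source[index])
        return self._fn(self._parent.compute_partition(index))

    def collect(self):
        # An action: walk every partition and gather the results.
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute_partition(i))
        return out


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(sorted(squares_of_evens.collect()))
```

Because each partition is computed independently from its own lineage, a real system could run the partitions on different machines and rebuild just the ones that were lost when a machine fails; the sketch only hints at that by exposing `compute_partition`.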

If you know a little bit about map-reduce and are interested in big data processing, this talk is perfect for you! Among other things, I'll be quickly going over the PageRank algorithm.
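For readers who want a head start on PageRank, here is a minimal plain-Python sketch of the iterative version of the algorithm. It is an illustration under stated assumptions, not the code from the paper or the talk: the `pagerank` function, the tiny `links` graph and the damping factor of 0.85 are all assumptions chosen for the example, and the sketch assumes every page has at least one outgoing link and that every link target is itself a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iterative PageRank on a dict mapping page -> list of pages it links to.

    Assumes every page has at least one outgoing link and every link
    target appears as a key in `links` (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform ranking

    for _ in range(iterations):
        # Each page splits its current rank evenly over its outgoing links.
        contribs = {page: 0.0 for page in pages}
        for page, outgoing in links.items():
            share = ranks[page] / len(outgoing)
            for dest in outgoing:
                contribs[dest] += share
        # New rank: a small baseline plus the damped sum of contributions.
        ranks = {
            page: (1 - damping) / n + damping * contribs[page]
            for page in pages
        }
    return ranks


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Every pass of the loop reads the same `links` data and the ranks from the previous pass, which is the kind of multi-stage, iterative analysis described above as hard to do with map-reduce.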

About the Speaker

Marco is a Java Software Engineer at Luminis and is very interested in Data Engineering, Data Science and IoT integrations. He stays as far away as possible from anything having to do with the front-end, reads too much sci-fi and can throw frisbees really well. He lives in Utrecht.

This talk is targeted at those who have at least a basic familiarity with a programming language and understand how the map-reduce algorithm works (it's pretty simple, just Google it if you don't). Some understanding of the challenges of distributed systems or distributed data processing will be helpful but is not necessary.