A Tale Of Three Apache Spark APIs: RDDs, DataFrames, and Datasets


Speaker: Jules Damji, Databricks

Wednesday, November 8, 2017

5:45 pm (Social Hour, light refreshments)
6:30 pm Presentation

Apache Spark is an open-source cluster-computing framework that provides programming interfaces for large-scale data processing with parallelism and fault tolerance.

Of all the developers' delights, none is more attractive than a set of APIs that makes developers productive, is easy to use, and is both intuitive and expressive. Apache Spark offers such APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing, letting you operate on large data sets in languages such as Scala, Java, Python, and R for distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three key takeaways:
Why and when you should use each set of APIs, as a matter of best practice
Apache Spark's performance and optimization benefits
Scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing
Through simple notebook demonstrations with API code examples, you will learn how to process big data using RDDs, DataFrames, and Datasets, and how to interoperate among them. This will be a spoken companion to the blog post, along with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs.

Jules S. Damji is an Apache Spark Community Evangelist and Developer Advocate at Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in Computer Science and an M.A. in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

While light refreshments will be available, feel free to "brown bag" it and bring in food from outside to eat during the social hour.