Brian Clapper: RDDs, DataFrames and Datasets in Apache Spark


Brian will reprise his Northeast Scala Symposium talk, for those who missed it—with a little new content related to Spark 2.0.

Azavea has graciously agreed, once again, to host this meetup in their awesome new HQ at 990 Spring Garden Street.

Vistar Media will be sponsoring the pizza and drinks.

Talk abstract:

Traditionally, Apache Spark jobs have been written using Resilient Distributed Datasets (RDDs), a Scala Collections-like API. RDDs are type-safe, but they can be problematic: It's easy to write a suboptimal job, and RDDs are significantly slower in Python than in Scala. The Spark DataFrames API addresses some of these problems, and DataFrames are much faster, even in Scala; however, DataFrames aren't type-safe, and they're arguably less flexible.

Enter Datasets, a type-safe, object-oriented programming interface that works with the DataFrames API, provides some of the benefits of RDDs, and can be optimized via the Catalyst optimizer.

This talk will briefly recap RDDs and DataFrames, introduce the Datasets API, and then, through a live demonstration on Spark 2.0 Preview, compare all three against the same non-trivial data source.