Brian Clapper: RDDs, DataFrames and Datasets in Apache Spark


Details
Brian will reprise his Northeast Scala Symposium talk, for those who missed it—with a little new content related to Spark 2.0.
Azavea has graciously agreed, once again, to host this meetup in their awesome new HQ at 990 Spring Garden Street.
Vistar Media (http://www.vistarmedia.com/about) will be sponsoring the pizza and drinks.
Talk abstract:
Traditionally, Apache Spark jobs have been written using Resilient Distributed Datasets (RDDs), a Scala Collections-like API. RDDs are type-safe, but they can be problematic: it's easy to write a suboptimal job, and RDDs are significantly slower in Python than in Scala. The Spark DataFrames API addresses some of these problems, and DataFrames are much faster, even in Scala; however, DataFrames aren't type-safe, and they're arguably less flexible.
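To make the contrast concrete, here's a minimal sketch of the same query written against the RDD API and the DataFrame API. The "people.csv" file and its (name, age) columns are assumed purely for illustration; this is not code from the talk.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
import spark.implicits._

// RDD version: type-safe, but Spark can't see inside the lambdas to optimize them.
val adultsRdd = spark.sparkContext.textFile("people.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).trim.toInt))
  .filter { case (_, age) => age >= 21 }

// DataFrame version: untyped Rows, but the query plan is optimized by Catalyst.
val adultsDf = spark.read.option("header", "true").csv("people.csv")
  .select($"name", $"age".cast("int"))
  .filter($"age" >= 21)
```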
Enter Datasets, a type-safe, object-oriented programming interface that works with the DataFrames API, provides some of the benefits of RDDs, and can be optimized via the Catalyst optimizer.
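A rough sketch of what that looks like in practice, assuming a hypothetical Person case class and a "people.json" input file: you get compile-time checking over typed objects while the plan still runs through Catalyst.
```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder.appName("datasets-sketch").getOrCreate()
import spark.implicits._

// Read into a strongly typed Dataset[Person].
val people = spark.read.json("people.json").as[Person]

// Filter with an ordinary Scala lambda over Person objects;
// field access is checked at compile time.
val adults = people.filter(_.age >= 21)
adults.show()
```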
This talk will briefly recap RDDs and DataFrames, introduce the Datasets API, and then, through a live demonstration on Spark 2.0 Preview, compare all three against the same non-trivial data source.
