Spark DataFrames for Large-Scale Data Science

Hosted By
Scott W. and 2 others

Details

Data frames in R and Python have become the de facto standard for data science. However, when it comes to Big Data, neither R nor Python data frames integrate well with Big Data tooling, and neither scales up to large datasets.

In this talk, Reynold Xin from Databricks will introduce the new DataFrame abstraction in Spark for large-scale data science. Inspired by R and Pandas, DataFrame provides concise, powerful programmatic interfaces designed for structured data manipulation. In particular, when compared with traditional data frame implementations, it enables:

  • Scaling from kilobytes to petabytes of data

  • Reading structured datasets (JSON, Parquet, CSV, relational tables, ...)

  • Machine learning integration

  • Cross-language support for Java, Scala, and Python

Internally, the DataFrame API builds on Spark SQL's query optimization and query processing capabilities for efficient execution. Data scientists and engineers can use this API to more elegantly express common operations in data analytics. It makes Spark more accessible to a broader range of users and improves optimizations for existing ones.
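
To make the idea of an API backed by a query optimizer more concrete, here is a toy sketch in plain Python. It is not Spark code and does not reflect how Spark SQL is implemented: the LazyFrame class, its select and where methods, and the single "move filters ahead of projections" rule are all hypothetical names and simplifications chosen for illustration. The point is only that operations are recorded as a plan first, and the plan can be reordered before any data is touched.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def _project(rows: Iterable[Row], columns: Tuple[str, ...]) -> Iterator[Row]:
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {name: row[name] for name in columns}


def _filter(rows: Iterable[Row], column: str,
            predicate: Callable[[object], bool]) -> Iterator[Row]:
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    for row in rows:
        if predicate(row[column]):
            yield row


class LazyFrame:
    """Hypothetical stand-in for a DataFrame: records operations, runs on collect()."""

    def __init__(self, source: Iterable[Row], plan: tuple = ()):
        self._source = source
        self._plan = tuple(plan)  # logical plan: sequence of (operation, argument)

    def select(self, *columns: str) -> "LazyFrame":
        return LazyFrame(self._source, self._plan + (("select", columns),))

    def where(self, column: str, predicate: Callable[[object], bool]) -> "LazyFrame":
        return LazyFrame(self._source, self._plan + (("where", (column, predicate)),))

    def optimized_plan(self) -> List[tuple]:
        """Swap a filter ahead of the projection before it when that is safe,
        i.e. when the projection keeps the column the filter reads."""
        plan = list(self._plan)
        changed = True
        while changed:
            changed = False
            for i in range(len(plan) - 1):
                (op_a, arg_a), (op_b, arg_b) = plan[i], plan[i + 1]
                if op_a == "select" and op_b == "where" and arg_b[0] in arg_a:
                    plan[i], plan[i + 1] = plan[i + 1], plan[i]
                    changed = True
        return plan

    def collect(self) -> List[Row]:
        """Execute the optimized plan and return the resulting rows."""
        rows: Iterable[Row] = self._source
        for op, arg in self.optimized_plan():
            if op == "select":
                rows = _project(rows, arg)
            else:
                rows = _filter(rows, arg[0], arg[1])
        return list(rows)


# Made-up sample data for the demonstration.
people = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Linus", "age": 28, "city": "Helsinki"},
    {"name": "Grace", "age": 45, "city": "New York"},
]

# The user writes select-then-filter; the plan is reordered to filter first.
query = LazyFrame(people).select("name", "age").where("age", lambda age: age > 30)
plan_order = [op for op, _ in query.optimized_plan()]
result = query.collect()
```

Because nothing runs until collect() is called, the caller can write operations in whatever order reads most naturally and leave the execution order to the planner, which is the general benefit the paragraph above attributes to building on a query optimizer.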

The talks will be livestreamed, and the video will be published on the Apache Spark channel (https://www.youtube.com/user/TheApacheSpark) on YouTube.

Also, a badge for Strata + Hadoop World (http://strataconf.com/big-data-conference-ca-2015) is NOT required to attend the meetup.

Agenda

UPDATE[1]: Due to speaker adjustments, our presentation will be 7-8pm (was 7-9pm)

UPDATE[2]: Due to a problem with the venue, we will not be providing food tonight, sorry!

6:00pm - 7:00pm - Drinks, food, mingle

7:00pm - 8:00pm - Presentations

8:00pm and after - More mingling

Bay Area Spark Meetup
San Jose Convention Center, Room 210B/F
150 W San Carlos · San Jose, CA