DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks

Name: DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
Start: 2015-07-09T19:00:00-07:00
End: 2015-07-09T22:00:00-07:00
Location: Internap Data Center - LAX014

Hosted by Subash D.

Los Angeles Apache Spark Users Group

Details

Abstract:

This talk will provide a technical overview of Spark’s DataFrame API in the context of data science, from exploratory data analysis to ETL to machine learning. We will review the API with a demo using a real-world dataset, covering data input/output, summary statistics, missing data handling, and statistical functions. We will then dive into the internals of DataFrame implementations, followed by how we view DataFrame in the long-term Spark roadmap and ecosystem.

Bio:
Reynold Xin is a cofounder of Databricks and a committer on Apache Spark, driving the design of Spark's next-gen API and execution engine. He holds the current world record in 100TB sorting (Daytona GraySort), beating the previous record by a factor of 3. On leave from his PhD at the UC Berkeley AMPLab, he also wrote the most-cited papers in SIGMOD 2011 and SIGMOD 2013.
