DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks

Hosted by Subash DSouza

Details

Abstract:

This talk will provide a technical overview of Spark’s DataFrame API in the context of data science, from exploratory data analysis to ETL to machine learning. We will review the API with a demo using a real-world dataset, covering data input/output, summary statistics, missing data handling, and statistical functions. We will then dive into the internals of DataFrame implementations, followed by how we view DataFrame in the long-term Spark roadmap and ecosystem.

Bio:
Reynold Xin is a cofounder of Databricks and a committer on Apache Spark, where he drives the design of Spark's next-generation API and execution engine. He holds the current world record in 100TB sorting (Daytona GraySort), beating the previous record by a factor of 3. Currently on leave from his PhD at the UC Berkeley AMPLab, he also wrote the most-cited papers in SIGMOD 2011 and SIGMOD 2013.

Los Angeles Apache Spark Users Group
Internap Data Center - LAX014
3690 Redondo Beach Ave · Redondo Beach, CA