This talk will provide a technical overview of Spark’s DataFrame API in the context of data science, from exploratory data analysis to ETL to machine learning. We will review the API with a demo on a real-world dataset, covering data input/output, summary statistics, missing-data handling, and statistical functions. We will then dive into the internals of the DataFrame implementation and close with where DataFrames fit in the long-term Spark roadmap and ecosystem.
Reynold Xin is a cofounder of Databricks and a committer on Apache Spark, driving the design of Spark's next-generation API and execution engine. He holds the current world record in 100TB sorting (Daytona GraySort), beating the previous record by a factor of 3. On leave from his PhD at UC Berkeley's AMPLab, he also wrote the most-cited papers at SIGMOD 2011 and SIGMOD 2013.