Spark DataFrames


Details
Data frames in R and Python have become the de facto standards for data science. When it comes to Big Data, however, neither R data frames nor Python data frames integrate well with Big Data tooling to scale up to large datasets. Inspired by R and pandas, Spark's DataFrame provides a concise, powerful programmatic interface designed for structured data manipulation. In particular, compared with traditional data frame implementations, it enables:
• Scaling from kilobytes to petabytes of data
• Reading structured datasets (JSON, Parquet, CSV, relational tables)
• Machine learning integration
• Cross-language support for Java, Scala, and Python
Internally, the DataFrame API builds on Spark SQL's query optimization and query processing capabilities for efficient execution. Data scientists and engineers can use this API to express common data-analytics operations more elegantly. It makes Spark more accessible to a broader range of users and improves optimization for existing ones.
About the Speaker: Michael Armbrust was the initial contributor of Spark SQL and now leads development of the project at Databricks. He received his PhD from UC Berkeley in 2013, where he was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.