Spark DataFrames and ML Pipelines for Large-scale Data Science


Details
We're excited to have Reynold Xin and Xiangrui Meng from Databricks speak about recent developments in Spark: what's coming and what design choices were made along the way.
Abstract
Data frames in R and Python have become the de facto standard for data science. However, when it comes to Big Data, neither R nor Python data frames integrate well with Big Data tooling, nor can they scale up to large datasets. In this talk, we will introduce the two latest efforts in Spark to scale up data science: DataFrames and machine learning pipelines.
Inspired by R and Pandas, the DataFrame API in Spark provides concise, powerful programmatic interfaces designed for structured data manipulation. In particular, it features:
- Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
- Support for a wide array of data formats and storage systems
- State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
- Seamless integration with all big data tooling and infrastructure via Spark
- APIs for Python, Java, Scala, and R (in development via SparkR)
On top of DataFrames, we have built a new machine learning (ML) pipeline API. ML workflows often involve a sequence of processing and learning stages. For example, classifying text documents might involve cleaning the text, transforming raw text into feature vectors, and training a classification model. Realistic workflows are often even more complex, including cross-validation to choose parameters and combining multiple data sources. With most current ML tools, it is difficult to set up practical pipelines. Inspired by scikit-learn, we proposed simple APIs to help users quickly assemble and tune practical ML pipelines. Under the hood, the API integrates seamlessly with Spark SQL's DataFrames and leverages their data sources, flexible column operations, rich data types, and execution plan optimization to create efficient and scalable implementations.
Schedule
6:30-7:15 Social (Food and drinks served)
7:15-8:15 Talk and Questions
8:15-8:45 Social
Bios
Xiangrui Meng is a committer on Apache Spark. He has been actively involved in the development of Spark MLlib and the new DataFrame API. Before working on Spark, he was an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. Xiangrui holds a PhD in Computational Mathematics from Stanford University.
Reynold Xin is a committer on Apache Spark and a co-founder of Databricks. Before Databricks, he was pursuing a PhD at the UC Berkeley AMPLab. He holds the current world record for sorting 100TB of data, and wrote the highest-cited papers of SIGMOD 2013 and SIGMOD 2011.
Sponsor
Thank you to our sponsor Wix.com (http://dev.wix.com/?utm_source=datascience&utm_medium=datascience&utm_campaign=sflounge&experiment_id=datascience) for hosting us at the Wix Lounge and Databricks (https://databricks.com/) for sponsoring the food and drinks.
