Spark DataFrames and ML Pipelines for Large-Scale Data Science

Hosted by John M.

Details

We are pleased to welcome Reynold Xin and Joseph Bradley from Databricks (https://databricks.com/) to the NYC Data Science meetup. Reynold and Joseph will be speaking about two new tools for doing data science in Apache Spark, one of today's most exciting data technologies.

Food and drink will be provided by eBay, our hosts for the evening.

NOTE: The meeting room has a maximum capacity of 72 people. We have set the RSVP limit higher to accommodate some number of no-shows, but we will have to turn people away if we reach capacity.

Abstract: Data frames in R and Python have become the de facto standards for data science. However, when it comes to Big Data, neither R nor Python data frames integrate well with big data tooling or scale up to large datasets. In this talk, we will introduce the two latest efforts in Spark to scale up data science: DataFrames and machine learning pipelines.

Inspired by data frames in R and pandas, the DataFrame API in Spark provides a concise, powerful programmatic interface designed for structured data manipulation. In particular, it features:

• Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via SparkR)

On top of DataFrames, we have built a new machine learning (ML) pipeline API inspired by the similarly named concept in scikit-learn. ML pipelines enable users to express ML workflows as a sequence of processing and learning stages. For example, classifying text documents might involve cleaning the text, transforming raw text into feature vectors, and training a classification model.

Speakers: Reynold Xin is a committer on Apache Spark and a co-founder of Databricks. Before Databricks, he was pursuing a Ph.D. at the UC Berkeley AMPLab. He holds the current world record for sorting 100TB of data, and wrote the two most-cited papers at SIGMOD 2013 and SIGMOD 2011.

Joseph Bradley is a Software Engineer at Databricks, working on Spark MLlib. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon University in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

Reynold and Joseph will be in NYC for Spark Summit (http://spark-summit.org/east/2015/agenda). Members of the meetup group can get 20% off registration by using code "NYC-DATA-SCI".

NYC Data Science
eBay NYC
625 6th Avenue, Floor 3 · New York, NY