Distributed ML in Spark

Hosted By
Matthew H.

Details

Pizza and drinks will be provided, courtesy of Bloomberg.

Schedule

6:00 PM: Doors open, food available
6:15 - 7:15 PM: Presentation
7:15 - 8:00 PM: Mingling and discussion

Title: Distributed ML in Apache Spark

Abstract

From a laptop to a cluster. From R&D to production. From data science to engineering teams. These leaps often form difficult, expensive barriers when applying Machine Learning to modern datasets and deploying ML in real-world applications. This talk will discuss Apache Spark’s MLlib library for large-scale ML, focusing on how we simplified elements of production-grade ML by building MLlib on top of Spark’s distributed DataFrame/Dataset API.

We will cover these key areas:

Data sources: Spark DataFrames provide a common API for many data sources. The same API can be used for both local “ML” formats like LibSVM and for distributed formats like JSON, Parquet, Avro, and others.

Pipelines: DataFrames provide a natural API for ML workflows, which MLlib represents via a “Pipeline” abstraction. Data flows through an ML Pipeline, is augmented via transformations and model predictions, and can be materialized efficiently as needed via lazy computation.

Under the hood: Spark DataFrames support both local and distributed execution, allowing local development and distributed deployment of the same code. Prediction also benefits from DataFrame optimizations, allowing us to separate non-ML optimizations from the ML library.

Model persistence: MLlib supports saving and loading models using DataFrames’ efficient columnar storage, including distributed models too large to store on a single disk. Persistence covers not only single models but also entire Pipelines.

4 languages: MLlib offers APIs in Python, Scala, Java, and R. Under the hood, a single data representation using DataFrames prevents code duplication and reduces expensive data serialization.
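
To make the Pipelines item above more concrete, here is a minimal sketch in plain Python of the general idea: a chain of stages, each adding a column to the rows it receives, evaluated lazily. This is not MLlib's API. The `Stage` and `Pipeline` classes, the column names, and the stand-in "model" are all hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


class Stage:
    """Hypothetical pipeline stage: returns each row with one new column added."""

    def __init__(self, output_col: str, fn: Callable[[Row], Any]) -> None:
        self.output_col = output_col
        self.fn = fn

    def transform(self, rows: Iterable[Row]) -> Iterator[Row]:
        # Generator: nothing is computed until a consumer pulls rows.
        for row in rows:
            yield {**row, self.output_col: self.fn(row)}


class Pipeline:
    """Chains stages so each one sees the columns added by earlier stages."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def transform(self, rows: Iterable[Row]) -> Iterator[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return iter(rows)


# Toy stages: tokenize text, count tokens, then apply a stand-in "model".
pipeline = Pipeline([
    Stage("words", lambda r: r["text"].lower().split()),
    Stage("n_words", lambda r: len(r["words"])),
    Stage("prediction", lambda r: 1.0 if "spark" in r["words"] else 0.0),
])

data = [{"text": "Spark makes ML pipelines simple"}, {"text": "Hello world"}]
lazy_result = pipeline.transform(data)  # builds the chain; no rows processed yet
results = list(lazy_result)             # materializes the augmented rows on demand
print(results[0]["prediction"], results[1]["n_words"])
```

In this sketch each row keeps its original columns and gains new ones as it passes through the stages, and no work happens until the result is consumed. Spark applies the same idea at cluster scale over DataFrames rather than Python generators.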

This talk will be accessible to those not familiar with Spark and MLlib but will also include details useful to experienced Spark users. We will focus on the upcoming Spark 2.0 release but will also mention the future roadmap.

Bio

Joseph Bradley is a Software Engineer and Apache Spark PMC member working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon University in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

Spark-NYC
Civic Hall
156 5th Avenue, 2nd floor · New York, NY