Spark MLlib: Making Practical Machine Learning Easy and Scalable


Details
This month we have Xiangrui Meng from Databricks presenting "Spark MLlib: Making Practical Machine Learning Easy and Scalable". Feel free to arrive around 6:40 for early networking and we'll start the talk at 7:05.
As always, we'll try to tape the event and the waitlist is automatic (so don't ask).
Abstract
MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, we will share our experience in developing MLlib. The talk will cover both the higher-level APIs (ML pipelines) that make MLlib easy to use and the lower-level optimizations that make MLlib scale to massive datasets.
ML workflows often involve a sequence of processing and learning stages. Realistic workflows are often even more complex, including cross-validation to choose parameters and combining multiple data sources. Inspired by scikit-learn, we proposed simple APIs to help users quickly assemble and tune ML pipelines. Under the hood, the pipeline API integrates seamlessly with Spark SQL’s DataFrames and uses its data sources, flexible column operations, rich data types, and execution plan optimization to create efficient and scalable implementations.
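To make the "sequence of processing and learning stages" idea concrete, here is a toy sketch in plain Python. It is not the MLlib API: every class name below (Tokenizer, KeywordCounter, ThresholdClassifier, Pipeline, and so on) is hypothetical and exists only for illustration. It assumes rows are plain dictionaries, that transformers expose transform(), and that estimators expose fit() and return a fitted model.

```python
# Toy illustration of chaining stages into a pipeline.
# All names here are hypothetical; this is NOT the MLlib API.

class Tokenizer:
    """Transformer: adds a 'words' column with lowercase tokens."""
    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class KeywordCounter:
    """Transformer: adds a 'feature' column counting one keyword."""
    def __init__(self, keyword):
        self.keyword = keyword

    def transform(self, rows):
        return [dict(row, feature=row["words"].count(self.keyword)) for row in rows]


class ThresholdModel:
    """Fitted model: predicts 1 when the feature exceeds the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=int(row["feature"] > self.threshold)) for row in rows]


class ThresholdClassifier:
    """Estimator: learns a cut-off halfway between the class means."""
    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1]
        neg = [r["feature"] for r in rows if r["label"] == 0]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ThresholdModel(threshold)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Fits estimators in order, feeding each stage the previous output."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # pass transformed rows onward
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "Spark spark is fast", "label": 1},
    {"text": "Spark rocks", "label": 1},
    {"text": "Hadoop is slow", "label": 0},
    {"text": "nothing here", "label": 0},
]
test = [{"text": "I like Spark"}, {"text": "plain text"}]

pipeline = Pipeline([Tokenizer(), KeywordCounter("spark"), ThresholdClassifier()])
model = pipeline.fit(train)
predictions = [row["prediction"] for row in model.transform(test)]
print(predictions)
```

The point of the design is that the whole chain is fit and applied as one object, so the same feature steps run at training and prediction time. Parameter tuning such as cross-validation could then wrap the pipeline as a unit.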
There are many factors affecting a parallel implementation of an ML algorithm, e.g., optimization algorithm, platform limitations, communication pattern, data locality, numerical stability and performance, and fault tolerance. Different implementations of the same ML algorithm can perform dramatically differently. We will share lessons learned from optimizing the alternating least squares (ALS) implementation in MLlib.
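For readers unfamiliar with ALS, the following is a minimal single-machine sketch of the basic idea, restricted to a rank-1 factorization in plain Python. It is not MLlib's implementation and reflects none of the distributed optimizations the talk covers. The function name als_rank1, the regularization value, and the tiny ratings table are all assumptions made for illustration.

```python
# Toy rank-1 alternating least squares on a sparse ratings table.
# Hypothetical sketch for illustration; NOT MLlib's implementation.

def als_rank1(ratings, num_users, num_items, reg=0.01, iterations=20):
    """Approximate ratings[(i, j)] by u[i] * v[j].

    ratings: dict mapping (user, item) -> observed rating.
    reg: L2 regularization weight (assumed value).
    """
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors; each user factor has a closed-form solution.
        for i in range(num_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = sum(v[j] ** 2 for (a, j), r in ratings.items() if a == i) + reg
            u[i] = num / den
        # Fix the user factors; solve for each item factor the same way.
        for j in range(num_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), r in ratings.items() if b == j) + reg
            v[j] = num / den
    return u, v


# Observed entries of a rank-1 matrix with user factors [1, 2, 3] and
# item factors [2, 1]; the entry (2, 1), whose true value is 3, is held out.
ratings = {
    (0, 0): 2.0, (0, 1): 1.0,
    (1, 0): 4.0, (1, 1): 2.0,
    (2, 0): 6.0,
}

u, v = als_rank1(ratings, num_users=3, num_items=2)
predicted = u[2] * v[1]  # estimate for the held-out entry
print(round(predicted, 2))
```

Each half-step is an independent least-squares solve per user or per item, which is what makes the method a natural candidate for parallel execution. How the factors and ratings are partitioned and communicated is where implementations can differ.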
Speaker Bio
Xiangrui Meng is a committer on Apache Spark. He has been actively involved in the development of Spark MLlib and the new DataFrame API. Before working on Spark, he was an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. Xiangrui holds a PhD in Computational Mathematics from Stanford University.
