Details

Please register at our event partner AICamp for the zoom link:
https://www.aicamp.ai/event/eventdetails/W2022101917

Agenda

4:55 - 5:00pm Intro
5:00 - 5:45pm Talk 1 + Q&A
5:45 - 6:30pm Talk 2 + Q&A
6:30pm Close

Talk 1: Scaling Training and Batch Inference: A Deep Dive into Ray AIR's Data Processing Engine

Are you looking to scale your ML pipeline to multiple machines? Are you encountering an ingest bottleneck that prevents you from saturating your GPUs? This talk covers how Ray AIR uses Ray Datasets for efficient data loading and preprocessing in both training and batch inference, and dives into how Datasets achieves high performance and scalability.

We start by giving an overview of creating training and batch inference pipelines with Ray AIR. Next, we dive into the Ray Datasets internals, detailing features such as distributed data sharding, parallel and distributed I/O and transformations, pipelining of CPU and GPU compute, autoscaling pools of inference workers, and efficient per-epoch shuffling. Finally, we present case studies of users who have deployed such AIR workloads to production and have seen the performance and scalability benefits. You can learn more about Ray Datasets here: https://docs.ray.io/en/latest/data/dataset.html
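For a flavor of the kind of pipeline the talk discusses, here is a minimal sketch (not code from the talk) of a Ray Datasets batch-inference job, written against the Ray 2.0-era Datasets API (details differ in newer Ray releases); the synthetic data and the identity Predictor "model" are placeholders.

import ray

# Synthetic tabular data standing in for a real source such as
# ray.data.read_parquet("s3://...").
ds = ray.data.from_items([{"feature": float(i)} for i in range(1000)])

# CPU preprocessing, applied in parallel across the dataset's blocks.
def preprocess(batch):
    batch["feature"] = batch["feature"] * 2.0  # placeholder transform
    return batch

ds = ds.map_batches(preprocess, batch_format="pandas")

# Batch inference with a callable class so each worker loads its model once;
# the identity "model" here is a placeholder for a real (e.g. GPU) model.
class Predictor:
    def __init__(self):
        self.model = lambda x: x

    def __call__(self, batch):
        batch["prediction"] = self.model(batch["feature"])
        return batch

# compute="actors" runs the predictor on a pool of actor workers (Ray 2.0 API).
predictions = ds.map_batches(Predictor, compute="actors", batch_format="pandas")
predictions.show(3)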

Talk 2: Large-scale data shuffle in Ray with Exoshuffle

Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired a myriad of implementations. While these have greatly improved shuffle performance and reliability over time, they have come at a cost: flexibility. We show that, contrary to popular wisdom, shuffle can be implemented with high performance and reliability on a general-purpose system for distributed computing: Ray.

In this talk, we present Exoshuffle, an application-level shuffle system that outperforms Spark and achieves 82% of theoretical performance on a 100TB sort on 100 nodes. In Ray 2.0, we have integrated Exoshuffle with the Datasets library to provide high-performance large-scale shuffle for ML users.
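From the user's side, the shuffle that the abstract says Exoshuffle backs in Ray 2.0 is exposed through the Datasets API; a minimal sketch, with a toy dataset standing in for a large partitioned one:

import ray

# Toy dataset standing in for a large, partitioned dataset.
ds = ray.data.range(100_000)

# Fully shuffle rows across all blocks; per the abstract, in Ray 2.0 this
# large-scale shuffle path is backed by the Exoshuffle integration.
shuffled = ds.random_shuffle()

print(shuffled.take(5))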

Speakers:
Clark Zinzow is a software engineer at Anyscale, working on Ray's dataplane and ML ecosystem. He enjoys working on data-intensive distributed systems and scaling ML infrastructure.

Stephanie Wang is a PhD student in distributed systems at UC Berkeley, a software engineer at Anyscale, and a lead committer for the Ray project. Currently, she's working on problems such as fault tolerance and distributed memory management. She is generally interested in the problem of making general-purpose distributed programming possible and in designing fast and reliable distributed systems.

Jiajun Yao is a software engineer at Anyscale and a committer for the Ray project. He is interested in making distributed computing easily accessible to everyone. Before joining Anyscale, Jiajun was a software engineer at LinkedIn, building its graph database.

Jules S. Damji is a lead developer advocate at Anyscale Inc., an MLflow contributor, and co-author of Learning Spark, 2nd Edition. He is a hands-on developer with over 25 years of experience and has worked at leading companies such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, and Databricks, building large-scale distributed systems. He holds a B.Sc. and M.Sc. in computer science (from Oregon State University and Cal State) and an MA in communication (JHU).

Machine Learning
Distributed Systems
Parallel Programming
Big Data
