Skip to content

Scale@Uber/Lyft: Managing data lake, workflow & Spark on Kubernetes

Photo of Chester Chen
Hosted By
Chester C.
Scale@Uber/Lyft:  Managing data lake, workflow & Spark on Kubernetes

Details

Agenda:
6 -- 6:30 pm check-in & light food+ Networking
6:35 -- 6:40 pm Intro
6:40 -- 7:15 pm Talk 1 (Lyft)
7:15 - 7:50 pm talk 2 (Uber)
7:50 - 8:25 PM talk 3 (Uber)

Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of this mission Lyft invests heavily in open source infrastructure and tooling. At Lyft Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both Machine Learning and large scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, We will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics Include: - Key traits of Apache Spark on Kubernetes. - Deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data. - How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling. - Dynamic job scale estimation and runtime dynamic job configuration. - How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.

Speaker: Li Gao
Li Gao is the tech lead in the cloud native spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups etc. on various technical leadership positions on cloud native and hybrid cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.

Talk 2. Managing Uber’s Data workflow at Scale.
Uber microservices serving millions of rides a day, leading to 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber’s journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected several components of the system—such as scheduling and serialization—to make them highly available and more scalable.

Speaker Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. In 19-year, he’s had experience across several software disciplines, including distributed systems, data infrastructure, and full stack development.

Talk 3: Building highly efficient data lakes using Apache Hudi (Incubating)

Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & in-efficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pin pointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth, sizes files of the resulting data lake using purely open-source file formats, also providing for optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is Technical Lead at Uber Data Infrastructure Team

Photo of SF Big Analytics group
SF Big Analytics
See more events
GoPro Headquarters Building D
3000 Clearview Way Building D · San Mateo, CA