Virtual Uber Data Platforms Meetup
Details
https://uber.zoom.us/j/95652311660?pwd=UXhqN2I3b1B2bUxXdlVuemhOSjRLUT09
Passcode: 053544
Join us for a virtual meetup about the most interesting platforms built at Uber to process and manage data at scale! There will be talks about how Hudi can provide ACID semantics to a data lake; Marmaray, a data movement platform built from the ground up at Uber; and Universal Workflow Orchestrator, a platform with a simple drag-and-drop interface that can manage the entire life cycle of a batch or streaming pipeline. You don’t want to miss this event!
Agenda:
11:00am - 11:05am - Welcome
11:05am - 11:30am - Building large scale, transactional data lakes with Apache Hudi - Nishith Agarwal
11:30am - 11:50am - Marmaray - Connecting any source to any sink - Yasaman Samei and Haijing Fu
11:50am - 12:10pm - uWorc - No code workflow orchestrator for building batch & streaming pipelines at scale - Sriharsha Chintalapani
12:10pm - 12:15pm - Closing remarks
Talks:
Building large scale, transactional data lakes with Apache Hudi - Nishith Agarwal
In this talk, we will discuss how Hudi can provide ACID semantics to a data lake. We will cover basic primitives, such as upsert and delete, that are required to achieve acceptable ingestion latencies while providing high-quality data by enforcing schematization on datasets. We will also cover more advanced primitives, such as restore, delta-pull, compaction, and file sizing, that are required for reliability, efficient storage management, and building incremental ETL pipelines. We will dig deeper into Hudi’s metadata model, which allows for O(1) query planning, and show how it helps support time-travel queries that facilitate building feature stores for machine learning use cases. Apache Hudi builds on open-source file formats; we will discuss how to easily onboard your existing dataset to the Hudi format while keeping the same open-source formats, so you can start using all the features provided by Hudi without making any drastic changes to your data lake.
Speaker bio: Nishith manages the Data Lake team at Uber, which helps build and grow Uber’s reliable and scalable big data platform that serves petabytes of data using technologies such as Apache Hudi, Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Nishith is one of the initial engineers on Uber’s data team and is a PMC member of the Apache Hudi project. He actively blogs about various challenges in large-scale data storage and processing, standardized ingestion frameworks, and associated challenges in production environments.
Marmaray - Connecting any source to any sink - Yasaman Samei and Haijing Fu
In this talk, we will present Marmaray, a data movement platform built from the ground up at Uber and designed to handle billions of records ingested into Uber's data lake as well as to disperse data to various online sources. We will dive into Uber's data ingestion and dispersal technical stack, followed by the various architectural components of the platform.
Speaker bio: Yasaman Samei and Haijing Fu are Software Engineers on the Data Sharing Platforms Team and co-tech leads for Uber's data dispersal platform. The team's mission is to unify data-sharing solutions within and outside of Uber.
uWorc - No code workflow orchestrator for building batch & streaming pipelines at scale - Sriharsha Chintalapani
In this talk, we will discuss Universal Workflow Orchestrator, a platform with a simple drag-and-drop interface that can manage the entire life cycle of a batch or streaming pipeline without the user having to write a single line of code.
Speaker bio: Sriharsha Chintalapani is the tech lead for Uber’s Data Product and Streaming platform. The Data Workflows team provides a self-serve platform for thousands of engineers, data scientists, and city ops to build data pipelines that scale to Uber’s needs.


