Details

Apache Hudi is a data platform technology that helps build reliable and scalable data lakes. Hudi brings stream processing to big data, supercharging your data lakes and making them orders of magnitude more efficient. Hudi is widely used at Uber and other companies to build transactional data lakes.

Please join us for a virtual meetup hosted by Uber and the Apache Hudi community. We will kick off with an update on the Apache Hudi 0.8.0 release, followed by interesting talks from speakers at Uber, Robinhood and Logical Clocks.

12:00pm - 12:15pm - Welcome + Tech Talk - Nishith Agarwal (Uber)
12:15pm - 12:40pm - Tech Talk - Moritz Meister (Logical Clocks)
12:40pm - 01:05pm - Tech Talk - Robinhood (Balaji Varadarajan, Vikrant Goel and Josh Kang)

More information:

Hudi as a feature store - Moritz Meister (Logical Clocks)
A feature store is a platform that stores features for machine learning (ML) and provides consistent, secure access to them for both training and serving models. A feature store is generally a dual-database system with a low-latency key-value online feature store for serving single feature vectors, and a columnar offline feature store for batch processing and historical feature extraction. In this talk, we concentrate on the offline feature store and motivate our decision to support Apache Hudi with copy-on-write tables as the default offline feature store in the open-source Hopsworks platform. An offline feature store has many similarities with a modern data lake: it needs scalable storage of data, incremental ingestion of data, ACID guarantees for updates to data, and support for Spark in data pipelines. We will present our feature store API (the hsfs library) and show how it simplifies Hudi table operations using PySpark. Our feature store also extends Hudi tables with metadata, such as feature statistics and access control, and provides a UI to visualize commits to Hudi tables over time.
-Moritz Meister is one of the lead engineers at Logical Clocks developing the Hopsworks Feature Store. His focus is on user-facing APIs and integrations. He has previously worked as a data scientist on projects for Deutsche Telekom and Deutsche Lufthansa in Germany, helping to productionize machine learning models to improve customer relationship management.

Hudi@Robinhood - Balaji Varadarajan, Vikrant Goel and Josh Kang (Robinhood)
In this talk, we will discuss the evolution of Robinhood’s RDS data lake. We will start with the prior snapshot-based architecture and its pain points. We will then introduce the new CDC-based architecture that we built on top of Apache Hudi and other existing open source technologies. We will cover the general technical challenges of operating CDC pipelines at scale and the steps we took to alleviate them. We will end with the new challenges we are working on.
-Balaji Varadarajan is a data architect at Robinhood, where he broadly oversees Robinhood’s data lake. He is also an Apache Hudi PMC member. Previously, he was a tech lead on Uber’s data ingestion team and one of the lead engineers on LinkedIn’s Databus change capture system. Balaji’s interests lie in distributed data systems.
-Vikrant Goel is a lead engineer on Robinhood's Data Platform, responsible for CDC and database data ingestion into the data lake. Previously, he was a tech lead on WalmartLabs' algorithmic pricing team and an engineer in Oracle's Fusion Middleware group. He likes to play with data.
-Josh Kang is a software engineer at Robinhood. His work mainly focuses on data ingestion with CDC. Previously, he interned at LinkedIn and Microsoft.

Hudi@Uber - Nishith Agarwal (Uber)
In this talk, we will cover the latest Apache Hudi 0.8.0 release, discussing its new features and what’s upcoming.
-Nishith Agarwal manages the Data Lake team at Uber and is an Apache Hudi PMC member.

https://uber.zoom.us/j/98515688949?pwd=SmZWbk9ILzFsc3V4K2hXdUFkbWZOQT09
Passcode: 695601

Sponsors

Uber

Reliable transportation everywhere, for everyone.
