Open Source Data Processing Engine for TensorFlow at LinkedIn

This is a past event

62 people went



This event is online and is part of our weekly series of online AI tech talks. You can listen, watch, and join the Q&A with speakers from anywhere in the world. If you miss the live session because of a time zone difference or a scheduling conflict, you can still sign up to watch the session replay at any time.


To effectively support deep learning at LinkedIn, we first need to address the data processing issues. Most of the datasets used by our ML algorithms (e.g., LinkedIn's large-scale personalization engine, Photon-ML) are in Avro format. Each record in an Avro dataset is essentially a sparse vector and can be easily consumed by most modern classifiers. However, the format cannot be used directly by TensorFlow, the leading deep learning package. The main blocker is that the sparse vector representation differs from the tensor format that TensorFlow expects.

Many companies have vast amounts of ML data in a similar sparse vector format, and the tensor format is still relatively new to many of them. Avro2TF bridges this gap by providing scalable, Spark-based transformation and extension mechanisms that efficiently convert the data into TFRecords that TensorFlow can readily consume. With this technology, engineers can improve their productivity by focusing on model building rather than data processing.
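
To make the format gap concrete, here is a minimal sketch in plain Python of turning sparse-vector records into the three components of a sparse tensor. It is not Avro2TF's implementation or API. The record layout (a mapping from feature name to value), the feature names, and the helper functions `build_index`, `to_sparse_tensor_parts`, and `to_dense` are all hypothetical and chosen only for illustration. The output triple (indices, values, dense shape) mirrors the arguments that TensorFlow's `tf.sparse.SparseTensor` takes.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: one sparse vector per record,
# stored as {feature name: value}. Real Avro schemas will differ.
Record = Dict[str, float]


def build_index(records: List[Record]) -> Dict[str, int]:
    """Assign every distinct feature name a stable integer column index."""
    names = sorted({name for record in records for name in record})
    return {name: i for i, name in enumerate(names)}


def to_sparse_tensor_parts(
    records: List[Record], index: Dict[str, int]
) -> Tuple[List[Tuple[int, int]], List[float], Tuple[int, int]]:
    """Return (indices, values, dense_shape) in row-major order."""
    indices: List[Tuple[int, int]] = []
    values: List[float] = []
    for row, record in enumerate(records):
        # Visit features in column order so indices stay sorted.
        for name in sorted(record, key=index.__getitem__):
            indices.append((row, index[name]))
            values.append(record[name])
    dense_shape = (len(records), len(index))
    return indices, values, dense_shape


def to_dense(
    indices: List[Tuple[int, int]],
    values: List[float],
    dense_shape: Tuple[int, int],
) -> List[List[float]]:
    """Expand the sparse parts into a dense 2-D list, filling gaps with 0.0."""
    rows, cols = dense_shape
    dense = [[0.0] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


# Two made-up records with different feature sets.
records = [
    {"title=engineer": 1.0, "years_experience": 5.0},
    {"title=manager": 1.0},
]
index = build_index(records)
indices, values, dense_shape = to_sparse_tensor_parts(records, index)
dense = to_dense(indices, values, dense_shape)
```

The key step is the name-to-index mapping: records name their features, while a tensor addresses them by position, so a shared vocabulary has to be built across the whole dataset before any single record can be converted. At scale, that dataset-wide pass is the kind of work that suits a distributed engine such as Spark.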

In this talk, we will go over the data processing issues common to many machine learning pipelines and how we solve them, then take a deep dive into the open-source tool Avro2TF: how it works, its technical architecture, and its usage.

Speaker: Xuhong Zhang, Senior Software Engineer at LinkedIn
