Distributed Time Travel for Feature Generation at Netflix

Name: Distributed Time Travel for Feature Generation at Netflix
Start: 2016-03-24T18:00:00-07:00
End: 2016-03-24T21:00:00-07:00
Location: Yelp

Hosted By

Chester C.

Details

Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past.

The first step of building this time machine is to snapshot the data from various microservices on a regular basis. We built a general-purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests.

Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data to accumulate for newly implemented features. Moreover, building it with Apache Spark empowered us both to scale up the data size by an order of magnitude and to train and validate the models in less time. Finally, using Apache Zeppelin notebooks, we are able to interactively prototype features and run experiments.
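To make the snapshot-then-encode idea concrete, here is a minimal sketch in plain Python. It is not Netflix's implementation, and it does not use Apache Spark. The names `SnapshotStore`, `encode_play_stats`, `offline_features`, and `online_features`, and the shape of the snapshot data, are all assumptions made up for illustration. It shows only the two ideas the abstract describes: looking up the latest snapshot taken at or before a chosen point in time, and running the same feature encoder for both offline and online use.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]
FeatureEncoder = Callable[[Snapshot], Dict[str, float]]


@dataclass
class SnapshotStore:
    """Hypothetical in-memory stand-in for periodically snapshotted service data."""

    _times: Dict[str, List[datetime]] = field(default_factory=dict)
    _data: Dict[str, List[Snapshot]] = field(default_factory=dict)

    def put(self, member_id: str, taken_at: datetime, snapshot: Snapshot) -> None:
        # Keep each member's snapshots sorted by the time they were taken.
        times = self._times.setdefault(member_id, [])
        data = self._data.setdefault(member_id, [])
        idx = bisect_right(times, taken_at)
        times.insert(idx, taken_at)
        data.insert(idx, snapshot)

    def as_of(self, member_id: str, when: datetime) -> Snapshot:
        # "Time travel": return the latest snapshot taken at or before `when`.
        times = self._times.get(member_id, [])
        idx = bisect_right(times, when)
        if idx == 0:
            raise KeyError(f"no snapshot for {member_id!r} at or before {when}")
        return self._data[member_id][idx - 1]


def encode_play_stats(snapshot: Snapshot) -> Dict[str, float]:
    # Illustrative encoder; the "plays"/"minutes" fields are assumed, not from the talk.
    plays = snapshot.get("plays", [])
    count = float(len(plays))
    minutes = float(sum(p["minutes"] for p in plays))
    return {"play_count": count, "avg_minutes": minutes / count if count else 0.0}


def offline_features(store: SnapshotStore, encoder: FeatureEncoder,
                     member_id: str, when: datetime) -> Dict[str, float]:
    # Offline path: encode the historical snapshot as of `when`.
    return encoder(store.as_of(member_id, when))


def online_features(encoder: FeatureEncoder, live_state: Snapshot) -> Dict[str, float]:
    # Online path: the very same encoder, applied to live data.
    return encoder(live_state)
```

Calling `offline_features(store, encode_play_stats, member_id, when)` with different `when` values yields features as they would have looked at each of those times, provided a snapshot exists at or before that time. Passing the same encoder function to both paths is one simple way to keep training-time and scoring-time features consistent.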

Speaker Bio: DB Tsai

DB Tsai is an Apache Spark committer and a Senior Research Engineer working on Personalized Recommendation Algorithms at Netflix. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he implemented several algorithms, including Linear Regression and Binary/Multinomial Logistic Regression with Elastic-Net (L1/L2) regularization, using L-BFGS/OWL-QN optimizers in Apache Spark. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master’s degree in Electrical Engineering from Stanford University.

Agenda:
6:00 pm -- 6:40 pm networking + light dinner
6:40 pm -- 6:45 pm introduction + announcements
6:45 pm -- 8:00 pm main talk + Q&A
8:00 pm -- 8:30 pm closing
8:45 pm office closed

Important: the entrance closes at 7:00 pm (Yelp security no longer allows members to come in after 7 pm).

SF Big Analytics

Thursday, March 24, 2016
6:00 PM to 9:00 PM PDT

Yelp

140 New Montgomery · San Francisco, CA
