Data pipelines at Facebook


Details
In this talk, Facebook engineering tech lead Marian Olteanu will present the backend data infrastructure at Facebook, focusing on the development of pipelines used for data transformation, learning, index building and analytics.
Short description:
Facebook’s multi-datacenter, multi-cluster Hive infrastructure, which hosts hundreds of petabytes of data, is the backbone for the data processing that powers data quality, inference, index building and analytics pipelines. The data is managed through a Python-based pipeline scheduler named Dataswarm, which lets engineers and data analysts write arbitrary pipelines. In this talk, Marian will cover Dataswarm, tools that help manage the data while writing pipelines, tools to monitor the pipelines, and tools to test the pipelines prior to deployment in production.
About the speaker:
Marian lives in New York. For the past 20 months, he has worked on Facebook’s places data quality team, working on place deduplication and attribute completeness problems. He balances his time between ML and NLP problems (improving the accuracy of predictions) and engineering issues (performance, scaling, reliability). Prior to Facebook, he worked on Natural Language Processing problems (translation, question answering, information extraction).
7:00 PM - Pizza and Beer
7:30 PM - Talk begins
