Containerizing Data Workflows and Testing Data Pipelines


Details
Schedule:
6:00 - Doors & Food
6:30 - Talk 1
7:15 - Talk 2
7:45 - Wrap & Chat
Talk 1: Pros and cons of containerizing data workflows (and how to have the best of both worlds)
Speaker: Tian Xie, Data Engineer @ Enigma
Abstract:
At Enigma, we run over one hundred workflows to ingest public data into our system. Running so many workflows also means managing dependencies and deployment for each of them. Over time, we have iterated through several solutions to this problem, and this is our story. Spoiler: Docker is involved, but (plot twist) it only leads to another set of problems in the second act.
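For a flavor of what this looks like in practice, here is a minimal sketch (not Enigma's actual setup): each workflow is baked into its own Docker image so its dependencies stay isolated, and a thin Python launcher runs it via the docker-py SDK. The image name and command below are hypothetical.

    # Hypothetical launcher: one container per workflow, so dependency
    # conflicts between workflows disappear. Requires `pip install docker`.
    import docker

    client = docker.from_env()

    def run_workflow(image: str, command: str) -> str:
        """Run one workflow in an isolated container and return its logs."""
        # With detach=False (the default), this blocks until the container
        # exits and returns its combined output as bytes.
        output = client.containers.run(image, command, remove=True)
        return output.decode("utf-8")

    if __name__ == "__main__":
        # "workflows/census-ingest:latest" is a made-up image name.
        print(run_workflow("workflows/census-ingest:latest", "python ingest.py"))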
Bio:
Tian Xie has been working in the NYC tech start-up scene for the last eight years on consumer video rendering, on-demand shipping, and now data engineering at Enigma Technologies.
Talk 2: Building a Data Pipeline with Testing in Mind
Speaker: Jiaqi Liu, Software Engineer @ Button, Inc.
Abstract:
It’s one thing to build a robust data pipeline in Python, but a whole other challenge to find the tooling and build out the framework that allows a data process to be tested. In order to truly iterate on and develop a codebase, one has to be able to test confidently during development and to monitor the production system.
In this talk, I hope to address the key components of building out end-to-end testing for data pipelines by borrowing concepts from how we test Python web services. Just as we check for healthy status codes in our API responses, we want to be able to check that a pipeline is working as expected given the correct inputs. We’ll talk about key features that allow a data pipeline to be easily testable and about how to identify time-series metrics that can be used to monitor the health of a data pipeline.
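By way of illustration (not the speaker's code), the analogy to a web-service test might look like the sketch below: a small, fixed input is fed through a pipeline step and the output is asserted exactly, the pipeline equivalent of asserting a 200 status code. The transform and field names are hypothetical.

    # A hypothetical pipeline step and a pytest-style test for it:
    # known input in, expected output asserted.
    def clean_records(records):
        """Drop rows missing an 'id' and lowercase the 'name' field."""
        return [
            {**r, "name": r["name"].lower()}
            for r in records
            if r.get("id") is not None
        ]

    def test_clean_records():
        raw = [
            {"id": 1, "name": "Alice"},
            {"id": None, "name": "Bob"},  # should be dropped
        ]
        assert clean_records(raw) == [{"id": 1, "name": "alice"}]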

