Industrializing Data Science, Beating Pipeline Debt and Infrastructure for DS

Hosted by Miguel A. Alvarado
Details

Industrializing Data Science, Analytics and ML via a single platform is the holy grail for data teams. Delivering a data platform that allows users to easily interact with data, do ad-hoc analysis, build pipelines, train ML models, and more is not an easy task.

Delivering such a platform requires two key ingredients:

  1. Robust checks and balances for Data Quality
  2. A flexible, well engineered infrastructure

Industrializing Data Science is all about automating and hiding the things that are tedious and hard, so that data analysis, data pipelines and ML training are easy and repeatable. This allows data teams to focus on the tasks that yield the most value for the products and experiences that they’re trying to deliver. It opens the door for innovation.

This session covers both of these topics with talks by two guest speakers, Abe Gong (Great Expectations) and Neelesh Salian (Stitch Fix).

The session will be split into the following two talks:

Title: Beating Pipeline Debt with Great Expectations
Presenter: Abe Gong

Abstract:
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Without automated tests, data pipelines often become deep stacks of unverified assumptions. Mysterious (and sometimes embarrassing) bugs crop up more and more frequently, and resolving them requires painstaking exploration of upstream data, often leading to frustrating negotiations about data specs across teams.

Great Expectations is an open source Python framework for bringing data pipelines and products under test. Like assertions in traditional Python unit tests, expectations provide a flexible, declarative language for describing expected behavior. Unlike traditional unit tests, Great Expectations applies expectations to data instead of code. Great Expectations makes it easy to set up your testing framework early, capture findings while they’re still fresh, and systematically validate new data against them. It’s the best tool for managing the complexity that inevitably grows within data pipelines.
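To make the idea of applying expectations to data concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations library or its API; the `Expectation` class, the `expect_not_null` and `expect_between` helpers, the `validate` function, and the sample rows are all hypothetical names invented for illustration. It only shows the general pattern: declare expectations once, then validate each new batch of data against them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Expectation:
    """A named, declarative check applied to one column of every row (hypothetical)."""
    name: str
    column: str
    check: Callable[[Any], bool]


def expect_not_null(column: str) -> Expectation:
    # Hypothetical helper: the column must be present and not None.
    return Expectation(f"expect_{column}_not_null", column, lambda v: v is not None)


def expect_between(column: str, low: float, high: float) -> Expectation:
    # Hypothetical helper: the column value must fall within [low, high].
    return Expectation(
        f"expect_{column}_between_{low}_and_{high}",
        column,
        lambda v: v is not None and low <= v <= high,
    )


def validate(rows: List[Row], expectations: List[Expectation]) -> Dict[str, Dict[str, Any]]:
    """Apply every expectation to a batch and report the indexes of failing rows."""
    results: Dict[str, Dict[str, Any]] = {}
    for exp in expectations:
        failing = [i for i, row in enumerate(rows) if not exp.check(row.get(exp.column))]
        results[exp.name] = {"success": not failing, "failing_rows": failing}
    return results


# Declare the suite once, while assumptions about the data are still fresh.
suite = [expect_not_null("user_id"), expect_between("age", 0, 120)]

# Made-up sample batch: row 1 has a missing id, row 2 has an implausible age.
new_batch = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 412},
]

report = validate(new_batch, suite)
for name, result in report.items():
    print(name, result)
```

The point of the sketch is that the checks describe the data rather than the code that produced it, so the same suite can be rerun unchanged whenever a new batch arrives from upstream.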

Title: A compute infrastructure for data scientists
Presenter: Neelesh Salian

Abstract:
Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. This talk offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.

Apache Spark plays an important role in Stitch Fix’s data platform: the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The team running the compute infrastructure aims to understand the data scientists’ needs and make their lives easier, particularly in terms of Spark usability, by building tools that make it easier to get started with Spark and to transition to a daily workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.

In this talk, we look at Stitch Fix’s journey, exploring its Spark setup and in-house tools and how they work in synergy with open source frameworks in a cloud environment. Additional improvements to the infrastructure help persist information for future use and optimization, and we look at how adopting Amazon’s EMRFS has made it easier for us to read from the S3 source.

All-Things-Data Meetup
Online - in real time
44 Tehama St, SF, CA 94105 · San Francisco, CA