Speaker: Daniel Whitenack, Data Scientist, Lead Developer Advocate at Pachyderm, Inc.
Despite the many amazing applications of statistics and machine learning in industry, many attempts at doing "data science" are anything but reproducible. This can be particularly alarming in industries that require processes to be audited or in light of recent government regulations giving users a "right to an explanation" for algorithmic decisions. In this session, I will discuss the importance of reproducibility and data provenance in any data science organization, and I will provide some practical steps to help scientists build reproducible data analyses and maintain integrity in their data science applications. I will also demo a reproducible data science workflow that includes complete provenance explaining the entire process that produced specific results.
Daniel (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world (Datapalooza, DevFest Siberia, GopherCon, and more), teaches data science/engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and is actively helping to organize contributions to various open source data science projects.