
Healthcare Data Engineering & Reproducible Research w/ Tidyverse

Hosted By
Bob W.


NOTE: This is our joint meetup with Data Science KC. If you'd like to attend, RSVP over at: The DS KC RSVP list is the OFFICIAL list of who is going.

We have two frighteningly good talks for this nearly ghoulish evening! Ryan Brush, Cerner Distinguished Engineer, is going to talk about the hard data engineering work that powers an amazing new platform Cerner has built for healthcare data scientists and data analysts. The platform aims to provide consistency, data access, and structure for analysis. Coincidentally, Bob Wakefield will be talking about reproducible research in the data science space and how the Tidyverse can help you get started on the right foot! Don't be caught saying "It worked in dev!" Bob will show us how to avoid common pitfalls.

Healthcare Data Engineering and the Public Cloud

We start with petabytes of noisy, conflicting, incomplete, and complicated healthcare data, and aim for meaningful features for machine learning and other analysis. This talk looks at data engineering techniques to first make sense of complex data, how we have grown a cloud-based architecture to support it, and how we quickly adapt that system for new needs. Ultimately this lands in a scalable user experience powered by Apache Spark, along with a set of feature engineering patterns and a set of domain-specific helper functions.

Reproducible Research with R, The Tidyverse, Notebooks, and Spark

Many of us data science and business analytics practitioners perform research and analysis for decision makers on a regular basis. The deliverable of such analysis is often a PowerPoint presentation and/or a model that needs to be productionalized. The code used to produce the analysis also needs to be considered a deliverable.

Many of us perform analysis without reproducibility in mind. With the increasing democratization of data, it is becoming more and more important for people who may not have scientific training to be able to create analysis that somebody else can pick up and use to reproduce the results. That, and creating reproducible research is just solid science.

We are going to spend an evening walking through the various tools available to create reproducible research on Big Data. You will get introduced to the Tidyverse of R packages and how to use them. We will discuss the ins and outs of various notebook technologies like Jupyter and Zeppelin. You will have an opportunity to learn how to get up and running with R and Spark and the various options you have to learn on real clusters instead of just your local environment. There will also be a quick introduction to source control and the various options you have around using Git.

The theme of the evening will be “getting started”. We will go over various training resources and show you the optimal path to go from zero to master. Some commentary will be provided around the current state of the job market and intel from the front lines of the data science language wars. This is a large topic and the evening will be fairly dynamic and responsive to the needs of the audience.

Bob Wakefield has spent the better part of 16 years building data systems for many organizations across various industries. He has been running Hadoop in a lab environment for 3 years. He is the principal of Mass Street Analytics, LLC, a boutique data consultancy. Mass Street is a Hortonworks Consultant Partner and Confluent Partner.

In his spare time, he likes to work on an equity investment application that combines various sources of information to automatically arrive at investing decisions. When he is not doing that, you’ll find him flying his A-10 simulator.

4210 Shawnee Mission Parkway · Fairway, KS