R as a Data Transformation Pipeline
Details
By: Ryan Price
ETL (extract, transform, load) use cases have evolved rapidly alongside the needs of businesses and researchers in the twenty-plus years since the Kimball Group first released The Data Warehouse Toolkit and popularized the need for well-designed ETL processes. The demand for flexibility and expressiveness in data transformation has outpaced what many SQL-based RDBMSs offer. Communities around tools like R, Python, and Julia have quickly developed robust solutions to these problems.
R is often regarded as a language and environment purely for statistical analysis. While loosely true, statistical work is built around tabular data, which also makes R a natural candidate for a tabular data transformation engine. "ETL" need not carry the legacy connotation of "force everything into a star schema, in a data warehouse, using SQL". You can use R and its extensive library of packages to design robust, production-ready data pipelines leading to and from nearly any system, applying whatever transformation logic you need in between.
This presentation demonstrates how to design R packages as data transformation and pipeline systems, and how to incorporate business logic into them. We will walk through defining functions that read, clean, manipulate, and write out disparate data sets, and then compose those functions into "main" production scripts.
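As a rough preview of that read, clean, manipulate, and write structure, here is a minimal sketch in base R. Every function name, column name, and the sample data are hypothetical and chosen only for illustration; the talk itself may organize things differently.

```r
# Hypothetical pipeline steps; all names and columns are illustrative.

# Extract: read a CSV file into a data frame.
read_orders <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Clean: drop rows with a missing amount and trim customer names.
clean_orders <- function(orders) {
  orders <- orders[!is.na(orders$amount), , drop = FALSE]
  orders$customer <- trimws(orders$customer)
  orders
}

# Manipulate: compute the total amount per customer.
summarise_orders <- function(orders) {
  totals <- aggregate(amount ~ customer, data = orders, FUN = sum)
  names(totals)[names(totals) == "amount"] <- "total_amount"
  totals
}

# Load: write the result to CSV and return the path invisibly.
write_totals <- function(totals, path) {
  write.csv(totals, path, row.names = FALSE)
  invisible(path)
}

# "Main" script: compose the steps in order.
run_pipeline <- function(in_path, out_path) {
  orders <- read_orders(in_path)
  orders <- clean_orders(orders)
  totals <- summarise_orders(orders)
  write_totals(totals, out_path)
}

# Small demo using temporary files and made-up data.
in_path <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(customer = c(" ann", "bob ", "ann", "cat"),
             amount = c(10, 5, 2.5, NA)),
  in_path, row.names = FALSE
)
run_pipeline(in_path, out_path)
result <- read.csv(out_path, stringsAsFactors = FALSE)
```

Keeping each step as a small function that takes and returns a data frame makes the steps easy to test on their own, and leaves the "main" function as little more than a list of calls.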
I will also show you how to wrap your functions around calls to `loggit()`, a JSON logging package I wrote to capture failures in data and business logic validation. The transformations I will be demonstrating will likely rely on the tidyverse suite of packages, but you can just as well design these pipelines using any libraries of your choosing (data.table, base/stats, etc.).
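One way a validation step might report failures through `loggit()` is sketched below. The helper `log_event()`, the function `validate_amounts()`, the business rule, and the sample data are all hypothetical. The sketch assumes `loggit()` accepts a log level and a message as its first two arguments, and it falls back to `message()` when the loggit package is not installed.

```r
# Hypothetical helper: log through loggit when it is installed,
# otherwise fall back to a plain message().
log_event <- function(level, msg) {
  if (requireNamespace("loggit", quietly = TRUE)) {
    loggit::loggit(level, msg)
  } else {
    message(level, ": ", msg)
  }
  invisible(NULL)
}

# Hypothetical business rule: amounts must be present and non-negative.
# Failing rows are logged and then removed from the data.
validate_amounts <- function(orders) {
  bad <- is.na(orders$amount) | orders$amount < 0
  if (any(bad)) {
    log_event("WARN", paste(sum(bad), "order(s) failed amount validation"))
  } else {
    log_event("INFO", "all orders passed amount validation")
  }
  orders[!bad, , drop = FALSE]
}

# Made-up data with one negative and one missing amount.
orders <- data.frame(id = 1:4, amount = c(10, -3, NA, 7))
valid <- validate_amounts(orders)
```

Routing every log call through one small helper keeps the logging backend in a single place, so the validation functions do not need to change if the logging setup does.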
The meetup will be held in the CIC building at 20 South Sarah Street, St. Louis, MO 63108. We will be in the Showroom. The building entrance is at the corner of Forest Park Avenue and Sarah Street. You can find directions at http://stl.cic.us/directions/ (CIC@CET).
We will meet for snacks, set-up, and conversation at 6:00PM. The presentation will start at 6:30PM and will be about 60 minutes long.
