Blazing Fast Data Load and Transform with R's data.table and Pentaho's PDI

Hosted by Ryan B. Harvey

Details

What You'll Hear About

R is a reliable and versatile tool for data munging. Unfortunately, base R data load and transform processes are slow for larger data sets.

To alleviate this problem, the contributors to the data.table package in R have rewritten many data flow tools in C, with dramatic speed gains. Moreover, data.table can often do in one line what would take base R users a page of code. This has come at the cost of a rather demanding syntax.

During this talk, I will try to partially demystify data.table by going over a limited set of basic data.table operations, benchmarking against base R as I go. For even larger data sets, using R for ETL becomes unwieldy, so, time allowing, I will also attempt to demonstrate a few basic uses of Pentaho's PDI toolset.

About Our Speaker

Serban Tanasa is a managing director of Sunstone Science, a new, innovative Business Intelligence and Analytics startup in the DC Metro Area.

A migrant from the world of academia and international development, he is also serving as the director of analytics and research at CSBS, where he is currently building a BI solution from the ground up.

Serban's hard at work bridging the gap between traditional Business Intelligence and the new wave of analytics-heavy methods surrounding Big Data.

Data Engineers DC
GWU, Funger Hall, Room 103
2201 G St. NW · Washington, DC