Blazing Fast Data Load and Transform with R's data.table and Pentaho's PDI


Details
What You'll Hear About
R is a reliable and versatile tool for data munging. Unfortunately, base R data load and transform processes are slow for larger data sets.
To alleviate this problem, the contributors to the data.table package in R have rewritten many data flow tools in C, with dramatic speed gains. Moreover, data.table can often do with one line what base-R users would require a page of code for. This has come at the cost of a rather demanding coding syntax.
During this talk, I will try to partially demystify data.table by going over a limited basic set of data.table operations, benchmarking against base-R as I go. For even larger data.sets, using R for ETL becomes unwieldly, so, time allowing, I will also attempt to demonstrate a few basic uses of Pentaho's PDI toolset.
About Our Speaker
Serban Tanasa is a managing director of Sunstone Science, a new innovative Business Intelligence and Analytics startup in the DC Metro Area.
A migrant from the world of academia and international development, he is also serving as the director of analytics and research at CSBS, where he is currently building a BI solution from the ground up.
Serban's hard at work bridging the gap between traditional Business Intelligence and the new wave of analytics-heavy methods surrounding Big Data.

Blazing Fast Data Load and Transform with R's data.table and Pentaho's PDI