
Details

What You'll Hear About

R is a reliable and versatile tool for data munging. Unfortunately, base R's data loading and transformation routines are slow on larger data sets.

To alleviate this problem, the contributors to the data.table package in R have rewritten many data flow tools in C, with dramatic speed gains. Moreover, data.table can often do in one line what would take base R users a page of code. This has come at the cost of a rather demanding coding syntax.

During this talk, I will try to partially demystify data.table by going over a limited, basic set of data.table operations, benchmarking against base R as I go. For even larger data sets, using R for ETL becomes unwieldy, so, time allowing, I will also attempt to demonstrate a few basic uses of Pentaho's PDI toolset.

About Our Speaker

Serban Tanasa is a managing director of Sunstone Science, a new, innovative Business Intelligence and Analytics startup in the DC Metro Area.

A migrant from the world of academia and international development, he is also serving as the director of analytics and research at CSBS, where he is currently building a BI solution from the ground up.

Serban's hard at work bridging the gap between traditional Business Intelligence and the new wave of analytics-heavy methods surrounding Big Data.
