PL/R and PivotalR in MPP databases, future of PivotalR w/ Hai Qian
6:00-6:30 pm Food and networking
6:30-8:00 pm Talk and Q&A
8:00-8:30 pm Wind down
In recent years, big data has become an important research topic and a very real problem in industry. The amount of data that we need to process is exploding, and the ability to analyze big data has become a key factor in competition. Big data sets do not fit into a single computer’s memory, and processing them sequentially would be very slow. Yet most contributed R packages are still strictly sequential, run on a single machine, and are restricted to small data sets that can be loaded into memory. As computing shifts irreversibly to parallel architectures and big data, and as many big data sets are stored in databases, the R community risks becoming irrelevant.
Here we will introduce how to run machine learning algorithms on these data sets using R.
PL/R is an extension to PostgreSQL databases that lets the user send R scripts directly to the database for execution. It has some limitations, which PivotalR overcomes. PivotalR is an R package that provides a front end to PostgreSQL, to all PostgreSQL-like databases, or to HDFS. PivotalR also provides an R wrapper for MADlib, an open-source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine-learning algorithms for structured and unstructured data, so PivotalR also enables the user to apply machine learning algorithms to big data. In addition, PivotalR adds functionality that does not currently exist in MADlib, such as support for categorical variables.
Hai Qian is a senior software engineer on Pivotal Inc.'s Predictive Analytics Team, working on the development of the in-database machine learning library MADlib.
****The SF Giants have a home game today at 12:45 pm, ending at approximately 3:45 pm, which may make traffic more congested than usual****