Thanks to the folks at O'Reilly and their generous sponsorship, we are pleased to once again hold our BARUG meeting onsite at the Strata Hadoop World conference (http://conferences.oreilly.com/strata/hadoop-big-data-ca). The meeting is open to all: conference registration is not required. However, if you would like to attend the conference, please take advantage of the discount code above.
6:30 Refreshments and networking
7:05 David Smith: R at Microsoft
7:35 Hadley Wickham: Managing Many Models
8:15 Erin LeDell: Model Management: Ensemble Edition
R at Microsoft
Since its acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R and Revolution’s technology into Microsoft data platform products, so companies, developers and data scientists can use it across on-premises, hybrid cloud and Azure public cloud environments. In this talk I will share some progress that has been made at Microsoft on integrating R, give a demonstration of R integrated into the Microsoft stack, and provide some details on what you can expect in the future.
Managing many models
Visualisation alone is not enough to solve most data analysis challenges. The data may be too big or too messy to show in a single plot. In this talk, I'll outline my current thinking about how the synthesis of visualisation, modelling, and data manipulation allows you to effectively explore and understand large and complex datasets.
There are three key ideas:
1. Using tidyr to make a nested data frame, where one column is a list of data frames
2. Using purrr to use functional programming tools instead of writing for loops
3. Visualising models by converting them to tidy data with broom, by David Robinson
This work is embedded in R so I'll not only talk about the ideas, but show concrete code for working with large sets of models. You'll see how you can combine the dplyr and purrr packages to fit many models, then use tidyr and broom to convert to tidy data which can be visualised with ggplot2.
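The workflow described above can be sketched in a few lines of R. The example below is a minimal illustration (not code from the talk) using the built-in `mtcars` dataset: nest the data by group with tidyr, fit one model per group with purrr, and tidy the coefficients with broom.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Nest mtcars by cylinder count: one row per group, with the
# group's data stored in a list-column.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # Fit a linear model to each nested data frame
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Convert each model to a tidy data frame of coefficients
    coefs = map(model, tidy)
  )

# Unnest back to one row per coefficient, ready for ggplot2
by_cyl %>%
  select(cyl, coefs) %>%
  unnest(coefs)
```

The key idea is that list-columns let a data frame hold the data, the fitted model, and the tidy summary for each group side by side, so the usual dplyr verbs apply to models just as they do to raw data.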
Model Management: Ensemble Edition
Keying off Hadley's model management theme, I will give a short presentation on the "ensemble approach" to machine learning model management. Rather than selecting a single best model, we will explore how to combine the predictive power of many models to increase the overall performance of an estimator.
Model stacking, also called "Super Learning", is an ensemble method that learns the optimal combination of a set of base learners using a secondary learning process called metalearning. Unlike other ensemble techniques, such as the Random Forest algorithm, the stacking algorithm thrives on a diverse set of strong base learners. We will discuss techniques for generating collections of models that perform well in a stacked ensemble. Code to quickly generate sets of diverse base models and code for training stacked ensembles in R will be provided.
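To make the stacking procedure concrete, here is a minimal hand-rolled sketch in base R (illustrative only, not the code from the talk): generate cross-validated predictions from two diverse base learners, then fit a metalearner on those predictions.

```r
library(rpart)  # regression trees, included with a standard R installation

set.seed(1)
n <- nrow(mtcars)
folds <- sample(rep(1:5, length.out = n))

# Level-one data: out-of-fold predictions from each base learner,
# so the metalearner never sees predictions on training data.
Z <- matrix(NA, n, 2, dimnames = list(NULL, c("lm", "tree")))
for (k in 1:5) {
  train <- mtcars[folds != k, ]
  test  <- mtcars[folds == k, ]
  Z[folds == k, "lm"]   <- predict(lm(mpg ~ ., data = train), test)
  Z[folds == k, "tree"] <- predict(rpart(mpg ~ ., data = train), test)
}

# Metalearner: learn how to weight the base learners' predictions
meta <- lm(mtcars$mpg ~ Z)
coef(meta)
```

In practice the base library would contain many more (and more diverse) algorithms, and the metalearner is often constrained (e.g. non-negative weights); the point of the sketch is only the two-level structure of cross-validated base predictions feeding a secondary learner.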