6:30PM - Pizza and networking
6:55 - Announcements
7:00 - Ankur Gupta: Quirks of R
7:15 - Dex Groves: Winning 24-hr Predictive Modeling Competitions
7:30 - Leland Wilkinson: Multidimensional outlier detection
7:55 - Sudha Subramanian: The Advantages of Clustering
8:10 - Skye Bender-deMoll: The ndtv (https://mran.revolutionanalytics.com/package/ndtv/) package
8:25 - Derek Damron: The budgetr package
8:40 - Aaron Hoffer: Extending Shiny: Building reactive drag & drop elements
Ankur Gupta is a Quantitative Researcher at The Climate Corporation who uses R for most of his work projects and also writes about R on his blog (http://www.perfectlyrandom.org/2015/06/16/never-trust-the-row-names-of-a-dataframe-in-R/).
Quirks of R
R is a very forgiving programming language. When the user executes an ambiguous piece of code, R makes assumptions and returns some output instead of throwing an error. In most cases, the assumptions R makes are reasonable and the corresponding output is what the user expects. Occasionally, however, R returns output that the user did not intend, producing silent mistakes that may go unnoticed. In this talk, I will discuss two such quirks of R -- non-standard evaluation and vector indexing. Using small examples, I will show (1) how pervasive these issues are in R, and (2) some ways to avoid them.
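The talk's exact examples are not given in the abstract; the snippets below are illustrative base-R assumptions in the same spirit, showing indexing and `$` partial matching (a non-standard-evaluation corner) silently succeeding where an error might be expected:

```r
# Quirk 1: vector indexing never errors on out-of-range or recycled indices.
x <- c(10, 20, 30)
x[0]               # numeric(0): an empty vector, not an error
x[5]               # NA: silently reads past the end
x[c(TRUE, FALSE)]  # logical index is recycled: elements 1 and 3 (10, 30)

# Quirk 2: `$` on a data frame partially matches column names silently.
df <- data.frame(value = 1:3)
df$val             # returns the `value` column even though `val` does not exist
df[["val"]]        # NULL: [[ ]] matches exactly, so the typo is easier to catch
```

Using `[[` with a quoted name (or enabling `options(warnPartialMatchDollar = TRUE)`) is one way to make such mistakes loud instead of silent.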
Dex Groves is a data scientist at Allstate, where internal modeling competitions are run quarterly with a strict time limit of 24 hours.
Winning 24-hr Predictive Modeling Competitions
The competitions are a predictive battleground for Allstate's 170+ data heavyweights to put their code where their mouth is. Competing in this format is vastly different from normal projects or Kaggles.
What types of models make sense?
How best can you build a nimble codebase with multiple collaborators?
What is the most effective way to validate?
All this and more will be revealed in a lightning talk on lightning competitions.
Leland Wilkinson is a statistician and computer scientist at H2O.ai, author of The Grammar of Graphics (https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), and Adjunct Professor of Computer Science at the University of Illinois at Chicago.
Multidimensional outlier detection with the HDoutliers (https://mran.revolutionanalytics.com/package/HDoutliers/) package
Outliers have more than two centuries' history in the field of statistics. Recently, they have become a focal topic because of their relevance to terrorism, network intrusions, financial fraud, and other areas where rare events are critical to understanding a process. A new algorithm, called HDoutliers, is unique for a) dealing with a mixture of categorical and continuous variables, b) dealing with the curse of dimensionality (many columns of data), c) dealing with many rows of data, d) dealing with outliers that mask other outliers, and e) dealing consistently with unidimensional and multidimensional datasets. Unlike ad hoc methods found in many machine learning papers, HDoutliers is based on a distributional model that allows outliers to be tagged with a probability. An R package called HDoutliers is available on CRAN.
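The HDoutliers package implements the full algorithm; as a package-free sketch of just the "distributional model" idea the abstract mentions (tagging each point with a probability), the toy below uses squared Mahalanobis distances against a chi-square tail. This is an illustration of the concept, not the HDoutliers algorithm itself:

```r
# Tag outliers via a distributional model: each row gets a tail probability.
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # 100 well-behaved 2-D points
X <- rbind(X, c(8, 8))              # plant one gross outlier (row 101)

d2 <- mahalanobis(X, colMeans(X), cov(X))           # squared distances
p  <- pchisq(d2, df = ncol(X), lower.tail = FALSE)  # tail probability per row

outliers <- which(p < 0.001)  # tag points whose tail probability is tiny
outliers                      # includes row 101, the planted outlier
```

Because each point carries a probability rather than a bare yes/no flag, the cutoff (0.001 here) is an explicit, tunable statement about how rare a point must be to count as an outlier.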
Sudha Subramanian is a Sr. Data Analyst at DCS Information Systems, with over 18 years of professional experience performing extensive data analyses and building analytics-based solutions. Of all facets of Data Science and Predictive Analytics, data exploration and identifying patterns fascinate her the most.
The Advantages of Clustering
This talk is based on a paper that examines how the results of combining models built on subpopulations (MSP) compare to those of a model built on the entire population (MEP). Candidates with similar characteristics are grouped together by applying clustering techniques to the datasets. Patterns within each subpopulation help enhance the accuracy of the individual models. An accuracy increase of 1-3% is substantial when modeling big data, and this increase is achieved without over-fitting to the training set.
This technique of predicting within subpopulations of the data offers a deeper understanding of the underlying patterns in the data. The approach also provides the flexibility to apply a different algorithm to each subpopulation, giving it an edge over a single model built on the entire population.
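The MSP-vs-MEP comparison can be sketched on synthetic data with k-means plus per-cluster linear models; the data, cluster count, and models below are illustrative assumptions, not the paper's actual setup:

```r
# MSP vs. MEP sketch: cluster first, then fit one model per cluster, and
# compare against a single model on the entire population.
set.seed(42)
n <- 300
g <- sample(1:2, n, replace = TRUE)              # two hidden subpopulations
x <- rnorm(n)
y <- ifelse(g == 1, 2 + 3 * x, -2 - 3 * x) + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

# MEP: a single linear model on the entire population
mep_rss <- sum(resid(lm(y ~ x, data = d))^2)

# MSP: cluster the observations, then fit one model per cluster
k <- kmeans(scale(d), centers = 2, nstart = 10)
msp_rss <- sum(sapply(split(d, k$cluster),
                      function(s) sum(resid(lm(y ~ x, data = s))^2)))

msp_rss < mep_rss  # TRUE: per-cluster fits reduce training error
```

Note that per-cluster models can only reduce *training* error, since the global coefficients are always an available choice within each cluster; the paper's 1-3% claim concerns accuracy without over-fitting, so any per-cluster gain should be confirmed on held-out data.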
The ndtv package
The Network Dynamic Temporal Visualization (ndtv) package provides tools for visualizing changes in network structure and attributes over time. It takes encoded longitudinal network information as its input and outputs animated movies. It is part of the statnet (http://statnet.csde.washington.edu/) suite of packages. For more information, see Skye's full-length tutorial (http://statnet.csde.washington.edu/workshops/SUNBELT/current/ndtv/ndtv_workshop.html).
Derek Damron works as a data scientist at Allstate. When he isn't working he enjoys video games, cats, and shag carpet.
The budgetr package
Budgeting, for better or worse, becomes a necessity by adulthood. Unfortunately, budgeting is seldom taught to us growing up, so it can be difficult to know how to get started. The budgetr package is designed to give you a simple framework for creating budgets, visualizing them, and updating them as reality changes. In this short talk we'll discuss how budgetr is structured and walk through some examples of how you can use it to easily create, visualize, and update your own budgets.
Extending Shiny: Building reactive drag & drop elements