6:30 PM: Pizza and networking
7:05: Robert Horton - AUC Meets the Wilcoxon-Mann-Whitney U-Statistic
7:20: Rupal Agrawal - Contributing to OpenElections (Open Data) Using R
7:35: Nick Pelikan & Persephone Tsebelis - Automated PowerPoint Generation using the ReporteRs package
7:50: John Mount - Standard versus non-standard calling conventions in R: examples with dplyr and replyr
8:05: Earl Hubbell - Personal Genomics with R and Bioconductor
8:20: Ali-Kazim Zaidi - sparklyr: A tidy R interface for Apache Spark
Robert Horton
AUC Meets the Wilcoxon-Mann-Whitney U-Statistic
Rupal Agrawal
Contributing to OpenElections (Open Data) Using R
In the US, election results are not reported by any single federal agency. Instead, each state and county reports results in a variety of formats (HTML, PDF, CSV), often with very different layouts and varying levels of granularity. Besides the presidential election, there are many others: primaries, midterms, and special elections for various offices (US Senate, US House, state legislatures, governor, etc.).
There is no freely available, comprehensive source of official election results that people can use for analysis. I have been volunteering with OpenElections to help create such a source. I use R to automate some of the tasks involved: web scraping, PDF conversion, and data manipulation to produce the desired outputs in a consistent format. In this lightning talk, using real examples from multiple US states, I will highlight some of the challenges I faced. I will also share some of the R packages I used (RSelenium, pdftools, tabulizer) to help others who wish to volunteer with similar Open Data efforts.
Nick Pelikan & Persephone Tsebelis
Automated PowerPoint Generation using the ReporteRs package
We'd like to discuss automated PowerPoint generation via R using the ReporteRs package (http://davidgohel.github.io/ReporteRs/). ReporteRs is a particularly useful package because it allows data scientists to add R plots to PowerPoint presentations as editable vector graphics, enabling annotation and edits by non-technical reviewers. Our presentation will cover the capabilities of the package, common problems, and a demonstration using code we've developed to automate presentations for YouGov teams.
John Mount
Standard versus non-standard calling conventions in R: examples with dplyr and replyr
I will discuss the ease and utility of using standard or parametric calling conventions (variable names as strings or as values inside other variables) versus non-standard calling conventions (variable names captured directly from user expressions) in R. Non-standard name capture is more convenient for interactive use when you are working over only one or two variables and you know all of the variable names ahead of time. dplyr, ggplot2, and many other packages prefer non-standard calling conventions. However, non-standard name capture is awkward to script or program over, and it is a great tragedy when you cannot automate a fully virtual process (such as a common transform or analysis). Given the asymmetry in the cost of notation conversion, I argue we should prefer standard or parametric interfaces. I will quickly define terms, demonstrate the problem, exhibit some packages that emphasize standard name capture (sigr, WVPlots), and announce a package that allows for more convenient programming over packages that prefer non-standard notation (replyr).
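The distinction between the two conventions can be seen in a few lines of base R. This is only a minimal sketch, not material from the talk: it uses base R's subset() as a stand-in for a non-standard interface rather than dplyr or replyr, and the data frame, its column names, and the helper filter_above() are all made up for illustration.

```r
# Toy data; the column names are hypothetical and chosen for this sketch only.
d <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 10))

# Non-standard convention: the column name `value` is captured
# directly from the expression typed by the user.
big_nse <- subset(d, value > 1.5)

# Standard (parametric) convention: the column name arrives as a
# string held in a variable, so it is easy to program over.
# filter_above() is a hypothetical helper, not part of any package.
filter_above <- function(data, column, threshold) {
  data[data[[column]] > threshold, , drop = FALSE]
}

col_name <- "value"
big_se <- filter_above(d, col_name, 1.5)

# Both routes select the same rows.
stopifnot(identical(big_nse$value, big_se$value))
```

The parametric version can be called in a loop over a vector of column names without any extra machinery, which is the kind of automation the abstract argues for.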
Earl Hubbell
Personal Genomics with R and Bioconductor
Ali-Kazim Zaidi
sparklyr: A tidy R interface for Apache Spark
The sparklyr package provides a tidy R interface for Apache Spark. It's never been easier for an R programmer to develop distributed statistical learning algorithms and deploy them on a distributed cluster. In this talk, we will see how to use sparklyr to do tidy data analysis on large datasets, train and validate machine learning algorithms using SparkML, and dive into sparklyr's extension mechanism to invoke Spark's GraphFrames package for network analysis.