• "Official" June 2019 BARUG Meetup

    GRAIL

    Agenda:
    6:30 - Pizza and Networking
    7:00 - Welcome to GRAIL
    7:10 - Other announcements
    7:15 - Alice Milivinti: "Bayesian Regression Modeling in R with Stan"
    7:40 - Siddhartha Bagaria (GRAIL): "Levels of reproducible analysis"
    8:10 - Norm Matloff: "What about Time Series?"

    ######
    Alice Milivinti
    "Bayesian Regression Modeling in R with Stan"
    Abstract: Bayesian analysis in R has never been so easy. The rstan package provides the R interface to the Stan C++ library for Bayesian estimation. Alice will give an overview and compare the brms and rstanarm packages, which let users specify models in R syntax without writing any C++ code.

    ######
    Norm Matloff
    "What about Time Series?"
    Abstract: In past BARUG talks, I've presented an alternative to neural networks and a new method for dealing with missing values. In this talk, I apply these methods to time series data.
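    The rstanarm interface mentioned above reuses R's ordinary formula syntax; a minimal sketch (assuming rstanarm is installed, with made-up simulated data, not an example from the talk):

```r
# Minimal Bayesian regression sketch with rstanarm (hypothetical data).
# stan_glm() mirrors glm()'s formula interface; Stan does the sampling.
library(rstanarm)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)

fit <- stan_glm(y ~ x, data = d,
                chains = 1, iter = 1000, refresh = 0)  # keep the run light
print(fit)                            # posterior medians and MAD-SDs
posterior_interval(fit, prob = 0.9)   # 90% credible intervals
```

    The brms package covered in the same talk exposes a very similar formula-first workflow but compiles a model on the fly rather than using rstanarm's precompiled ones.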

  • "Official" May 2019 BARUG Meetup

    Instacart

    Dear Attendees: Building security requires your full name for check-in prior to the event. Upon check-in, you will also be asked to sign an NDA.
    Agenda:
    6:30: Food, drink and networking
    7:00: Welcome to Instacart
    7:05: Announcements
    7:10: Robert Horton (lightning talk): Finding the acronym definitions in unstructured text
    7:25: Jingjie Xiao: A/B testing for logistics with R
    7:50: Peter Li (lightning talk): keeping score (:-/) with 'packageRank'
    8:05: Steve Dahlke: Applied Optimization in R - Energy Markets and Policy

    *****************
    Robert Horton
    Finding the acronym definitions in unstructured text
    This will be an introductory presentation focusing on how to use parentheses and backreferences in regular expressions. I programmatically construct a set of regular expressions matching acronym definitions of different lengths, and collect all the results into a table.

    ******************
    Jingjie Xiao, Senior Data Scientist, Instacart
    A/B testing for logistics with R
    Instacart's fulfillment dispatching engine is constantly changing: storms, unexpected traffic, shift cancellations, and more can all affect a system that is dynamic, interdependent, and noisy. In this talk, Instacart Senior Data Scientist Jingjie Xiao will walk you through how controlled experiments and multivariate regression are used to continuously improve the grocery delivery engine at Instacart.

    ******************
    Peter Li
    keeping score (:-/) with 'packageRank'
    Building on the efforts of the 'cranlogs' package, 'packageRank' puts the raw counts of package downloads into greater perspective. Numerically, it computes a package's rank percentile among all downloads. Visually, it locates a package's position in the distribution of downloads. Currently, these snapshots are available for individual days (cross-sectionally) and, in more limited fashion, over time (longitudinally). Along the way, I'll show how I use the 'memoise' package to cache the downloading of log files.
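    The capture-group approach Robert describes for acronym definitions can be sketched in base R. The pattern and sample text below are my own illustration, not his actual code, and this sketch only captures the pieces (a fuller version would use backreferences to check that the parenthesized initials actually match the captured words):

```r
# Sketch: find "Long Form (LF)" style acronym definitions in text using
# capture groups in base R (illustrative pattern, three-word case only).
text <- "We use the General Linear Model (GLM) and Principal Component Analysis (PCA) throughout."

# Three capitalized words followed by a parenthesized three-letter acronym.
pat <- "([A-Z][a-z]+) ([A-Z][a-z]+) ([A-Z][a-z]+) \\(([A-Z]{3})\\)"
m <- regmatches(text, gregexpr(pat, text))[[1]]

# Split each match into its long form and acronym, and collect a table.
defs <- do.call(rbind, regmatches(m, regexec(pat, m)))
acronyms <- data.frame(long  = apply(defs[, 2:4], 1, paste, collapse = " "),
                       short = defs[, 5])
acronyms
```

    Generating one such pattern per acronym length (two words, three words, and so on) and row-binding the results is the "set of regular expressions" idea from the abstract.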
    *******************
    Steve Dahlke is a PhD candidate in applied economics at the Colorado School of Mines
    Applied Optimization in R - Energy Markets and Policy
    Steve is building an electricity market model to study the impact of carbon policy on the U.S. electricity sector. In this talk he will explain the method for setting up an optimization model in R to study a particular energy policy and show how to use the lpSolve package to solve it.
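    Setting up a linear program with lpSolve follows the pattern Steve describes; here is a generic sketch (the toy objective and constraints are mine, not from his energy model, and lpSolve is assumed to be installed):

```r
# Toy linear program with lpSolve: maximize 3x + 2y subject to
#   x + y <= 4,  x + 3y <= 6  (non-negativity is implicit in lp()).
library(lpSolve)

sol <- lp(direction    = "max",
          objective.in = c(3, 2),
          const.mat    = rbind(c(1, 1),
                               c(1, 3)),
          const.dir    = c("<=", "<="),
          const.rhs    = c(4, 6))

sol$solution   # optimal x and y
sol$objval     # optimal objective value
```

    A real market model differs only in scale: the objective becomes system cost, and the constraint matrix encodes generation limits, demand balance, and policy caps.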

  • "Official" April 2019 BARUG Meetup

    23andMe

    Agenda:
    6:30 Pizza and Networking
    7:00 Fah Sathirapongsasuti: Welcome to 23andMe
    7:05 Announcements
    7:10 Robert Gentleman: ALTREP: Alternate representations of basic R objects
    7:25 Hui Lin: Use blogdown + Netlify to build and deploy your website
    7:45 Robert Horton: Clustering of categorical features using community detection on directed co-occurrence graphs
    8:15 Bri Cameron: Exploring activity patterns with HealthKit and R

    ************************
    Hui Lin
    Use blogdown + Netlify to build and deploy your website
    This presentation will show you how to use the R package 'blogdown' and Netlify to build and deploy your website. We will go through the whole process, from creating a simple site to having the website up and running. We will also give a brief overview of the different ways to build and deploy a website using Netlify. It will be hands-on: R code examples and website templates are available on GitHub for participants to follow along on their laptops.

    *************************
    Robert Horton
    Clustering of categorical features using community detection on directed co-occurrence graphs
    It is often convenient to characterize items with sets of labels; for example, articles in a database might be tagged with several keywords per article. These tags can be useful for indexing the items (as in keyword search), but they may be difficult to use in machine learning, particularly if there are a large number of different tags and some of them are more or less rare. I will demonstrate several examples of constructing conditional co-occurrence graphs from tagset types of data, and show how to cluster the items (or tags) using the Louvain community detection algorithm. These clusters may be useful as low-cardinality features in machine learning.

    ************************
    Dr. Bri Cameron is a biostatistician in Data Collection at 23andMe
    Exploring activity patterns with HealthKit and R
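    Louvain clustering of a tag co-occurrence graph can be sketched with igraph. The edge list below is invented, and note one simplification: igraph's cluster_louvain() works on undirected graphs, so the direction of the co-occurrence edges is collapsed here:

```r
# Sketch: community detection on a tiny tag co-occurrence graph with igraph.
library(igraph)

el <- matrix(c("cats",    "pets",
               "dogs",    "pets",
               "r",       "stats",
               "ggplot2", "stats",
               "ggplot2", "r"),
             ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = FALSE)

comm <- cluster_louvain(g)   # Louvain modularity optimization
membership(comm)             # which cluster each tag landed in
```

    The cluster id per tag is exactly the kind of low-cardinality feature the abstract mentions: instead of thousands of sparse tag indicators, each item gets a handful of cluster memberships.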

  • "Official" February 2019 BARUG Meetup

    Microsoft Reactor

    Image from: https://bit.ly/2SyjaNm
    Agenda:
    6:30 - Pizza and Networking
    7:00 - Announcements
    7:05 - Joseph Rickert: Searching for R packages (lightning talk)
    7:20 - Ali Zaidi, Bob Horton, and Mario Inchiosa: Using transfer learning, active learning, and hyperparameter tuning to train high-performance text classifiers with limited labels
    8:05 - Erin LeDell: Scalable Automatic Machine Learning with H2O

    ####################
    Searching for R packages
    Joseph Rickert
    Finding the right R package to do something of interest is one of the most vexing problems for new R users. I will highlight a few R packages that are useful for searching for other packages, and describe a simple strategy for using them.

    ####################
    Using transfer learning, active learning, and hyperparameter tuning to train high-performance text classifiers with limited labels
    Ali Zaidi, Bob Horton, and Mario Inchiosa
    The labels given to training examples are one of the main ways in which human judgement can be represented in patterns that computers can learn. Unfortunately, even when data is plentiful, labels suitable for supervised machine learning may not be. If determining a label requires significant effort or expertise, collecting labels can be a slow or expensive process. We will show three approaches that can be integrated to help you make the best use of a limited labeling budget: 1) use transfer learning from complex language models trained on large datasets to generate features that can be used by simple classifiers capable of learning from small datasets; 2) employ these classifiers in an active learning process to judiciously select the most useful cases to label so you can iteratively build better models; and 3) optimize the whole process by careful tuning of hyperparameters.

    #######################
    Scalable Automatic Machine Learning with H2O
    Erin LeDell
    The focus of this presentation is scalable and automatic machine learning using the H2O machine learning platform. We will provide a brief overview of the field of Automatic Machine Learning, followed by a detailed look inside H2O's AutoML algorithm, available in the "h2o" R package. H2O AutoML provides an easy-to-use interface which automates data pre-processing, training, and tuning of a large selection of candidate models (including multiple stacked ensemble models for superior model performance), and thanks to the distributed nature of the H2O platform, it can scale to very large datasets. The result of an AutoML run is a "leaderboard" of H2O machine learning models which can easily be exported for use in production. R code examples are available on GitHub for participants to follow along on their laptops.
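    The leaderboard workflow described above can be sketched in a few lines (assuming the 'h2o' package is installed and a local H2O cluster can start; the dataset is R's built-in iris, standing in for real data):

```r
# Sketch of an H2O AutoML run producing a model leaderboard.
library(h2o)
h2o.init()                      # start a local H2O cluster

train <- as.h2o(iris)
aml <- h2o.automl(y = "Species", training_frame = train,
                  max_models = 5, seed = 1)

aml@leaderboard                 # ranked candidate models
aml@leader                      # best model, exportable for production
```

    In practice you would also pass a holdout or cross-validation scheme and a larger max_models / max_runtime_secs budget.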

  • January 2019 "Official" BARUG Meetup

    Intuit Building 20

    Agenda:
    6:30 - Pizza and Networking
    7:00 - Announcements
    7:05 - Andrew Mangano: Introduction to the reticulate package (lightning talk)
    7:20 - Guoli Sun: Code as a designer, a glimpse of R Shiny applications
    7:40 - Ward Greunke: Armchair Finance: A layman's guide to understanding finance, options trading and cryptocurrencies
    8:00 - Pete Mohanty: Declaring and Diagnosing Research Designs

    ####################
    Andrew Mangano
    Introduction to the reticulate package
    The reticulate package bridges the R/Python language divide between data science teams and allows for previously unachievable collaboration. I will show how this innovative package has allowed the data science team at Safeway eCommerce to work smarter and faster, as well as how RStudio can now be used as the single platform tool for a data science leader.

    ####################
    Guoli Sun
    Code as a designer, a glimpse of R Shiny applications
    R Shiny gives data scientists the flexibility not only to develop models but also to offer users interactive applications. This talk will include an introduction to Shiny, Shiny dashboards and Shiny Server, followed by a demo. The demo application uses R packages such as reticulate, ggplot2, plotly and DT.

    ####################
    Ward Greunke
    Armchair Finance: A layman's guide to understanding finance, options trading and cryptocurrencies
    After a brief introduction to options, I will introduce this financial "what if" application that helps users understand the payoff of different options depending on the strike price and the cost of the option. With Armchair Finance (http://armchairfinance.blogspot.com/2018/04/visualizing-options.html) you can try out different option strategies.

    #####################
    Pete Mohanty
    Declaring and Diagnosing Research Designs
    Researchers and analysts routinely face two research design problems: first, we need to select a high-quality design, given resource constraints. Even when data are cheap, we may wish to know how to learn about an effect most efficiently, which could save time or minimize liability. Second, we need to convince interested parties of the design's high quality. However, quantities like power and bias can be difficult to calculate outside of the most basic research designs. In this talk I introduce DeclareDesign (DeclareDesign.org), a new suite of R packages which allow researchers to flexibly declare the key dynamics of the data generating process, to work with existing or simulated data, and to compare estimators. DeclareDesign offers a "grammar of design and diagnosis" which is compatible with many R libraries. DeclareDesign enables researchers to confidently choose the best way to answer the question at hand.
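    The R/Python bridge from Andrew's reticulate talk looks like this in its simplest form (assuming a Python installation is available on the system; the snippet itself is my illustration, not his demo):

```r
# Sketch: calling Python from R with reticulate.
library(reticulate)

math <- import("math")    # a Python module becomes an ordinary R object
math$sqrt(16)             # call Python's math.sqrt from R

py_run_string("squares = [i * i for i in range(5)]")
py$squares                # objects created in Python land back in R
```

    The same mechanism is what lets mixed R/Python teams share one codebase: pandas DataFrames and NumPy arrays convert to their R equivalents automatically.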

  • "Official" December 2018 BARUG Meetup

    GRAIL

    Agenda:
    6:30 - Pizza and Networking
    7:00 - Announcements
    7:05 - Anirudh Acharya: An Introduction to the MXNet-R package (lightning talk)
    7:20 - Emma Rudie: RStudio Connect at GRAIL
    7:40 - Peter Li: Have you tessellated today?
    8:05 - Dan Putler: Some Intuition on Why Collinearity Can Be "Bad", and When It Should Be a Concern (lightning talk)
    8:20 - Rami Krispin: Introduction to the TSstudio package

    ##############
    Anirudh Acharya
    Introduction to the MXNet-R package
    Apache (Incubating) MXNet (https://github.com/apache/incubator-mxnet) is a modern open-source deep learning framework used to train and deploy deep neural networks. It is scalable and supports multiple programming languages including C++, Python, Julia, R, Scala, and Perl. I will briefly introduce the MXNet-R package and run an example.

    #############
    Peter Li
    Have you tessellated today?
    What is Voronoi (aka Dirichlet) tessellation and why would you ever want to use it? Imagine you want to compute and visualize the "neighborhoods" (i.e., catchment or service areas) of the coffee shops in your town. Using just their locations and some measure of proximity, you can use Voronoi tessellation to carve up the map of your town into "tiles" that describe those neighborhoods. I'll show how to do this using the 'deldir' package. I'll conclude by showing how to enhance your analysis by color coding tiles and by counting elements within those tiles (e.g., estimating the number of customers).

    ##############
    Dan Putler
    Some Intuition on Why Collinearity Can Be "Bad", and When It Should Be a Concern
    Many people who build predictive models know that predictor collinearity (also called multicollinearity) is "bad", but they often don't have the intuition as to what makes it bad, and when it will be relatively more or less bad. The goal of this talk is to provide some intuition behind the consequences of predictor collinearity, and to offer some rules of thumb about when, and when not, to be concerned about it. Spoiler alert: there are many situations where it is of little concern.

    ##############
    Rami Krispin
    Introduction to the TSstudio package
    The TSstudio package provides a set of functions for time series analysis and forecasting, including interactive data visualization tools, utility functions for preprocessing time series data, and backtesting applications for forecasting models from the forecast, forecastHybrid and bsts packages. My talk will focus on the main functionality of the package (seasonal plots, forecasting using backtesting, etc.)
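    The core of the collinearity intuition shows up in a few lines of base R: adding a near-duplicate predictor inflates the standard error of a coefficient even though the overall fit barely changes. (Simulated data of my own, not material from the talk.)

```r
# Collinearity demo: x2 is a near-copy of x1, so including both makes
# the model unsure how to split credit between them, inflating SEs.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)

se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(se_alone = se_alone, se_both = se_both)  # se_both is far larger
```

    This is also why collinearity is often of little concern for pure prediction: the fitted values are nearly unchanged; only the individual coefficient estimates become unstable.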

  • "Official" November 2018 BARUG Meetup

    Databricks

    Agenda:
    6:30 PM - Pizza and Networking
    7:00 - Announcements
    7:05 - Anirudh Acharya: MXNet-R (lightning talk)
    7:20 - Tomas Nykodym: MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    7:45 - Javier Luraschi: Introduction to MLflow with R
    8:10 - Norm Matloff: polyanNA, a Novel, Prediction-Oriented R Package for Missing Values

    ##############
    Anirudh Acharya
    Introduction to the MXNet-R package
    Apache (Incubating) MXNet (https://github.com/apache/incubator-mxnet) is a modern open-source deep learning framework used to train and deploy deep neural networks. It is scalable and supports multiple programming languages including C++, Python, Julia, R, Scala, and Perl. I will briefly introduce the MXNet-R package and run an example.

    ################
    Tomas Nykodym
    MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    ML development brings many new complexities beyond the traditional software development lifecycle, including evaluating multiple algorithms and parameters, setting up reproducible workflows, and integrating distinct systems into production models. In this talk, I will present MLflow, a new open source project from Databricks that provides an open ML platform where organizations can use the ML libraries and development tools of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
    Tomas Nykodym is an ML/AI Platform Engineer at Databricks working on MLflow. He spent the last six years working on cutting-edge distributed machine learning projects at H2O.ai and Databricks. His professional interests include distributed computing, applied math and machine learning.

    ###############
    Javier Luraschi
    Introduction to MLflow with R
    This talk will teach you how to use MLflow from R to track model parameters and results, share models with non-R users, and fine-tune models at scale. It will present the installation steps, common workflows and resources available for R. It will also demonstrate using MLflow tracking, projects and models directly from R, as well as reusing R models in MLflow.
    Javier is a Software Engineer at RStudio working on R packages, most notably sparklyr, cloudml, r2d3 and mlflow.

    ##############
    Norm Matloff
    polyanNA, a Novel, Prediction-Oriented R Package for Missing Values
    Though there is a vast literature on techniques for handling missing values, almost all of it is focused on estimation rather than prediction. Here we present a novel approach developed specifically for use in prediction applications, implemented in an R package, 'polyanNA'. It can be used in both parametric and machine learning settings, and is very fast computationally. (Joint work with Pete Mohanty.)
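    The MLflow-from-R tracking workflow Javier describes can be sketched as follows (assuming the 'mlflow' package and its Python backend are installed; the model, parameter and metric names are invented for illustration):

```r
# Sketch: logging a run's parameters and metrics with the mlflow package.
library(mlflow)

with(mlflow_start_run(), {                 # run is ended automatically
  mlflow_log_param("formula", "mpg ~ wt")  # record a model parameter
  model <- lm(mpg ~ wt, data = mtcars)
  rmse  <- sqrt(mean(resid(model)^2))
  mlflow_log_metric("rmse", rmse)          # record a result metric
})
```

    Each run's parameters and metrics then show up in the MLflow tracking UI, which is how results get shared with non-R users.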

  • "Official" October 2018 BARUG Meetup

    Leavey School of Business at Santa Clara University

    Agenda:
    7:00 Announcements
    7:05 Dave Hurst: Sharing Artifacts in a Corporate Environment
    7:20 Peter Li: Introduction to the cholera package
    7:50 Ryan Moran: Bozo Blocking: rapid development of fraud detection models over local entity networks
    8:20 Michael Kevane: Things that go wrong in R exercises for undergraduates: scatterplots, wrangling, maps, sentiment analysis, RMarkdown

    ==========================
    Peter Li
    Introduction to the 'cholera' package
    John Snow's map of the 1854 cholera outbreak in London's Soho is a classic example of data visualization. For Snow, the map helped to support his two then contested, if not controversial, claims: that cholera is a waterborne disease and that the water pump on Broad Street was the source of the outbreak. To evaluate whether the map actually supports such claims, I created the 'cholera' R package (on CRAN and GitHub). The package allows you to explore, analyze and test the data embedded in the map. It does so by computing and plotting a pump's neighborhood: the set of locations defined by their "proximity" to a pump. The talk will focus on the tools and techniques used to compute and visualize these "pump neighborhoods" and will include examples (all in R) of everything from orthogonal projection to more specialized topics like Voronoi tessellation ('deldir'), spatial data analysis ('sp'), graph/network analysis ('igraph'), generic functions (e.g., S3 generic functions), and embarrassingly parallel problems ('parallel').

    ===============================
    Ryan Moran, Data + Fraud, Bandcamp.com
    Bozo Blocking: rapid development of fraud detection models over local entity networks
    Fraud mitigation is a demanding challenge, particularly for a small organization like Bandcamp.com. In order to enhance defenses against an ever-churning tide of brilliant villains, we developed a platform in R to reduce the labor, time, and resources required to prepare, fit, and compare complex regression models. In this talk, we'll first cover several core aspects of system design, including only-as-necessary parallel predictor computation, persistent high-performance MySQL caching, and automated parameter optimization. For the finale, we'll explore the platform's most powerful capability: "local" network summary predictors, particularly those computed over the outputs of other predictive models.
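    The "pump neighborhood" computation rests on Voronoi tessellation, which 'deldir' makes a one-liner. This is a generic sketch with random points standing in for pump locations, not code from the 'cholera' package:

```r
# Voronoi tessellation with deldir: each tile is the "neighborhood" of
# one input point (random stand-ins for pump locations).
library(deldir)

set.seed(1)
x <- runif(8)
y <- runif(8)

dd    <- deldir(x, y)   # Delaunay triangulation + Voronoi tessellation
tiles <- tile.list(dd)  # one polygon per input point
length(tiles)           # 8 neighborhoods
plot(tiles)             # draw the tiles
```

    Counting data points that fall inside each tile (e.g., fatalities per pump neighborhood) is then a point-in-polygon test per tile.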

  • "Official" September 2018 BARUG meetup

    Hacker Dojo

    Agenda:
    6:30 PM - Pizza and networking
    6:55 - Announcements
    7:00 - Isaac Faber: MatrixDS, A Data Science Workbench
    7:15 - Anqi Fu: Disciplined Convex Optimization with CVXR
    7:45 - Jeffrey Wong: Scalable Causal Inference using R and Rcpp
    8:15 - Neil Gunther: Why Are There No Giants? The Data Analytics of Scalability

    #################
    Isaac Faber
    MatrixDS, A Data Science Workbench

    ###################
    Anqi Fu
    Disciplined Convex Optimization with CVXR
    Abstract: CVXR is an R package that provides an object-oriented modeling language for convex optimization, similar to CVX, CVXPY, YALMIP, and Convex.jl. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and a set of constraints by combining constants, variables, and parameters using a library of functions with known mathematical properties. CVXR then applies disciplined convex programming (DCP) to verify the problem's convexity. Once verified, the problem is converted into standard conic form using graph implementations and passed to a solver such as ECOS or MOSEK. We demonstrate CVXR's modeling framework with applications in engineering, statistical estimation, and machine learning. For more information, visit our CRAN page and the official website cvxr.rbind.io.

    ####################
    Jeffrey Wong, Senior Research Scientist at Netflix
    Scalable Causal Inference using R and Rcpp
    At Netflix we're building an experimentation platform to manage and scale the causal analysis of many experiments. To support advancing methodology in statistics, we need to build off of a stack that is familiar to data scientists and is performant. We are building the core statistical computing engine using Rcpp, and furthermore we have optimized the entire stack specifically for sparse data. In this talk we will show how we tackle various computational challenges in large scale causal inference, such as data manipulation, modeling, and treatment effect estimation.

    ######################
    Neil Gunther
    Why Are There No Giants? The Data Analytics of Scalability
    Abstract: The 30 ft. giant in 'Jack and the Beanstalk' is truly a fairy tale. No documented human has ever exceeded 10 feet. Why not? After examining the nonlinear constraints on mechanical scalability, we'll segue into how to apply nonlinear regression models in R to big data that determine the scalability constraints on ALL computer architectures and applications.
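    CVXR's "natural mathematical syntax" can be seen in a tiny problem (non-negative least squares; the data is invented, and CVXR is assumed to be installed):

```r
# Sketch: least squares with a non-negativity constraint in CVXR.
library(CVXR)

set.seed(1)
A <- matrix(rnorm(20), nrow = 10)
b <- rnorm(10)

beta <- Variable(2)                                   # decision variable
objective  <- Minimize(sum_squares(A %*% beta - b))   # convex objective
constraint <- list(beta >= 0)                         # written as math
problem    <- Problem(objective, constraint)

result <- solve(problem)   # DCP-verified, then handed to a conic solver
result$getValue(beta)      # the non-negative least squares solution
```

    The point of the DCP step is that the same declarative problem statement would be rejected, with an explanation, if the objective or constraints were not verifiably convex.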

  • "Official" August 2018 BARUG Meetup

    Instacart

    Lightning Talk Evening!
    Agenda:
    6:30 - Pizza and networking
    6:55 - Announcements
    7:00 - Eric Rynerson: Bar mekko charts in R
    7:15 - Andrew Mangano: Using R and Leaflet for Maps
    7:30 - Avantika Lal: Using Sparse Matrix Factorization to Identify Cancer
    7:45 - Neal Richardson: Wrapping Web APIs in R
    8:00 - Lynna Jirpongopas: Programming in R on AWS SageMaker
    8:15 - Jonathan Spring: Analyzing Visitation Patterns with the Tidyverse
    8:30 - Earl Hubbell: #TidyTuesday: Fun with the Tidyverse

    ####################
    Eric Rynerson
    Bar mekko charts in R
    If you've ever added variable width to your bar chart you were likely underwhelmed with the results. The bar mekko is a better alternative for drawing the audience's attention to what matters, and there is now a simple package for producing them in R. I'll explain when and why you should consider using the bar mekko chart and briefly demonstrate how to create one with the mekko package.

    ####################
    Andrew Mangano
    Using R and Leaflet for Maps
    I use Leaflet map functions quite a bit in my role as director of analytics and data science for Safeway eCommerce, and I've found the basics of leaflet to be both easy and engaging. I will show examples from my work along with additional general examples.

    ####################
    Avantika Lal
    Using Sparse Matrix Factorization to Identify Cancer
    The SparseSignatures package uses sparse matrix factorization to identify mutational signatures in cancer. Given a dataset of mutations in tumors, our method identifies signatures of individual mutational processes, like smoking, UV light exposure, and genetic defects, so it can be used to identify the underlying causes of a patient's cancer.

    ##################
    Neal Richardson
    Wrapping Web APIs in R
    There's lots of interesting and useful data available online, whether from open data sources or from subscription services such as Twitter and Google Analytics. Getting the data into R for exploration is the first challenge. While popular APIs have R packages that let you access them conveniently, sometimes you need to write your own API connections. In this talk, I'll show how to quickly write an API wrapper in R, working through a real example. The goal is to set up a minimal, robust, and extensible foundation for working with the API in R so that we can spend less time thinking about how to get the data in and more time doing fun things with it.

    ##################
    Lynna Jirpongopas
    Programming in R on AWS SageMaker
    The talk is intended for R programmers of all levels who are interested in leveraging AWS infrastructure. I'm a novice at using AWS SageMaker but have been using R in the RStudio environment for many years. Recently, I have been working with data hosted in AWS S3, and therefore AWS SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) became a smarter choice of tool for data analysis with R. SageMaker provides a hosted Jupyter Notebook, but what makes it more powerful than a regular notebook is the capability for data scientists to build, train, and deploy machine learning models all in one platform.

    #################
    Jonathan Spring
    Analyzing Visitation Patterns with the Tidyverse
    I manage the budget at SFMOMA and handle much of our data analysis. I will present some neat examples of analyzing our visitation patterns (when they visit, where from, repeat visits, etc.) using R and the tidyverse.

    #################
    Earl Hubbell
    #TidyTuesday: Fun with the Tidyverse
    TidyTuesday is a project of @rfordatascience in which datasets are published every week and explored using tidyverse tools for summarizing, arranging, and making meaningful graphs.
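    The "easy and engaging" Leaflet basics from Andrew's talk amount to a few chained calls (assuming the 'leaflet' package is installed; the coordinates below are San Francisco's, chosen purely as an example):

```r
# Sketch: a minimal interactive Leaflet map in R.
library(leaflet)

m <- leaflet() |>
  addTiles() |>                      # default OpenStreetMap base tiles
  addMarkers(lng = -122.4194, lat = 37.7749,
             popup = "San Francisco")

m   # printing the htmlwidget renders the map in the viewer / browser
```

    From there, layering on circles, polygons, and popups bound to data-frame columns is how store-level analytics maps like the ones in the talk get built.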
