• "Official" November BARUG Meetup
    Agenda:
    6:30 PM - Pizza and Networking
    7:00 - Announcements
    7:05 - Anirudh Acharya - MXNet-R Lightning talk
    7:20 - Tomas Nykodym - MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    7:45 - Javier Luraschi - Introduction to MLflow with R
    8:10 - Norm Matloff - PolyanNA, a Novel, Prediction-Oriented R Package for Missing Values
    ##############
    Anirudh Acharya
    Introduction to the MXNet-R package
    Apache (Incubating) MXNet (https://github.com/apache/incubator-mxnet) is a modern open-source deep learning framework used to train and deploy deep neural networks. It is scalable and supports multiple programming languages including C++, Python, Julia, R, Scala, and Perl. I will briefly introduce the MXNet-R package and run an example.
    ################
    Tomas Nykodym
    MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    ML development brings many complexities beyond the traditional software development lifecycle, including evaluating multiple algorithms and parameters, setting up reproducible workflows, and integrating distinct systems into production models. In this talk, I will present MLflow, a new open source project from Databricks that provides an open ML platform where organizations can use the ML libraries and development tools of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
    Tomas Nykodym is an ML/AI Platform Engineer at Databricks working on MLflow. He spent the last 6 years working on cutting-edge distributed machine learning projects at H2O.ai and Databricks. His professional interests include distributed computing, applied math, and machine learning.
    ###############
    Javier Luraschi
    Introduction to MLflow with R
    This talk will teach you how to use MLflow from R to track model parameters and results, share models with non-R users, and fine-tune models at scale. It will present the installation steps, common workflows, and resources available for R. It will also demonstrate using MLflow tracking, projects, and models directly from R, as well as reusing R models in MLflow. (A minimal tracking sketch follows the abstracts below.)
    Javier is a Software Engineer at RStudio working on R packages, most notably sparklyr, cloudml, r2d3, and mlflow.
    ##############
    Norm Matloff
    PolyanNA, a Novel, Prediction-Oriented R Package for Missing Values
    Though there is a vast literature on techniques for handling missing values, almost all of it is focused on estimation rather than on prediction. Here we present a novel approach developed specifically for use in prediction applications, implemented in an R package, 'polyanNA'. It can be used in both parametric and machine learning settings, and is very fast computationally. (Joint work with Pete Mohanty.)
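    The minimal MLflow tracking sketch referenced above, assuming the mlflow R package is installed and pointed at a local tracking store; the parameter, model, and metric are purely illustrative, not from the talk:

    library(mlflow)

    run <- mlflow_start_run()

    alpha <- 0.5                                  # hypothetical tuning parameter
    mlflow_log_param("alpha", alpha)

    fit <- lm(mpg ~ wt + hp, data = mtcars)       # stand-in for a real model
    rmse <- sqrt(mean(residuals(fit)^2))
    mlflow_log_metric("rmse", rmse)               # recorded against this run

    mlflow_end_run()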

    Databricks

    160 Spear, 13th Floor · San Francisco, CA

    9 comments
  • "Official" October 2018 BARUG Meetup
    Agenda:
    7:00 Announcements
    7:05 Dave Hurst - Sharing Artifacts in a Corporate Environment
    7:20 Peter Li - Introduction to the cholera package
    7:50 Ryan Moran - Bozo Blocking: rapid development of fraud detection models over local entity networks
    8:20 Michael Kevane - Things that go wrong in R exercises for undergraduates: Scatterplots, wrangling, maps, sentiment analysis, RMarkdown
    ==========================
    Peter Li
    Introduction to the 'cholera' package
    John Snow's map of the 1854 cholera outbreak in London's Soho is a classic example of data visualization. For Snow, the map helped to support his two then-contested, if not controversial, claims: that cholera is a waterborne disease and that the water pump on Broad Street was the source of the outbreak. To evaluate whether the map does or can actually support such claims, I created the 'cholera' R package (CRAN and GitHub). The package allows you to explore, analyze, and test the data embedded in the map. It does so by computing and plotting a pump's neighborhood: the set of locations defined by their "proximity" to a pump. The talk will focus on the tools and techniques used to compute and visualize these "pump neighborhoods" and will include examples (all in R) of everything from orthogonal projection to more specialized topics like Voronoi tessellation ('deldir'), spatial data analysis ('sp'), graph/network analysis ('igraph'), generic functions (e.g., S3 generic functions), and embarrassingly parallel problems ('parallel'). (A small Voronoi sketch follows the abstracts below.)
    ===============================
    Ryan Moran
    Data + Fraud, Bandcamp.com
    Bozo Blocking: rapid development of fraud detection models over local entity networks
    Fraud mitigation is a demanding challenge, particularly for a small organization like Bandcamp.com. To enhance defenses against an ever-churning tide of brilliant villains, we developed a platform in R to reduce the labor, time, and resources required to prepare, fit, and compare complex regression models. In this talk, we'll first cover several core aspects of system design, including only-as-necessary parallel predictor computation, persistent high-performance MySQL caching, and automated parameter optimization. For the finale, we'll explore the platform's most powerful capability: "local" network summary predictors, particularly those computed over the outputs of other predictive models.
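    The small Voronoi sketch referenced in Peter Li's abstract above, using the deldir package with made-up pump coordinates; it illustrates the "neighborhood" idea only and does not use the cholera package's own data or API:

    library(deldir)

    # Hypothetical pump locations on an arbitrary map grid.
    pumps <- data.frame(
      x = c(12.5, 10.2, 14.8, 9.1, 13.3),
      y = c(11.7,  9.4,  9.9, 12.6, 13.1)
    )

    # Voronoi (Dirichlet) tessellation: each tile is the set of points
    # closer to its pump than to any other -- one notion of "pump neighborhood".
    tess <- deldir(pumps$x, pumps$y)

    plot(tess, wlines = "tess", wpoints = "real")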

    Leavey School of Business at Santa Clara University

    Address: 500 El Camino Real, Santa Clara, CA 95053. Campus location: Forbes Room in Lucas Hall (Business School). · Santa Clara, CA

    4 comments
  • "Official" September BARUG meetup
    Agenda:
    6:30 PM - Pizza and networking
    6:55 - Announcements
    7:00 - Isaac Faber: MatrixDS, A Data Science Workbench
    7:15 - Anqi Fu: Disciplined Convex Optimization with CVXR
    7:45 - Jeffrey Wong: Scalable Causal Inference using R and Rcpp
    8:15 - Neil Gunther: Why Are There No Giants? The Data Analytics of Scalability
    #################
    Isaac Faber
    MatrixDS, A Data Science Workbench
    ###################
    Anqi Fu
    Disciplined Convex Optimization with CVXR
    Abstract: CVXR is an R package that provides an object-oriented modeling language for convex optimization, similar to CVX, CVXPY, YALMIP, and Convex.jl. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and set of constraints by combining constants, variables, and parameters using a library of functions with known mathematical properties. CVXR then applies disciplined convex programming (DCP) to verify the problem's convexity. Once verified, the problem is converted into standard conic form using graph implementations and passed to a solver such as ECOS or MOSEK. We demonstrate CVXR's modeling framework with applications in engineering, statistical estimation, and machine learning. For more information, visit our CRAN page and the official website cvxr.rbind.io. (A short example appears after the abstracts below.)
    ####################
    Jeffrey Wong, Senior Research Scientist at Netflix
    Scalable Causal Inference using R and Rcpp
    At Netflix we're building an experimentation platform to manage and scale the causal analysis of many experiments. To support advancing methodology in statistics, we need to build on a stack that is familiar to data scientists and is performant. We are building the core statistical computing engine using Rcpp, and we have optimized the entire stack specifically for sparse data. In this talk we will show how we tackle various computational challenges in large-scale causal inference, such as data manipulation, modeling, and treatment effect estimation.
    ######################
    Neil Gunther
    Why Are There No Giants? The Data Analytics of Scalability
    Abstract: The 30 ft. giant in 'Jack and the Beanstalk' is truly a fairy tale. No documented human has ever exceeded 10 feet. Why not? After examining the nonlinear constraints on mechanical scalability, we'll segue into how to apply nonlinear regression models in R to big data in order to determine the scalability constraints on ALL computer architectures and applications.
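    The short CVXR example referenced in Anqi Fu's abstract above: non-negative least squares on simulated data, showing the variable/objective/constraint pattern the abstract describes. The data and variable names are illustrative, not from the talk:

    library(CVXR)

    set.seed(1)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)
    beta_true <- c(2, 0, 1, 0, 3)
    y <- X %*% beta_true + rnorm(n)

    # Decision variable and DCP-compliant objective: least squares
    # with a non-negativity constraint on the coefficients.
    beta <- Variable(p)
    objective <- Minimize(sum_squares(y - X %*% beta))
    constraints <- list(beta >= 0)

    problem <- Problem(objective, constraints)
    result <- solve(problem)      # converted to conic form and sent to a solver (e.g., ECOS)

    result$getValue(beta)         # estimated non-negative coefficients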

    Hacker Dojo

    3350 Thomas Road · Santa Clara, CA

    2 comments
  • "Official" August 2018 BARUG Meetup
    Lightning Talk Evening!
    Agenda:
    6:30 - Pizza and networking
    6:55 - Announcements
    7:00 - Eric Rynerson - Bar mekko charts in R
    7:15 - Andrew Mangano - Using R and Leaflet for Maps
    7:30 - Avantika Lal - Using Sparse Matrix Factorization to Identify Cancer
    7:45 - Neal Richardson - Wrapping Web APIs in R
    8:00 - Lynna Jirpongopas - Programming in R on AWS SageMaker
    8:15 - Jonathan Spring - Analyzing Visitation Patterns with the Tidyverse
    8:30 - Earl Hubbell - #TidyTuesday: Fun with the Tidyverse
    ####################
    Eric Rynerson
    Bar mekko charts in R
    If you've ever added variable widths to your bar chart, you were likely underwhelmed with the results. The bar mekko is a better alternative for drawing the audience's attention to what matters, and there is now a simple package for producing them in R. I'll explain when and why you should consider using bar mekko charts and briefly demonstrate how to create them with the mekko package.
    ####################
    Andrew Mangano
    Using R and Leaflet for Maps
    I use Leaflet map functions quite a bit in my role as director of analytics and data science for Safeway eCommerce, and I've found the basics of leaflet to be both easy and engaging. I will show examples from my work along with additional general examples.
    ####################
    Avantika Lal
    Using Sparse Matrix Factorization to Identify Cancer
    The SparseSignatures package uses sparse matrix factorization to identify mutational signatures in cancer. Given a dataset of mutations in tumors, our method identifies signatures of individual mutational processes such as smoking, UV light exposure, and genetic defects, so it can be used to identify the underlying causes of a patient's cancer.
    ##################
    Neal Richardson
    Wrapping Web APIs in R
    There's lots of interesting and useful data available online, whether from open data sources or from subscription services such as Twitter and Google Analytics. Getting the data into R for exploration is the first challenge. While popular APIs have R packages that let you access them conveniently, sometimes you need to write your own API connections. In this talk, I'll show how to quickly write an API wrapper in R, working through a real example. The goal is to set up a minimal, robust, and extensible foundation for working with the API in R so that we can spend less time thinking about how to get the data in and more time doing fun things with it. (A skeletal example appears below.)
    ##################
    Lynna Jirpongopas
    Programming in R on AWS SageMaker
    The talk is intended for R programmers of all levels who are interested in leveraging AWS infrastructure. I'm a novice at using AWS SageMaker but have been using R in the RStudio environment for many years. Recently, I have been working with data hosted in AWS S3, and therefore AWS SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) became a smarter choice of tool for data analysis with R. SageMaker provides hosted Jupyter notebooks, but what makes them more powerful than a regular notebook is the ability for data scientists to build, train, and deploy their machine learning models all in one platform.
    #################
    Jonathan Spring
    Analyzing Visitation Patterns with the Tidyverse
    I manage the budget at SFMOMA and handle much of our data analysis. I will present some neat examples of analyzing our visitation patterns (when they visit, where from, repeat visits, etc.) using R and the tidyverse.
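    The skeletal API-wrapper example referenced in Neal Richardson's abstract above, using httr and jsonlite against a placeholder endpoint; the URL, query parameter, and helper name are illustrative only, not from the talk:

    library(httr)
    library(jsonlite)

    # Minimal wrapper around a hypothetical JSON endpoint.
    get_widgets <- function(since, api_key = Sys.getenv("EXAMPLE_API_KEY")) {
      resp <- GET(
        "https://api.example.com/v1/widgets",      # placeholder URL
        query = list(since = since),
        add_headers(Authorization = paste("Bearer", api_key))
      )
      stop_for_status(resp)                        # fail loudly on HTTP errors
      parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
      as.data.frame(parsed)
    }

    # Usage: widgets <- get_widgets(since = "2018-08-01")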
    #################
    Earl Hubbell
    #TidyTuesday: Fun with the Tidyverse
    TidyTuesday is a project of @rfordatascience in which datasets are published every week and explored using the tidyverse tools for summarizing, arranging, and making meaningful graphs.
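    A tiny illustration of the summarize-arrange-graph pattern mentioned above, shown on the built-in mtcars data rather than an actual TidyTuesday weekly dataset:

    library(dplyr)
    library(ggplot2)

    mtcars %>%
      group_by(cyl) %>%                       # summarize ...
      summarise(mean_mpg = mean(mpg)) %>%
      arrange(desc(mean_mpg)) %>%             # ... arrange ...
      ggplot(aes(factor(cyl), mean_mpg)) +    # ... and graph
      geom_col() +
      labs(x = "Cylinders", y = "Mean MPG")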

    Instacart

    50 Beale Street #600 · San Francisco, CA

    5 comments
  • "Official" June BARUG Meetup
    6:30 - Pizza and Networking
    7:00 - Announcements
    7:05 - Siddhartha Bagaria - Calling Go from R
    7:35 - Bohdan Bohdanovich Khomtchouk - Bioquilt: fight aging with computation
    8:10 - Norm Matloff - Functional Approximation without Hyperparameters: Polynomial Regression as Alternatives to Neural Nets
    ###########
    Siddhartha Bagaria
    Calling Go from R
    Abstract: We will demonstrate how to link Go code from R and translate basic Go data types into R SEXP types. This allows existing Go code to be exposed through an R interface without having to read/write intermediate files, call shell commands, or set up an RPC system.
    #############
    Bohdan Bohdanovich Khomtchouk
    Bioquilt: fight aging with computation
    Abstract: Bioquilt's aim is to use AI and machine learning to provide visually accessible and interpretable data visualizations for age-related diseases (http://www.bioquilt.com/).
    Bio: Bohdan Khomtchouk, Ph.D. is a data science postdoctoral fellow working in the field of computational epigenetics in the Gozani Lab at Stanford University in Stanford, CA, USA. Bohdan's research involves understanding the data science behind aging-related diseases as well as creating artificial intelligence and machine learning software to organize the world's biological information at massive scale (to read more: https://profiles.stanford.edu/bohdan-khomtchouk).
    #############
    Norm Matloff
    Functional Approximation without Hyperparameters: Polynomial Regression as Alternatives to Neural Nets
    Abstract: Despite the success of neural networks (NNs), there is still a concern among many as to their "black box" nature. Why do they work? Here we present an analytic argument that NNs are "close cousins" of polynomial regression models, with a well-defined correspondence (if not equivalence) between the two approaches.
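    A plain base-R illustration of the polynomial-regression idea in Matloff's abstract above, using lm() with poly() on simulated data; this is not the speaker's own package or code:

    set.seed(42)

    # Simulated nonlinear relationship, the kind a small neural net might be fit to.
    x <- runif(500, -3, 3)
    y <- sin(2 * x) + 0.3 * x^2 + rnorm(500, sd = 0.2)

    # Degree-6 polynomial regression as a simple, transparent functional approximator.
    fit <- lm(y ~ poly(x, degree = 6))

    # Compare the fitted curve to the data.
    ord <- order(x)
    plot(x, y, pch = 16, cex = 0.4, col = "grey60")
    lines(x[ord], fitted(fit)[ord], col = "red", lwd = 2)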

    GRAIL

    1525 O'Brien Drive · Menlo Park

    3 comments
  • Official May 2018 BARUG Meeting
    Agenda:
    6:30 - Pizza and Networking
    7:00 - Announcements
    7:05 - Karthik Mokashi: Deep Learning in Predicting Market Movements
    7:30 - John Mount: rquery: a Query Generator for Working With SQL Data Sources From R
    8:10 - Dave Hurst: Double dribble: Two practical use cases for the googledrive package
    ##############
    Karthik Mokashi
    Deep Learning in Predicting Market Movements
    This talk gives a preview of work in progress. In this initiative we examine whether neural networks can reliably predict the forward movement of the S&P 500 index. We examine over 50 years of daily S&P returns and engineer the data set for processing by a neural network. We have used H2O with R and TensorFlow/Python to build and validate our results.
    ################
    John Mount, Win-Vector LLC
    rquery: a Query Generator for Working With SQL Data Sources From R
    rquery (https://github.com/WinVector/rquery) is an R package for data wrangling on SQL databases and Spark. I will start with a demonstration of "piped SQL" and move on to powerful operator pipelines. Piped SQL allows those merely familiar with SQL to build up expert-level multi-stage data transforms using fragments of SQL and legible pipe notation for composition (much clearer than typical SQL nesting). From there I will move on to powerful non-SQL operator notation based on Codd's work and influenced by experience working with SQL and dplyr at big data scale. Such piped operator notation allows even those unfamiliar with SQL to build, test, use, and maintain powerful data processing pipelines at big data scale. The rquery system includes simple and regular rules for building up data processing pipelines from basic primitives and includes powerful operations such as SQL window functions. rquery is a "query first" package where both data processing pipelines and SQL queries are inspectable objects. rquery has proven to generate high-performance queries and to be reliable in managing complex data workflows at scale.
    John Mount is a data scientist working for the consulting firm Win-Vector LLC. He is one of the authors of the popular data science book "Practical Data Science with R" (Zumel, Mount; Manning 2014) and a frequent author and speaker on machine learning and data science topics. He is also a frequent contributor to the popular Win-Vector technical blog: http://www.win-vector.com/blog/.
    ##############
    Dave Hurst
    Double dribble: Two practical use cases for the googledrive package
    googledrive is a tidyverse package that allows you to interact with files on Google Drive from R. We'll explore practical use cases that demonstrate the usefulness of the package as well as the basics of working with dribbles and list-columns. Dribbles are an interesting example of 'rectangling' data that may not easily fit into a data frame, so many of the techniques used by the package have broader application for handling data in R.
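    A minimal googledrive sketch along the lines of Dave Hurst's abstract above, showing the kind of dribble-returning calls involved; the file names are placeholders and an interactive OAuth prompt is assumed:

    library(googledrive)

    drive_auth()                                   # interactive OAuth sign-in

    # drive_find() returns a "dribble": a tibble with one row per Drive file,
    # including a list-column of file metadata.
    reports <- drive_find(pattern = "report", n_max = 20)
    print(reports)

    # Upload a local file and download one of the files found above (placeholder names).
    drive_upload("summary.csv", name = "summary-2018-05.csv")
    drive_download(reports[1, ], path = "latest-report.csv", overwrite = TRUE)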

    Intuit, Building 20

    2600 Marine Way · Mountain View, CA

    6 comments
  • Official April BARUG Meetup
    This meetup is being sponsored by Microsoft Developer Advocates and RStudio.
    Agenda:
    6:30 - Pizza and networking
    6:55 - Announcements
    7:00 - David Smith: Speeding up computations in R with parallel programming in the cloud
    7:20 - Dan Putler: Locating Opioid Treatment Centers in Under Served Areas Using R and Alteryx
    8:15 - Ali Zaidi, Bob Horton, Mario Inchiosa, and Omar Alonso: Labeling Data on a Budget with Active Learning and Transfer Learning
    #############
    David Smith
    Speeding up computations in R with parallel programming in the cloud
    There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and grid-based computations are just a few examples. In this talk, I'll provide a review of tools for implementing embarrassingly parallel computations in R, including the built-in "parallel" package and extensions such as the "foreach" package. I'll also demonstrate how you can reduce the time for a complex computation -- optimizing hyperparameters for a predictive model with the "caret" package -- by using a cluster of parallel R sessions in the cloud. With the "doAzureParallel" package, I'll show how you can create a cluster of virtual machines running R in Azure, parallelize the problem by registering the cluster as a backend for "foreach", and shut down the cluster when the computation is complete, all with just a few lines of R code. (A small local-cluster sketch follows the abstracts below.)
    #############
    Dan Putler
    Locating Opioid Treatment Centers in Under Served Areas Using R and Alteryx
    In 2016 there were nearly 64,000 drug overdose deaths in the United States, the lion's share of them due to opioid abuse. It is now recognized that America is facing an opioid epidemic. Treatment of opioid addiction is one of the primary tools available for addressing the epidemic. However, many of the areas hardest hit by opioid use are believed to be underserved from a treatment perspective. One issue currently impeding the location of treatment facilities is the lack of fine-grained data on the location of individuals who abuse opioids. This talk presents an app designed to assist public health officials and others in locating opioid treatment facilities in underserved areas. Estimates of the number of adults who abuse opioids at the census tract level are developed using R, along with data from the U.S. Department of Health and Human Services' National Survey on Drug Use and Health and both census tract level data and the microdata sample from the U.S. Census Bureau's American Community Survey. The census tract level estimates of adult opioid abusers are used, along with data on the locations of existing opioid treatment facilities, to locate new facilities in areas that are farther than ten miles from existing facilities and to maximize the estimated number of abusers within a ten-mile radius of the new facilities. The optimization is done using an evolutionary algorithm implemented in Alteryx. The resulting application can then be deployed on the web via the Alteryx Gallery.
    ########
    Ali Zaidi, Bob Horton, Mario Inchiosa, and Omar Alonso
    Labeling Data on a Budget with Active Learning and Transfer Learning
    Many organizations have access to large amounts of data, but find it challenging to train supervised machine learning models because it is laborious or expensive to label the training cases. We will work through two simplified examples of active learning, one with text classification and one with image classification. By iteratively labeling small numbers of cases and building models to aid in the selection of additional cases to label, we can build much higher-performance models from a given number of cases than we would likely achieve if we randomly chose cases to label. For both the text and image classifiers, we will use a type of transfer learning to represent each case as a vector of floating point numbers that serve as low-complexity features in conventional machine learning algorithms. These examples will serve as a basis for a broader discussion of efficient approaches to obtaining crowdsourced labels at scale.
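    The small local-cluster sketch referenced in David Smith's abstract above: the foreach pattern with a doParallel backend. With doAzureParallel, an Azure VM pool would be created and registered as the backend in the same way; the simulation itself is made up:

    library(foreach)
    library(doParallel)

    # Register a local backend of 4 workers.
    cl <- makeCluster(4)
    registerDoParallel(cl)

    # Embarrassingly parallel simulation: each iteration is independent.
    results <- foreach(i = 1:100, .combine = rbind) %dopar% {
      x <- rnorm(1e5)
      data.frame(rep = i, mean = mean(x), sd = sd(x))
    }

    stopCluster(cl)
    head(results)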

    Microsoft Reactor

    680 Folsom Street · San Francisco, CA

    14 comments
  • Official March 2018 Meetup
    This event was canceled.
  • March 2018 Meetup
    Agenda:
    6:30 Pizza and networking
    7:00 Announcements
    7:05 Pete Mohanty: Analyzing rtweet data with kerasformula
    7:35 J.J. Allaire: Machine Learning with TensorFlow and R
    ##################################
    J.J. Allaire
    Machine Learning with TensorFlow and R
    TensorFlow is an open-source software library for numerical computation using data flow graphs. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well. In this talk we'll cover the R interface to TensorFlow (https://tensorflow.rstudio.com), a suite of packages that provide high-level interfaces to both deep learning models (Keras) and standard regression and classification models (Estimators), as well as tools for cloud training, experiment management, and production deployment.
    ##################################
    Pete Mohanty
    Analyzing rtweet data with kerasformula
    Now on CRAN, kerasformula offers a high-level interface to keras for R. Many classic machine learning tutorials assume that data come in a relatively homogeneous form (e.g., pixels for digit recognition, or word counts or ranks), which can make coding somewhat cumbersome when data are contained in a heterogeneous data frame. kerasformula takes advantage of the flexibility of R formulas to smooth this process. kerasformula builds dense neural nets and, after fitting them, returns a single object with predictions, measures of fit, and details about the function call. kerasformula accepts a number of parameters, including the loss and activation functions found in keras. kerasformula also accepts compiled keras_model_sequential objects, allowing for even further customization. This talk introduces the library and shows how it can aid in model building and hyperparameter selection (e.g., batch size), starting with raw data gathered using library(rtweet). Everything is based on: https://tensorflow.rstudio.com/blog/analyzing-rtweet-data-with-kerasformula.html
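    A minimal sketch of the formula interface described above, assuming kerasformula's kms() function and a data frame of tweets already gathered with rtweet; the column names are illustrative and the call is a guess at typical usage, not code from the talk:

    library(kerasformula)

    # `tweets` is assumed to be a data frame of tweets gathered with rtweet,
    # with retweet_count, screen_name, and source among its columns.
    fit <- kms(log10(retweet_count + 1) ~ screen_name + source, data = tweets)

    # The returned object bundles predictions and measures of fit;
    # see the package documentation for its exact components.
    fit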

    Google

    1175 Borregas Ave, Sunnyvale, CA 94089 · Sunnyvale, CA

    25 comments
  • Official February BARUG Meetup
    Agenda:
    6:30 Pizza and Networking
    7:00 Announcements
    7:05 Nicholas Lewin-Koh - A brief introduction to the power of multistate models (in R)
    7:30 Ali Zaidi - Reinforcement learning in Minecraft with CNTK-R
    7:50 Nicholas Tierney - "Why is that missing?": Practical tools for missing data
    ###############################
    Nick Tierney
    "Why is that missing?": Practical tools for missing data
    Missing data are part of doing data analysis in the real world. Unfortunately, there are a surprising number of gotchas when dealing with missing data - weird missing values like "N / A", "NOT avail.", and "-99" might crop up, missing values might disappear from visualizations, and statistical models sometimes silently drop missing values or fail to handle them at all. Knowing how to identify where missing values are, how extensive the missing data problem is, and why the values are missing is crucial to making the right inferences from your data and models. In this talk, I discuss principles and tools for working with missing data in R, taking you from identifying where missing values are, to exploring why data are missing, to investigating imputation methods.
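    A short dplyr sketch of the first step described above: recoding disguised missing values to genuine NAs and tallying how extensive the problem is. The data frame and sentinel values are made up, and the talk's own tooling may differ:

    library(dplyr)

    # Toy data with disguised missing values of the kind mentioned above.
    survey <- data.frame(
      age    = c(34, -99, 51, 28, -99),
      income = c("52000", "N / A", "61000", "NOT avail.", "48000"),
      stringsAsFactors = FALSE
    )

    # Recode the sentinels to NA.
    clean <- survey %>%
      mutate(
        age    = na_if(age, -99),
        income = as.numeric(na_if(na_if(income, "N / A"), "NOT avail."))
      )

    # How extensive is the missingness, per variable?
    colSums(is.na(clean))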

    Capital One - SoMa

    201 3rd Street · San Francisco

    13 comments