[EVENTBRITE] Predictive Data Science in R & How to be a Successful Consultant

This is a past event

67 people went

Intel

2200 Mission College Blvd · Santa Clara, CA

How to find us

Parking and Entrance: see instructions below. PRE-LOADING SOFTWARE: see instructions below.

Details

TICKETS $[masked], SIGN UP THROUGH EVENTBRITE

https://www.eventbrite.com/e/predictive-data-science-in-r-tickets-35366885306?aff=MeetupSFbayACM

Tickets $[masked]. (For single tickets, the price is $150 from 7/26 to 9/15 at 3pm. After 9/15 at 3pm, the price for any ticket is $195. Before 9/15, if you sign up 2-6 people at a time, the price is $130 / person.)

We are seeking TAs who know R to help the audience. TA applicants should contact the instructor in advance (use the [contact] button on the left and send your email, phone, LinkedIn and R experience).

PARKING AND ENTRANCE:

Parking: Enter Intel main entrance at 2200 Mission College Blvd. Turn right immediately upon entering and park in visitor parking. (You may also park in Garage B which you can enter from near the corner of Mission College Blvd and Juliette Lane.)
Entrance: We are working to have a direct entrance close to the classroom. Once you park, go to the Star on the attached map and look for the ACM signs.

Otherwise, if you cannot find that entrance, go to the second floor of Garage B and walk over the short concrete skybridge to the Employee Entrance of building SC-9 and ask for the ACM Workshop.
Here's a map you might find useful. https://www.flickr.com/photos/joshb/320803384

PRE-LOADING: BEFORE THE CLASS, PREPARATIONS:

The class uses RStudio, the IDE you would typically use for R data mining projects at work.

• This UCLA RStudio Tutorial link (http://web.cs.ucla.edu/~gulzar/rstudio/index.html) documents the following steps, which will be helpful before you come to the class. It is recommended that you go over both the Installation and the short Basic Tutorial (if you don't already have this knowledge).

• Install R 3.3.3 or later https://cran.r-project.org/

• Install the RStudio Desktop IDE (https://www.rstudio.com/products/rstudio/) (free)

• If you install on Windows, it is strongly recommended that you follow this link to enable R to use your available memory (https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes), with --max-mem-size=xxxxMB. Install the devtools package.

• Install R libraries: data.table, Hmisc, gmodels, e1071, doMC (if you are on a Mac or Unix), doParallel (if on Windows), caret, rpart, randomForest, partykit, pROC, nnet, xgboost, ggplot2, zoo. (Check again a week before the class; the list may be updated.)

8 HR CLASS - SUMMARY (detailed outline follows): Go through a sprint of a predictive data mining project, introducing R as we go. Review the training process for regression, backpropagation neural nets, decision trees and XGBoost. Introduce R data.tables and the caret interface to 233 predictive algorithms. Focus on strategies to structure a successful project design and data pull. Review a variety of preprocessing and knowledge representation techniques. Provide questions you can take away and apply to the design of your future projects, to describe models to clients (sensitivity analysis code included) and to manage models over their natural lifecycle. Introduce R + Spark integrations, and show an example R Shiny web GUI.

TARGET AUDIENCE would include people who ...

• are comfortable programming

• may already work on consulting projects or in some technical business problem solving role.

• have tried R or had some basic exposure to R before the class (this is helpful). The focus is much more on "being successful with deploying Data Mining".

COURSE DESIGN: The instructor does not want to repeat "R in a Nutshell" or training that is "sequential and broad" (i.e. everything about data structure X, then everything about feature Y). That material is great for a longer training time frame. For students to get the most out of a one-day class, the instructor is focusing on a "narrow" path, like a project sprint, going through a complete set of steps in a data mining project. Many pointers will be provided to invite you to broaden your skills after the class.

The instructor likes the Covey quote "If the ladder is not leaning against the right wall, every step we take just gets us to the wrong place faster." (http://philosiblog.com/2015/12/17/if-the-ladder-is-not-leaning-against-the-right-wall-every-step-we-take-just-gets-us-to-the-wrong-place-faster/) A successful data mining project is not just coding and executing a function. Design is crucial. There is a gap that is not covered by Kaggle experience or by starting with a ready-made data set. The instructor focuses on general strategies that you can take away as questions to ask about your upcoming project, such as how to identify projects and how to structure a project for success.

CLASS DETAILED OUTLINE

Part 1: Get started and play with your data

Overview (and Lab 1a) of RStudio: basics of variables and lists, reading a CSV file into a data table, and ways to look at and manipulate the table. Discuss the HMEQ (Home Equity) data. The problem is to predict whether a person's loan would be good or bad.

Discuss a comparison / contrast of a few data mining algorithms: regression, neural nets, decision trees, XGBoost and ensemble models. Train a first decision tree on an existing training set (Lab 1b). Go to TensorFlow Playground (http://playground.tensorflow.org/) to try setting some neural net parameters and training them on different data sets.

Part 2: Data Science Project Design

• Model Evaluation Fundamentals, DS Model Loops or sprints

• Selling Data Mining to Executive Check Writers - assessing the upside of opportunities or the problem size

• Finding candidate projects with a Knowledge Discovery Workshop

• Data Mining Project Design and Objectives (accurate, general, understandable)

• Designing the training data to represent the production scoring data in the future.

• Retraining Frequency (daily or re-evaluate monthly)

• Reference Dates (separate analysis past from future)

• Target and Weight Variable Variations

• Business Metrics to Optimize, lift tables

• Big Data Production, Lambda and Kappa Architecture

• R data.tables lecture/Lab 2 on the HMEQ data table. Show the analogy with SQL: selecting rows, creating columns, aggregation. Write a small function. R macros get you unstuck and help you scale in complexity.
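
As an illustration of the lift tables mentioned in the list above, here is a minimal sketch. It is written in Python purely for brevity (the class itself works in R), and the function name `lift_table`, the bin count and the toy scores are assumptions made for this example, not course material. It ranks records by model score, splits them into equal-sized bins, and compares each bin's response rate with the overall rate.

```python
def lift_table(scores, outcomes, n_bins=5):
    """Rank records by score (highest first), split them into n_bins
    equal-sized bins, and report each bin's response rate and its lift
    (bin rate divided by the overall rate)."""
    if not scores or len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and equal length")
    # Sort (score, outcome) pairs so the best-scored records come first.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(outcomes) / len(outcomes)
    n = len(ranked)
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than records
        rate = sum(outcome for _, outcome in chunk) / len(chunk)
        lift = rate / overall_rate if overall_rate > 0 else float("nan")
        rows.append({"bin": b + 1, "count": len(chunk), "rate": rate, "lift": lift})
    return rows


# Toy data (hypothetical): 10 scored records, 3 of which responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
table = lift_table(scores, outcomes, n_bins=5)
for row in table:
    print(row)
```

A lift above 1 in the top bins means the model concentrates responders among its highest-scored records; a lift near 1 everywhere means the ranking is no better than random selection.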

Part 3: Preprocessing Design - Simple to Complex

• Review math requirements on input data, by algorithm. Focus on preparing a data set that can be loaded into almost any algorithm.

• Missing data handling (simple to sophisticated)

• Convert rules, queries or functions to detector fields [0…1] to capture “use cases” of behavior

• Convert observed frequency of “normal” to “rareness” detectors - for fraud detection.

• Lab 3: preprocessing your HMEQ data

• Fit linear models to time series within a record to extrapolate

• Time series: detect individual past behavior to adapt future estimate

• Don’t ignore input variables with 20+ categories, use DBC (Dependent by Category)

• Variable interactions: not A*B, DBC tables, clusters
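
One possible reading of the "rareness" detector idea from the list above is sketched here. It is written in Python purely for brevity (the class itself works in R). The scaling rule (1 minus the count relative to the most common value), the function names and the toy transaction categories are all assumptions for illustration; the class may well use a different formula.

```python
from collections import Counter


def fit_rareness(values):
    """Learn a rareness score in [0, 1] for each observed value.
    Assumed rule: the most common value scores 0.0, and rarer values
    score closer to 1.0 (1 - count / count_of_most_common)."""
    counts = Counter(values)
    if not counts:
        raise ValueError("need at least one observed value")
    most_common = max(counts.values())
    return {value: 1.0 - count / most_common for value, count in counts.items()}


def rareness(detector, value):
    """Score a new value; anything never seen before is maximally rare."""
    return detector.get(value, 1.0)


# Toy history (hypothetical): what "normal" looked like in the past.
history = ["grocery"] * 6 + ["fuel"] * 3 + ["jewelry"]
detector = fit_rareness(history)
for category in ["grocery", "fuel", "jewelry", "casino"]:
    print(category, round(rareness(detector, category), 3))
```

The resulting field lies in the same [0…1] range as the other detector fields, so it can be passed to a model as an ordinary numeric input.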

Part 4: Modeling Design, Spark

Model Notebook to track, to plan the design of experiments and to automate

Sensitivity Analysis: describe models (https://www.slideshare.net/gregmakowski/predictive-model-and-record-description-with-segmented-sensitivity-analysis-ssa) or model ensembles overall. Provide record level reasons. Explain how to detect model drift over time, and describe why.

Lab 4: train models, evaluate them, and run sensitivity analysis with the provided sensitivity code.
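
To give a rough feel for what a sensitivity analysis measures, here is a minimal sketch in Python (used purely for brevity; the class itself works in R). This is plain one-at-a-time sensitivity, not the segmented method described in the linked slides and not the code provided in the lab. The toy model, the field names and the `delta` step size are hypothetical.

```python
def sensitivity(model, record, delta=0.1):
    """One-at-a-time sensitivity: nudge each numeric input up and down
    by a small step while holding the others fixed, and report the
    change in model output per unit change in that input."""
    result = {}
    for name, value in record.items():
        # Relative step, falling back to an absolute step for zero values.
        step = abs(value) * delta if value != 0 else delta
        up = dict(record)
        up[name] = value + step
        down = dict(record)
        down[name] = value - step
        result[name] = (model(up) - model(down)) / (2 * step)
    return result


def toy_model(record):
    """Stand-in for a trained model (hypothetical linear score)."""
    return 2.0 * record["debt_ratio"] - 0.5 * record["years_on_job"]


record = {"debt_ratio": 0.4, "years_on_job": 5.0}
slopes = sensitivity(toy_model, record)
for name, slope in slopes.items():
    print(name, round(slope, 6))
```

For one record, the inputs with the largest absolute values indicate which variables that record's score reacts to most strongly, which is one way to provide record-level reasons.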

Discuss additional topics: Review available R + Spark combinations (Apache SparkR (https://spark.apache.org/docs/latest/sparkr.html), RStudio's sparklyr (http://spark.rstudio.com/), IBM's R4ML (https://github.com/SparkTC/r4ml)). Time permitting, discuss R web GUIs with Shiny & Shiny Dashboards, and RStudio's TensorFlow for R (https://tensorflow.rstudio.com/).

BEFORE THE CLASS, PREPARATIONS:

• For fun, play around with some neural nets at the TensorFlow Playground (http://playground.tensorflow.org). This will be covered in the class as well.

• You are invited to submit a description of your upcoming predictive projects or vertical. The instructor will review and may try to incorporate some ideas in the class. Through the meetup site, on the left margin, use the [contact] button.

SCHEDULE

8:00 - 8:30 arrive, register, coffee, network

8:30 - 10:30 lecture / lab

15 min break, coffee

10:45 - 12:45 lecture / lab

45 min break for lunch

1:30 - 3:30 lecture / lab

15 min break, coffee, small snacks

3:45 - 6:00 lecture / lab

15 min Q&A

ABOUT THE SPEAKER:

Greg Makowski (https://www.linkedin.com/in/gregmakowski/) has been deploying data mining models for 25 years (since before the terms Data Science or Data Mining were in use) as the "neural net guy" at American Express/Epsilon. He likes to "begin with the end": the business decisions and values to be made by the analytic system, the job function to be complemented and the deployment constraints. He has developed the analytic internals and automation for 6+ enterprise software systems or SaaS systems. His first convolutional neural net was trained in 1991, a Time Delay Neural Net for speech recognition. Vertical experience includes financial services (credit card, retail banking, bond pricing, ACH payments), fraud detection, customer relationship management (mail, phone, email, banner) and retail supply chain, among others. He always has something to learn from everybody.