6:30 PM - Pizza and networking
7:00 - Jeffrey Flint: Modeling the SF Municipal Bus System
7:20 - Dan Putler: Using R and Alteryx to Uncover
the Dimensions of Movie Ratings
7:50 - John Mount: Some statistics for data scientists:
issues you can and can not ignore
8:20 - Jim Porzak: Customer Segmentation with R
Municipal bus systems, at least in San Francisco, are notorious for being unreliable and, at the same time, very expensive to operate. This presentation investigates how to address both problems with a demand-based scheduling algorithm. Monte Carlo simulation is used to compare the performance of a demand-based bus scheduling algorithm with that of the conventional, periodic bus scheduling algorithm. Samples of the simulation are presented as an animation, using the R packages ggplot2 and shiny. Comprehensive static statistical visualizations are also presented using these packages.
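To give a feel for this kind of comparison, here is a toy Monte Carlo sketch in Python (not the speaker's R code or model). Everything in it is an assumption made for illustration: a single stop, at most one passenger arriving per minute, a periodic policy that dispatches a bus every `headway` minutes, and a hypothetical demand-based policy that dispatches as soon as `threshold` passengers are waiting.

```python
import random


def simulate(policy, minutes=600, arrival_prob=0.3, headway=10, threshold=3, seed=0):
    """Simulate one stop; return (mean passenger wait in minutes, bus departures)."""
    rng = random.Random(seed)
    waiting = []      # arrival times of passengers still at the stop
    total_wait = 0
    served = 0
    departures = 0
    for t in range(1, minutes + 1):
        # Assumed arrival process: at most one passenger per minute.
        if rng.random() < arrival_prob:
            waiting.append(t)
        if policy == "periodic":
            dispatch = (t % headway == 0)          # fixed timetable
        else:
            dispatch = len(waiting) >= threshold   # hypothetical demand rule
        if dispatch:
            total_wait += sum(t - arrived for arrived in waiting)
            served += len(waiting)
            waiting.clear()
            departures += 1
    mean_wait = total_wait / served if served else 0.0
    return mean_wait, departures


def monte_carlo(policy, runs=200, **kwargs):
    """Average the two metrics over many independently seeded runs."""
    results = [simulate(policy, seed=s, **kwargs) for s in range(runs)]
    mean_wait = sum(w for w, _ in results) / runs
    mean_departures = sum(d for _, d in results) / runs
    return mean_wait, mean_departures


if __name__ == "__main__":
    for policy in ("periodic", "demand"):
        wait, trips = monte_carlo(policy)
        print(policy, round(wait, 2), round(trips, 1))
```

Because both policies draw one random number per minute from the same seed, each run compares them on an identical arrival stream, which reduces noise in the comparison.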
Dan Putler, Chief Scientist at Alteryx
Objective attributes of movies (genre, length, star power, etc.) are not good predictors of movie ratings. In this talk, we will show how R combined with Alteryx can be used to uncover a set of more subjective (perceptual) measures that can be used to predict movie ratings with a high degree of accuracy. To do this, we make use of multiple data sets: the [MovieLens](http://grouplens.org/datasets/movielens/) dataset of individual-level reviews from "citizen" reviewers, and aggregated professional reviews, for both all reviewers and top reviewers, gathered from [Rotten Tomatoes](http://www.rottentomatoes.com/). The underlying analysis consists of three steps: (1) using the MovieLens citizen reviewer data from individuals for the 200 most frequently rated movies in that dataset, we construct a dissimilarity matrix of the movies using the adjusted cosine distance algorithm; (2) the dissimilarity matrix is then used in a multidimensional scaling (MDS) algorithm, and the most important dimensions are extracted; and (3) the extracted MDS dimensions are then used as predictors in a set of predictive models where the target variable is the aggregated Rotten Tomatoes rating for movies, and the efficacy of different models is compared using out-of-sample data. The talk concludes with a discussion of how our approach can be implemented in practice.
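Step (1) can be sketched roughly as follows, in Python rather than the R and Alteryx tooling used in the talk. The abstract does not spell out the exact formula, so this assumes the common definition of adjusted cosine similarity (each rating is centered on that user's mean rating, and only users who rated both movies contribute) and takes one minus the similarity as the dissimilarity. The `ratings` layout and the function names are illustrative.

```python
from math import sqrt


def adjusted_cosine_dissimilarity(ratings, item_a, item_b):
    """ratings maps user -> {item: rating}. Returns 1 - adjusted cosine similarity."""
    num = den_a = den_b = 0.0
    for user_ratings in ratings.values():
        if item_a in user_ratings and item_b in user_ratings:
            # Center on the user's own mean to remove harsh/generous rater bias.
            mean = sum(user_ratings.values()) / len(user_ratings)
            da = user_ratings[item_a] - mean
            db = user_ratings[item_b] - mean
            num += da * db
            den_a += da * da
            den_b += db * db
    if den_a == 0.0 or den_b == 0.0:
        return 1.0  # assumption: no usable overlap is treated as "unrelated"
    return 1.0 - num / (sqrt(den_a) * sqrt(den_b))


def dissimilarity_matrix(ratings, items):
    """Square, symmetric matrix (list of lists) with zeros on the diagonal."""
    return [
        [0.0 if a == b else adjusted_cosine_dissimilarity(ratings, a, b) for b in items]
        for a in items
    ]


if __name__ == "__main__":
    # Made-up ratings, purely for illustration.
    toy = {
        "u1": {"A": 5, "B": 4, "C": 1},
        "u2": {"A": 4, "B": 5, "C": 2},
        "u3": {"A": 1, "B": 2, "C": 5},
    }
    for row in dissimilarity_matrix(toy, ["A", "B", "C"]):
        print([round(x, 3) for x in row])
```

A matrix of this form is the kind of input an MDS routine expects in step (2); values range from 0 (identical rating patterns) to 2 (opposite patterns).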
“Some statistics for data scientists: issues you can and can not ignore.”
I’ll talk about a few classic statistical issues and try to point out which ones remain critical in the age of data science, and which ones we can worry a bit less about when we have large data sets. I’ll end with some tempting machine learning axioms that most data scientists wish were true.
John Mount is a Principal Consultant at Win-Vector LLC, a data science consulting company. He is one of the authors of “Practical Data Science with R” and a popular writer and speaker on data science concepts and foundations.
This talk is a deep dive into using Fritz Leisch's flexclust package to segment customers based on check-mark-type surveys or equivalent customer usage patterns. For this kind of data a non-geometric distance measure is needed. We demonstrate the use of the Jaccard distance measure. We also address three practical problems when using flexclust to segment customers: 1) the numbering problem - when starting seeds are different, cluster ordering will differ; 2) the stability problem - different starting seeds may result in quite different clusters; and 3) the "best" k problem - how do we know how many clusters to ask for?
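As a small illustration of two of the ideas above, here is a Python sketch (it does not use flexclust, which is an R package). `jaccard_distance` computes the Jaccard distance between two check-mark response vectors, where shared blanks do not count as agreement. `relabel_by_size` shows one simple convention for the numbering problem, renumbering clusters by decreasing size; this is an assumed approach for illustration and not necessarily the one presented in the talk.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)     # boxes checked by both
    either = sum(1 for x, y in zip(a, b) if x or y)    # boxes checked by at least one
    if either == 0:
        return 0.0  # assumption: two all-blank responses are treated as identical
    return 1.0 - both / either


def relabel_by_size(labels):
    """Renumber clusters 1..k by decreasing size.

    Ties are broken by the original label, so equal-sized clusters
    can still swap numbers between runs.
    """
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    order = sorted(counts, key=lambda lab: (-counts[lab], lab))
    mapping = {old: new for new, old in enumerate(order, start=1)}
    return [mapping[lab] for lab in labels]


if __name__ == "__main__":
    print(jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0]))
    # Two runs that found the same grouping under different arbitrary labels:
    print(relabel_by_size([3, 3, 1, 2, 3, 1]))
    print(relabel_by_size([7, 7, 9, 8, 7, 9]))
```

The two `relabel_by_size` calls return the same list, since the groupings are identical and only the arbitrary cluster numbers differ.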
Jim Porzak is a retired data scientist who has been using data to understand customers for the last dozen or so years. See his site ds4ci.org for archives of his many talks here and in Europe.