- Teaching Lab R&D Chemists using RMD and Shiny with Mukul Mehta
Lab R&D chemists in the chemical, food, plastics, polymer, and pharma industries are very creative and constantly develop new products. Their R&D managers worry that R&D is very slow and struggle to speed up new product development (NPD). Many big companies, such as DuPont, Dow, BF Goodrich Chemical, and others, have tried to bring new statistical and mathematical methods in-house, but without much success. Mukul Mehta will examine several problems he has identified and an R/Shiny-based solution that he is developing to address them. Mukul hopes to train 10,000 lab R&D chemists over three years.
- Real-time file import with the vroom package (Jim Hester)
Basement Theatre of Crown Centre 1 Building
File import in R could be considered a solved problem, with multiple widely used packages (data.table, readr, and others) providing fast, robust import of common formats in addition to the functions available in base R. However, I feel there is still room for improvement in existing approaches. vroom is able to index and then query multi-gigabyte files, including those with categorical, text, and temporal data, in near real time, parsing at over 1 GB per second. This is a huge boon for interactive data analysis, as you can jump directly into exploratory analysis without sampling or long waits for a full import. vroom leverages the ALTREP framework introduced in R 3.5, along with lazy, just-in-time parsing of the data, to provide this improved latency without requiring changes to existing data manipulation code. I will thoroughly explain the techniques used in vroom to ensure good performance, describe challenges overcome in implementing it, and provide an interactive demonstration of its capabilities. vroom is on CRAN now; install it with `install.packages("vroom")` and learn more about the package at https://vroom.r-lib.org
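As a rough illustration of the workflow the abstract describes, the sketch below writes a small file and reads it back with vroom. It assumes the vroom package is installed; the built-in `mtcars` data set and the temporary file are stand-ins chosen for this example, not material from the talk.

```r
library(vroom)

# Write a small delimited file to a temporary path so the example is self-contained.
# (mtcars and the temp file are illustrative assumptions, not from the talk.)
path <- tempfile(fileext = ".csv")
vroom_write(mtcars, path, delim = ",")

# vroom() indexes the file; with ALTREP, values can be parsed lazily when first used.
cars <- vroom(path, delim = ",")

# Existing data manipulation code works on the result without changes.
mean_mpg <- mean(cars$mpg)
n_rows <- nrow(cars)
```

On a file this small the speed difference is not visible; the point is only that the result behaves like an ordinary data frame.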
- Contributing to Open Source with John Blischak
Are you interested in contributing to open source software projects, but unsure how to get started? With a focus on the R software ecosystem, I’ll describe open source software development, provide motivations for contributing to open source software, give some advice for getting started in open source, and detail some of the technical steps for making a contribution.
- Survey and Measure Development in R with George Mount
This will be an introduction to the survey and measure development process with emphasis on applications in R. We will first explore, visualize and clean self-report survey data, then walk through exploratory factor analysis. Finally, we will formally test our measurement model and use it to predict outside variables of interest. George is an independent analyst and blogger at georgejmount.com. He serves as a technical expert and mentor for Thinkful’s data analytics program and is the instructor of DataCamp’s “Survey and Measure Development in R.”
- Data Science Remote Workstation with Tim Hoolihan
This talk will walk through the process of setting up a remote machine to do machine learning or host Jupyter notebooks. The tips will be useful for a desktop you want to access on the go or for a cloud machine.
- Advancements in Deep Learning Privacy and Encryption with Jason Mancuso
In the future, our most impactful applications will rely on our most sensitive data, such as medical records, financial statements, location history, and voice transcripts. If this data were to end up in the wrong hands, it could be disastrous to the individual, erode trust in the company, and lead to legal implications with the rise of policies such as GDPR. However, with recent advances in computing power and cryptographic techniques, data privacy and machine learning no longer need to be adversaries. In this talk, we introduce the importance of data privacy for advancing machine learning applications across healthcare, finance, and transportation. We discuss the many different technologies enabling this future, such as differential privacy, secure multi-party computation, and garbled circuits, and how they can be used to train and deploy secure, privacy-preserving machine learning models. Finally, we demonstrate how you can use these technologies today using tf-encrypted (https://github.com/mortendahl/tf-encrypted), an open source library built on top of TensorFlow for secure, privacy-preserving machine learning. Jason is a research scientist at Dropout Labs, the founder of the Cleveland AI Group, and a member of the AI Village at DEFCON and OpenMined communities. He works on novel methods making machine learning more performant for privacy-preserving techniques like secure computation and differential privacy, most notably by contributing to the tf-encrypted project. He has worked on a variety of safety and security problems, including safe reinforcement learning, secure and verifiable agent auditing, and adversarial machine learning. His contributions to the OpenMined project formed the foundation of the current version of PySyft, a generic platform for privacy-preserving machine learning. His previous work with the Cleveland Clinic established a state of the art in blood test classification and demonstrated that machine learning can virtually eliminate the problem of medical malpractice due to contaminated blood samples. He graduated from John Carroll University with a B.Sc. in Mathematics.
- Introduction to debugging R and RStudio with Jim Hester
Bring your laptop as we learn about the available debugging tools in R, then try them out on some code examples in an interactive workshop. (Materials adapted from a section of the 'What They Forgot to Teach You About R' workshop Jim co-taught with Jenny Bryan at rstudio::conf 2019.)
- Facilitating Reproducible Academic Manuscript Development: The 'projects' Pkg
The `projects` R package introduces a dedicated workflow for academic research manuscripts, assisting researchers in their efficiency and organization in manuscript development. It supplies an intuitive file structure for storing raw and cleaned study data sets, and it provides customizable R Markdown templates for data analysis and manuscript development. Considering the growing emphasis that the contemporary scientific community places on reproducibility, the `projects` R package was created in the interest of facilitating reproducible research, particularly for team science. Nik Krieger was born and raised in North Royalton, Ohio. He obtained a Master of Science in Biostatistics from Case Western Reserve University last year and began work as an R-focused data scientist at the Cleveland Clinic upon graduation. He is on the Northeast Ohio Cohort for Atherosclerotic Risk Estimation (NEOCARE) investigative team, led by Drs. Jarrod Dalton and Adam Perzynski, which focuses on the impact of social determinants of health on the risk of atherosclerotic cardiovascular disease.
- It depends: A dialog about dependencies by Jim Hester
It depends (Jim's talk): Software dependencies can often be a double-edged sword. On one hand, they let you take advantage of others' work, giving your software marvelous new features and reducing bugs. On the other hand, they can change, causing your software to break unexpectedly and increasing your maintenance burden. These problems occur everywhere: in R scripts, R packages, Shiny applications, and deployed ML pipelines. So when should you take a dependency and when should you avoid one? Well, it depends! This talk will show ways to weigh the pros and cons of a given dependency and provide tools for calculating the weights for your project. It will also provide strategies for dealing with dependency changes and, if needed, removing dependencies. We will demonstrate these techniques with some real-life cases from packages in the tidyverse and r-lib.
vctrs & type-stability (Hadley's talk): vctrs is a new package that provides tools (cognitive and computational) to ensure that functions behave consistently with respect to inputs of varying length and type. The end goal is for vctrs to be invisible to the end user of the tidyverse (simply enabling their predictions about function outputs to be more correct), while helping developers write functions that "just work".
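To make the idea of type-stability concrete, here is a minimal base R sketch. It does not use the vctrs API; it only contrasts a base function whose return type depends on its input with one whose return type is declared up front, which is the kind of inconsistency the abstract alludes to.

```r
# sapply() picks its return type from the data: an empty input yields a list.
unstable_empty <- sapply(integer(0), function(x) x * 2)
unstable_full <- sapply(1:3, function(x) x * 2)
class(unstable_empty)  # "list"
class(unstable_full)   # "numeric"

# vapply() declares the return type up front, so both cases stay numeric.
stable_empty <- vapply(integer(0), function(x) x * 2, numeric(1))
stable_full <- vapply(1:3, function(x) x * 2, numeric(1))
class(stable_empty)  # "numeric"
class(stable_full)   # "numeric"
```

Code that calls the `sapply()` version has to special-case empty input, while the `vapply()` version lets callers predict the output type from the call alone.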
- John Blischak: Introduction to the conda package manager for R users
Have you ever been excited to try a new software package only to end up frustrated after spending hours getting it to install correctly? How often do you return to a project from a few months ago to find that the code is broken due to software updates? Do you and your collaborators struggle to run each other's code on your own computers? Is your GitHub issue tracker full of reports from users unable to install your software on operating systems that you are unfamiliar with? Do you ever wish you could test out the latest version of a software package without affecting your current setup? If you've ever faced any of these challenges, there is a better way to install and manage scientific software across your various projects. Conda (https://conda.io/docs/) is a cross-platform, language-agnostic package management system and environment management system. And it's not just for Python packages! Conda can also manage R packages and many other dependencies you might need for your data science projects. I will cover how conda compares to other package and environment management systems, how to set up conda and install packages, how to create isolated computational environments for individual projects, and how to build and share your own conda packages.