Lessons from a Scikit-Learn Co-Founder and Fairness Classifiers

PyData Amsterdam

What we do


18:00 food
19:15 first talk
20:00 break
20:15 second talk
21:00 networking

Gaël Varoquaux, a co-founder of scikit-learn, will drop by to talk about democracy in machine learning, while our own Matthijs will talk about fairness in machine learning. It should be a fun evening for all who love to build pipelines in scikit-learn.

**Machine learning on non-curated data: Dirty data made easy**

According to industry surveys [1], the number one hassle for data scientists is cleaning data before they can analyze it. Textbook statistical modeling is sufficient for noisy signals, but errors of a discrete nature break the standard tools of machine learning. I will discuss how to easily run machine learning on data tables with two common dirty-data problems: missing values and non-normalized entries. For both problems, I will show how to run standard machine-learning tools such as scikit-learn in the presence of such errors. The talk will be didactic and will discuss simple software solutions. It will build on the latest improvements to scikit-learn for missing values and on the DirtyCat package [2] for non-normalized entries. I will also summarize theoretical analyses from recent machine learning publications.

[1] Kaggle, The State of ML and Data Science 2017, https://kaggle.com/surveys/2017
[2] https://dirty-cat.github.io/stable/
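
To give a flavour of the non-normalized entries problem, here is a minimal sketch in plain Python of encoding a dirty categorical column by string similarity instead of exact matching. It is only loosely inspired by the idea of similarity encoding and is not the DirtyCat implementation; the function names (`ngrams`, `ngram_similarity`, `similarity_encode`) and the example values are hypothetical, chosen for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    padded = f" {text.strip().lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as a vector of similarities to the prototype categories."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column: the same categories typed in different ways.
prototypes = ["police officer", "fire fighter"]
dirty_values = ["Police Officer", "police oficer", "firefighter"]

encoded = similarity_encode(dirty_values, prototypes)
for value, row in zip(dirty_values, encoded):
    print(value, [round(s, 2) for s in row])
```

Unlike one-hot encoding, a typo such as "police oficer" still lands close to "police officer" rather than becoming an unrelated new category, so the resulting numeric columns can be passed to a standard estimator.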

**Speaker Bio**
Gaël Varoquaux is an Inria faculty researcher working on data science and brain imaging. He holds a joint position at Inria (the French national research institute for computer science) and at the Neurospin brain research institute. His research focuses on using data and machine learning for scientific inference, applying it to brain-imaging data to understand cognition, as well as developing tools that make it easier for non-specialists to use machine learning. Years before the NSA, he was hoping to make bleeding-edge data processing available across new fields, and he has been working on a mastermind plan of building easy-to-use open-source software in Python. He is a core developer of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python using the scipy lecture notes.

**Pipelines for Fairness: A Convexing Use Case**
Machine learning is increasingly being used to automate decision making. If you’ve been working with machine learning long enough, though, you’ve probably seen your models make a mistake or two. These mistakes are usually just silly mishaps, but depending on the domain you’re in, they can be downright scary. Minimising the error on your test set is often just one of several real-world goals we might want to achieve. To me, improving how we tailor our models to achieve these goals is one of the most interesting aspects of being a data scientist.

For one of these real-world goals, namely fairness, I will discuss several paths we can go down to better tailor our models. Rather than discussing fairness at a high level, we’ll do a deep dive into the implementation of several methods that you can apply right away to hopefully improve the fairness of your models. In particular, we will discuss:

1. Measuring fairness;
2. Debiasing your dataset;
3. Constraining the unfairness of your model;
4. Applying post-processing on your models' output.

We will also cover how to implement these in scikit-learn pipelines. I hope that after this talk you will have gained some new insights on tailoring your models, even outside the domain of fairness.
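
As a rough illustration of the first and last items in the list above, the sketch below measures fairness with a demographic parity difference and then post-processes model scores with group-specific thresholds. This is an assumption about what such steps could look like, not the method presented in the talk: the metric choice, the function names, the toy scores, and the threshold values are all hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions within one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest positive rate across groups (0 is parity)."""
    rates = [positive_rate(predictions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)


def apply_group_thresholds(scores, groups, thresholds):
    """Post-processing: turn scores into 0/1 decisions using a per-group threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


# Hypothetical model scores and a sensitive attribute with two groups.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Same threshold for everyone versus a lowered threshold for group "b".
baseline = apply_group_thresholds(scores, groups, {"a": 0.55, "b": 0.55})
adjusted = apply_group_thresholds(scores, groups, {"a": 0.55, "b": 0.45})

print(demographic_parity_difference(baseline, groups))  # 0.5
print(demographic_parity_difference(adjusted, groups))  # 0.25
```

Demographic parity is only one of several fairness measures, and lowering a threshold trades some accuracy for a smaller gap. Wrapped in a custom transformer or meta-estimator, the same logic could sit at the end of a scikit-learn pipeline.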

**Speaker Bio**
Matthijs is the Data Science Lead at Xccelerated, where he is responsible for developing and teaching their data science curriculum. Besides that, he likes working on open source tools like scikit-lego. As of last year, he is the co-chair of the PyData Amsterdam conference.