Selection Bias and Causal Inference
Y-DATA Meetup #9
Selection Bias and Causal Inference
Hosted by SimilarWeb
Talks are in English
Intro:
Data makes the world go round, in recent years more than ever before. As more and more products and processes become data-driven, the exact nature of the data used grows ever more important; at the same time, the potential harm from choosing the wrong data sources or sampling the existing data incorrectly keeps increasing. Any ML project or data-intensive process needs at some point to examine its data for potential biases and find ways to overcome them.
In this meetup we will look into selection bias and how to account for hidden issues in the sampling and data collection processes. We will discuss how to approach data gathered from multiple sources, from website engagement metrics to observational healthcare data: how to locate and overcome their inherent biases, and how to estimate a robust metric across those sources that is not susceptible to the biases they contain.
More info about Y-DATA is here: bit.ly/ydata-website
Videos of previous meetups are here: bit.ly/youtube-ydata
Agenda:
18:00 - 18:30 Registration, Mingling, Snacks & Beer
18:30 - 18:35 Greetings from SimilarWeb: Oded Vainas, Head of Research and Algorithms
18:35 - 19:15 Talk 1: "Not All Samples Were Born Equal" - Shuki Cohen, Data Scientist at SimilarWeb
19:15 - 19:30 Break
19:30 - 20:15 Talk 2: "From correlation to causation in healthcare data" - Ya'ara Goldschmidt, Director for Data Modelling at K Health
-----------
Talk Details:
Talk #1:
Not All Samples Were Born Equal
Abstract:
Have you ever wondered how to estimate a metric from several biased data sources? To address this problem, we will start with a naive solution and continue with a more advanced option based on causal inference. Specifically, we will show the intuition and mathematics behind the propensity score and how to use it to account for these biases. The talk ends with some practical tips and tools for dealing with similar problems.
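As a rough illustration of the idea (not the speaker's method), here is a minimal sketch of inverse-propensity weighting on a made-up toy sample. It assumes each unit's probability of being sampled (its propensity) is already known; in practice that probability usually has to be estimated with a model. All names and numbers are hypothetical.

```python
def naive_mean(values):
    """Plain average of the observed sample; ignores how it was sampled."""
    return sum(values) / len(values)


def ipw_mean(values, propensities):
    """Self-normalized inverse-propensity-weighted mean.

    Each observation is weighted by 1 / (its probability of being sampled),
    so under-sampled kinds of units count for more.
    """
    weights = [1.0 / p for p in propensities]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)


# Hypothetical population: 50 "heavy" users (metric = 10) and
# 50 "light" users (metric = 2), so the true population mean is 6.
# Assumed sampling: heavy users are observed with probability 0.8
# (40 observed), light users with probability 0.2 (10 observed).
sample_values = [10.0] * 40 + [2.0] * 10
sample_propensities = [0.8] * 40 + [0.2] * 10

naive = naive_mean(sample_values)                    # skewed toward heavy users
ipw = ipw_mean(sample_values, sample_propensities)   # reweighted estimate

print(f"naive estimate: {naive:.2f}")
print(f"IPW estimate:   {ipw:.2f}")
```

In this toy setup the naive average overstates the metric because heavy users are over-represented, while the weighted estimate matches the assumed population mean of 6.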
Bio:
Shuki is an experienced Data Scientist at SimilarWeb, in the Data Science & Big Data group. He is a co-founder and leader of JerusML, the AI community of Jerusalem. He holds a B.Sc. in Industrial Engineering from BGU and an M.Sc. from TAU, specializing in Data Science. Shuki enjoys using machine learning algorithms that unveil surprising insights into human nature.
Talk #2:
From correlation to causation in healthcare data
Abstract:
Observational data gathered during the practice of medicine in electronic health records holds great promise for developing AI-driven applications. At K Health we analyze over 400 million doctors’ notes, using natural language processing to pick up relevant symptoms. Using advanced modeling, the machine automatically learns the connections between symptoms, time, the patient’s age, gender, and other biological factors. To do so, we have to overcome the inherent biases in observational healthcare data. I will present our approach and the data challenges we face in building AI models from observational data in healthcare, along with some current research directions using causal inference.
Bio:
At K Health Ya'ara leads a multidisciplinary team of scientists and engineers who are experts in health informatics and machine learning.
She holds a PhD from the Weizmann Institute of Science in computer science and computational biology. In her previous role at IBM, Ya'ara served as Manager and Senior Technical Staff Member, where she initiated and led a number of projects in collaboration with clinical domain experts on massive volumes of healthcare data. These projects were geared towards developing cutting-edge ML and statistical tools that tackle an array of industry challenges in healthcare.
