“Know Your Data!” Lessons Learned
Details
We are starting a Y-DATA talk series called Data Driven Products. This series of standalone lectures brings in Data Science and Machine Learning experts from the industry to talk about the "real-world" aspects and challenges of creating ML-based products. Each week a different guest lecturer introduces their company and its ML-based products and speaks in detail about the technical and product challenges inherent to their work. These talks aim to familiarize the community with the day-to-day challenges of data science work in different domains and to introduce some of the leading tech companies in the field, their products and their unique features.
The talks are given to Y-DATA students at the TAU campus as part of the Fall semester curriculum. We open most of these lectures to our meetup community via Zoom livestream.
Talk #3
19/11 12:15-13:15
Knowing your data is a crucial factor in Machine Learning. We are all familiar with the term “Garbage in, Garbage out” (or GIGO for short), widely used in the statistics and data science fields to illustrate the fact that the quality of the output received from an ML model depends greatly on the quality of the information that was input. If your data is not valid or accurate, your results are worthless. “Garbage data” can be data that is simply filled with errors, outliers, missing values and artifacts, but it can also be data that has no applicability to the problem at hand.
The solution is to take out your data trash: spend less time on “fit/predict” and more time on crunching and validating the input data, to ensure that the right sort of data goes into the model. In this talk I will tackle this problem of data integrity for Machine Learning purposes. I will go over some highly recommended data-driven methodologies and best practices for ensuring the quality of the training data used in ML modeling. I will present several use cases from my experience, ranging from the simplest artifacts in data to the more complex and promiscuous ones.
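As a rough illustration of what validating input data before “fit/predict” can look like, here is a minimal sketch in plain Python. It is not taken from the talk: the function name audit_column, the sample column ages, and the choice of Tukey's 1.5 * IQR rule for outliers are all assumptions made for this example.

```python
import math
import statistics


def audit_column(values):
    """Return basic integrity counts for one numeric column (hypothetical helper)."""
    # Treat both None and NaN as missing values.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n_missing = len(values) - len(present)

    outliers = []
    if len(present) >= 4:
        # Tukey's rule (an assumption here): flag points lying more than
        # 1.5 * IQR below the first quartile or above the third quartile.
        q1, _, q3 = statistics.quantiles(present, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = [v for v in present if v < low or v > high]

    return {"n_rows": len(values), "n_missing": n_missing, "outliers": outliers}


# Made-up sample column: two missing entries and one implausible age.
ages = [34, 29, None, 41, 38, 27, 999, 33, float("nan"), 36]
report = audit_column(ages)
print(report)
```

On this sample the report counts two missing entries and flags 999 as an outlier. Real data would usually need further checks (duplicates, inconsistent units, label leakage) that this sketch does not attempt.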
Target audience:
Advanced beginners to senior-level data scientists, machine learning practitioners and SW developers curious about data science and machine learning. A basic ability to read Python code is highly recommended.
Speaker:
Gershon Celniker is a Lab Group Manager at General Motors. He was previously Head of Data and AI research at CLEW and led ML and data-driven research at Check Point, Verint and Wiser. He holds a B.Sc. from the Technion and an M.Sc. from the Hebrew University in Bioinformatics and Machine Learning applications. Gershon has extensive academic experience as a CS researcher at the Weizmann Institute and Tel Aviv University, where he took part in the development of the very popular bioinformatics products GeneCards and ConSurf. Currently his main research interests lie in the design of ML algorithms and their applications in autonomous driving, AI explainability and behavior analytics.
