DataTalks #34: Advanced Data Validation ✔️💾


Our 34th DataTalks meetup will be hosted by Amazon Web Services on Floor28, and will focus on advanced topics in data validation for machine learning! You must register at the link to attend!
Registration: https://aws-experience.com/emea/tel-aviv/event/b827f5f7-0593-4680-a00f-235d067a345b
𝗔𝗴𝗲𝗻𝗱𝗮:
🍕 18:00 - 18:30 - Mingling, etc.
🔶 18:30 - 19:10 - Organizing your data sets for machine learning
🔷 19:15 - 20:00 - Validation & testing techniques through the phases of a DS project
***
Talk #1: Organizing your data sets for machine learning (versioning, validation and feature stores)
Speaker: Julian Sprung, AI/ML Specialist Solution Architect, AWS
Abstract: In this session we will look at the challenges and strategies to organize and manage your data sets for machine learning training and inference.
While code versioning and reproducible software builds are widely adopted, reproducible machine learning models require additional efforts to track, standardize, version and manage the data sets used for training as well as ensure the same conventions are applied during inference.
In the first part, we will look at data set versioning approaches such as manifest files and tools such as Git LFS or Data Version Control (DVC).
In the second part, we will look at how the concept of a feature store fits into the picture and how it can help your teams build reusable data repositories with company-wide standards, conventions and validations. Feature stores also provide means for ML lineage tracking, point-in-time feature lookups (time travel), feature discovery and feature sharing.
Last, we will have a quick look at the feature store landscape and walk through a short feature store demo with Amazon SageMaker Feature Store.
***
Talk #2: Validation & testing techniques through the phases of a DS project
Speaker: Aviram Berg, AI/ML researcher, former DS @ Weizmann Institute of Science
Abstract: In this session, we will cover different methods for testing and validating your data from experiments to production on structured and unstructured data.
Data is the core of every decision-making process, thus a data-centric company can better execute its strategy in alignment with the stakeholders' interests. While this is almost a consensus, companies don't validate their data enough and still rely on model-centric validation (such as a confusion matrix). After talking with ~40 Heads of Data Science at leading companies, I will share the best practices for validating and testing data across the different project phases.
In the first part of the lecture, we will cover the pros & cons of the leading tools in each category. These include testing tools such as dbt, and anomaly detection and validation tools such as Anodot or Monte Carlo. We will also cover how to apply data validation methods to unstructured data by synthetic data generation.
In the second part, we will fit those tools into different pipelines that serve different purposes, examining the challenges of connecting them together and choosing the right tools for your mission.
***
Registration using the link below is free but mandatory!
https://aws-experience.com/emea/tel-aviv/event/b827f5f7-0593-4680-a00f-235d067a345b
〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️〰️
Also, on the same date (June 21st), just before the meetup, we will host a SageMaker Hands-on Workshop from 14:00 to 17:00.
You are more than welcome to join us (please register first):
https://aws-experience.com/emea/tel-aviv/event/3cf102ad-1ca0-498c-8ac2-a4756ec29731
This is a hands-on workshop where you will learn how to use Amazon SageMaker while solving real-world ML challenges. In the workshop we will cover the end-to-end ML workflow, from data preparation and feature engineering, through tracking and management of the training process, to deployment of the model as a REST API. The workshop will use the Python programming language, and some basic machine learning knowledge will be expected.
Who should attend:
Data Scientists, ML Engineers, Developers
Pre-knowledge / Requirements:
A computer with access to high-speed internet will be needed. It should have a modern browser (recent Firefox or Google Chrome recommended). Basic knowledge of Python and machine learning is expected but not strictly required.
Agenda:
Introduction and Overview of Amazon SageMaker (20 mins)
Lab Module 1-2 (30 mins): Data exploration and feature engineering
Visual data (pre-)processing with SageMaker Data Wrangler (10 mins)
Lab Module 3-6 (60 mins): Building, training, deploying and invoking the machine learning model
Questions and Answers (15 mins)
