Training Data—The Overlooked Area of Modern AI

Hosted by Kostya Kilimnik

Details

Y-DATA Meetup #21
Training Data—The Overlooked Area of Modern AI
The meetup will also be broadcast on Zoom: https://yandex.zoom.us/j/98329059482

Hosted by Taboola. Talks are in English. More info about Y-DATA.

Intro:
The era of modern AI started with the rise of big data. Once you have large amounts of logged, structured data, be it clicks on products in an online store, time spent on a certain webpage in a browser, or the percentage of paid credits in a bank, data science steps in.
However, in reality, the data is often either not structured or, even worse, does not exist at all.
For example, a voice assistant will only learn to activate correctly after the model analyses thousands of hours of speech recorded by different voices, in different accents, and amid background noise. Likewise, a search engine will only learn to rank the most relevant sites on top after “seeing” millions of pairs of user queries and web documents, each judged by the relevance of the match.
All the magic and power of artificial intelligence has a natural glass ceiling. And this ceiling is training data.

Agenda:
18:30 - 19:00 Registration, Mingling, Snacks & Beer

19:00 - 19:15 "Breaking the annotation barrier with synthetic data"
Lotem Peled-Cohen, ML Product Manager at Datagen

19:15 - 19:30 "Garbage in Garbage out: how to create quality datasets for quality models"
Inbal Horev, NLP team lead at Gong

19:30 - 19:45 "Quality at Scale: How Can Data Labeling Boost AI models"
Olga Megorskaya, founder & CEO at Toloka AI

20:00 - 20:45 Panel Discussion with experts in the field.

Abstracts:
1. "Breaking the annotation barrier with synthetic data"
The data-centric approach is the next frontier of the AI world, but it has its challenges and barriers. Data acquisition and annotation are a challenge we are all too familiar with, and as companies scale and require more data for larger models, this challenge becomes more and more painful. High-precision annotations are hard to come by, particularly for tasks that could benefit from 3D context; this is where synthetic data comes in, answering some of these pains.
In this talk Lotem will present the hidden power of synthetic annotations for CV tasks, the challenges of combining these with human annotations, and how pixel-perfect labels can help you break the annotation barrier for multiple CV use cases.

2. "Garbage in Garbage out: how to create quality datasets for quality models"
We all know that models are only as good as the data we feed them. However, building quality datasets is inherently difficult for many reasons. Gong is lucky enough to have an abundance of data and a group of dedicated labelers readily available. Nevertheless, dataset creation is something on which the company spends a significant portion of its time. And, as Gong supports more and more languages, this aspect becomes even more important. How do we label data efficiently in multiple languages? How do we perform error analysis in a language we don’t understand? In this talk, Inbal Horev, an NLP team lead at Gong, will present some of the data-centric challenges Gong faced and the processes the company put in place to solve them.

3. "Quality at Scale: How Can Data Labeling Boost AI models"
Building a modern AI system involves a series of steps from the general idea to production, and most of them involve some kind of data manipulation, from collecting training data to creating test cases for quality control. In her talk, Olga Megorskaya, CEO of Toloka AI, will share how AI companies can use data labeling platforms to build large-scale AI systems with high-quality data. She will demonstrate cases for data acquisition, data labeling, model quality assessment, and others.

Y-DATA meetups
Taboola.com Ltd.
Zeev Jabutinsky 2 · Ramat Gan