PyData @ Citi

Hosted By
Uri G.

Details

We would like to thank the Citi Innovation Lab for hosting us.
Agenda
18:00-18:30 Gathering
18:30-18:45 A word from our sponsor
18:45-19:15 Different Approaches for Document Augmentation (Dr. Shelly Aviv, Senior & Eylon Gueta, Citi)
19:15-19:30 Break
19:30-20:00 Generating Synthetic Data at Scale with the Help of Modern Execution Technologies (Or Sher / Datagen)
20:00-20:30 Semantic column matching (Ran Dan / Argmax)

Different Approaches for Document Augmentation

In the last few years, deep learning models and architectures have been evolving rapidly, resulting in ongoing improvement in performance on different NLP tasks. However, as advanced as cutting-edge models may be, one of the major bottlenecks in their daily use is the amount of annotated data available for training. Although different data augmentation methods have been applied successfully in image processing, data augmentation in NLP is still maturing. In this talk we will present different approaches for tackling limited dataset size using data augmentation and synthetic data generation. Text documents may contain several different formats of textual data. Our methodologies apply different kinds of augmentation, based on the input ontology and its positional coordinates in the document.

========================
Generating Synthetic Data at Scale with the Help of Modern Execution Technologies
Speaker: Or Sher, Infrastructure Team Lead, Datagen

Datagen started out creating synthetic images on on-premise consumer GPU machines, which did not provide the flexibility and scalability required for larger-scale operations.
We needed a scalable system that enables large-scale generation of 3D environments, a CPU-intensive process, and rendering of images from within those 3D environments, a GPU-intensive process.
This presentation will share our journey of building our internal K8s-based, cloud-agnostic system, which lets us provision and utilize thousands of GPU and CPU resources exactly and only when we need them.
We will cover aspects of reliability, performance, efficiency, cost optimization, and also:

What is synthetic data
The challenges of generating simulated data at scale while serving many customers
Architecture and coding challenges
Move fast and keep code clean

========================
Semantic column matching

In the data age, we are swamped by data sources with different naming conventions and query styles.
In this talk we will go over a solution we developed for a client to match column names and schemas across various data sources.
We will demonstrate how word2vec and dynamic programming assist us in semantic matching.
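
To give a rough feel for how word embeddings and dynamic programming could be combined for column matching, here is a minimal sketch. It is not the speaker's implementation: the abstract does not say how either technique is used. The sketch assumes column names are split into word tokens, tokens are compared by cosine similarity, and a dynamic-programming alignment (similar to a weighted longest common subsequence) scores each pair of names. The vectors in `TOY_VECTORS` are made-up stand-ins for real word2vec embeddings, and all function names are hypothetical.

```python
import math
import re

# Made-up 3-d vectors standing in for real word2vec embeddings (assumption).
TOY_VECTORS = {
    "customer": (0.9, 0.1, 0.0),
    "client": (0.8, 0.2, 0.1),
    "id": (0.1, 0.9, 0.0),
    "number": (0.2, 0.8, 0.1),
    "order": (0.0, 0.1, 0.9),
    "purchase": (0.1, 0.2, 0.8),
}


def tokenize(name):
    """Split a column name such as 'customerId' or 'order_date' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def token_similarity(a, b):
    """Identical tokens score 1.0; unknown tokens score 0.0; otherwise use cosine."""
    if a == b:
        return 1.0
    if a in TOY_VECTORS and b in TOY_VECTORS:
        return cosine(TOY_VECTORS[a], TOY_VECTORS[b])
    return 0.0


def alignment_score(a_tokens, b_tokens):
    """Best order-preserving token alignment, found by dynamic programming.

    best[i][j] holds the highest total similarity achievable when aligning
    the first i tokens of a with the first j tokens of b; skipping a token
    costs nothing. The result is normalized by the longer token count.
    """
    rows, cols = len(a_tokens) + 1, len(b_tokens) + 1
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = token_similarity(a_tokens[i - 1], b_tokens[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1] + pair)
    longest = max(len(a_tokens), len(b_tokens))
    return best[-1][-1] / longest if longest else 0.0


def match_columns(source_cols, target_cols):
    """Map each source column to the target column with the highest alignment score."""
    matches = {}
    for source in source_cols:
        source_tokens = tokenize(source)
        matches[source] = max(
            target_cols,
            key=lambda target: alignment_score(source_tokens, tokenize(target)),
        )
    return matches


if __name__ == "__main__":
    print(match_columns(["customerId", "order_date"], ["purchase_date", "client_number"]))
```

With these toy vectors, `customerId` pairs with `client_number` and `order_date` with `purchase_date`, even though the names share few literal tokens. In practice the toy dictionary would be replaced by embeddings trained on a relevant corpus, and a similarity threshold would likely be needed to reject poor matches.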

PyData Tel Aviv
Citi Innovation Lab TLV
· Tel Aviv-Yafo