PyData @ Lusha

Details
We would like to thank Lusha for hosting us in person!
Agenda
18:00-18:30 Gathering and snacks
18:30-18:45 Welcome words from our host
18:45-19:15 Billion Scale Deduplication using ANN (Approximate Nearest Neighbours) | Idan Richman Goshen, Senior Data Scientist at Lusha
19:15-19:45 Network Anomaly Detection Using Transfer Learning Based on Auto-Encoders Loss Normalization | Dr. Aviv Yehezkel, Cynamics Co-Founder & CTO
19:45-20:00 A short break
20:00-20:30 Faster Pandas: Make your code run faster and consume less memory | Miki Tebeka, CEO of 353solutions
20:30-21:00 TBD
============================================
Billion Scale Deduplication using ANN (Approximate Nearest Neighbours) | Idan Richman Goshen, Senior Data Scientist at Lusha
At Lusha we deal with contact profiles, lots of contact profiles. This kind of data is messy by nature, and a single entity can have several representations. Beyond the time and money spent moving messy data through the various pipelines, it is difficult to search, and valuable information is lost along the way. Ideally we would merge all records of the same entity, even when they differ slightly ("Alagra Jones", "Alagra Smith-Jones"). Comparing all possible pairs is feasible at a small scale, but impossible with billions of records.
A family of algorithms known as approximate nearest neighbours (ANN) is becoming popular for solving such challenges, enabling the use of text embeddings and clustering at large scale.
This talk will offer a brief overview of ANN algorithms and demonstrate how we can apply them to get a reasonably sized subset of candidates, which we can then pass into a classifier for a match/no-match outcome. I'll demonstrate how we handle such a task at scale, how we evaluate the two steps, and the tools we use.
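As a rough illustration of the two-step pipeline the abstract describes, here is a minimal sketch. The names, the character n-gram vectorizer, and the similarity threshold are my own assumptions; scikit-learn's exact `NearestNeighbors` stands in for a real billion-scale ANN index (e.g. Faiss, Annoy, or ScaNN), and a simple distance threshold stands in for the match/no-match classifier:

```python
# Step 1: embed records and retrieve a small candidate set per record.
# Step 2: classify each candidate pair as match / no-match.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = [
    "Alagra Jones",
    "Alagra Smith-Jones",
    "John Doe",
    "Jane Roe",
]

# Character n-gram TF-IDF tolerates small spelling/formatting differences.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(names)

# Candidate generation: each record's nearest neighbours (itself + 1 here).
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X)

def candidate_pairs(threshold=0.5):
    """Keep only candidate pairs the 'classifier' (a threshold) accepts."""
    pairs = set()
    for i, (d_row, j_row) in enumerate(zip(dist, idx)):
        for d, j in zip(d_row, j_row):
            if i != j and d < threshold:
                pairs.add(tuple(sorted((i, int(j)))))
    return pairs

print(candidate_pairs())
```

In a real deployment the ANN index replaces the all-pairs comparison (reducing O(n²) to roughly O(n log n) candidate lookups), and a trained classifier replaces the threshold.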
============================================
Network Anomaly Detection Using Transfer Learning Based on Auto-Encoders Loss Normalization | Dr. Aviv Yehezkel, Cynamics Co-Founder & CTO
This talk presents the concept of "auto-encoder loss transfer learning". The approach normalizes auto-encoder losses across different model deployments, making it possible to detect and classify network anomalies in a generalized way that is agnostic to the specific client. The talk is based on a paper recently presented at AISec '21, co-located with ACM CCS.
============================================
Faster Pandas: Make your code run faster and consume less memory | Miki Tebeka, CEO of 353solutions
We'll start by reviewing the rules of the optimization club and why you shouldn't optimize.
After that, we'll see how to measure speed and memory consumption and how to find the bottlenecks in your code. Finally, we'll review some code samples and make them faster.
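A small taste of these themes (the examples are mine, not the speaker's): measure before optimizing, using `timeit` for speed and `memory_usage(deep=True)` for memory, then compare a row-wise `apply` with a vectorized operation and an `object` column with a `category` one:

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.default_rng(0).uniform(1, 100, size=100_000),
    "city": np.random.default_rng(1).choice(
        ["Tel Aviv", "Haifa", "Jerusalem"], size=100_000),
})

# Speed: row-wise apply vs. vectorized arithmetic (same result, huge gap).
t_apply = timeit.timeit(lambda: df["price"].apply(lambda p: p * 1.17), number=3)
t_vector = timeit.timeit(lambda: df["price"] * 1.17, number=3)
print(f"apply: {t_apply:.3f}s, vectorized: {t_vector:.3f}s")

# Memory: low-cardinality strings shrink dramatically as 'category'.
as_object = df["city"].memory_usage(deep=True)
as_category = df["city"].astype("category").memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```

The vectorized version avoids a Python-level function call per row, and the `category` dtype stores each distinct string once plus a small integer code per row.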