We have two speakers in September:
- Martin Loetzsch *Lightweight ETL pipelines with Mara*
- Mariano Semelman *Query expansion using semantic query embeddings*
*Lightweight ETL pipelines with Mara*
In the past few years, data warehousing has gone through a radical transition from click-based ETL tools to data pipelines defined in code. In the process, the field of “data engineering” was born, Python became the dominant language for describing data integration pipelines, and Apache Airflow emerged as the dominant framework in the field. However, for most companies that don’t operate at the scale of Airbnb, Airflow is overkill when the task is to integrate a few GB or TB of data. In this talk, I will introduce Mara, a lightweight, opinionated ETL framework halfway between Airflow and plain Python scripts, with a focus on transparency and complexity reduction. It condenses the lessons from 6 years of building data warehouses for more than 20 of the portfolio companies of Project A. I will guide you through some of the design decisions behind the platform and share some general lessons for setting up successful data engineering teams.
Martin Loetzsch works at Project A, a Berlin-based operational VC focusing on digital business models. As Chief Data Officer, he has helped many of Project A’s portfolio companies form teams that build data warehouses and other data-driven applications. Before joining Project A (with a short interlude at Rocket Internet), he worked in artificial intelligence labs in Paris and Brussels on computational linguistics and robotics. He received a PhD in computer science from the Humboldt University of Berlin.
*Query expansion using semantic query embeddings*
In the world’s largest flea market, finding that perfect something can be challenging. OLX powers marketplaces across 40 countries all over the world, and every single item listed is unique. Navigating such large and diverse marketplaces can be intimidating and error-prone; this is why we developed our Query Expansion technology to help users find the items they love.
When people search on our site, they don’t always remember how to spell brand names, the possible abbreviations of the product they are looking for, or the nicknames people use for products. They wonder: should I write playstation 4, play station 4, or ps4? How is peugeot spelled? This leads to confusion among both our buyers and sellers and creates barriers for a successful
To solve this, we developed a synonym expansion solution that runs in real time at the moment a user executes a query. These synonyms were created using embeddings. In this case, user behaviour was used as implicit information to generate these semantic embeddings. Join us to learn more about how we implemented it, the challenges and surprises we found along the way, and the impact it has had on the product.
Mariano Semelman is a Senior Data Scientist at OLX Tech Hub Berlin, originally from Argentina. He holds an M.Sc. in computer science with a specialization in recommender systems and personalization. He has almost 10 years of engineering experience and has worked in ecommerce as a data scientist for the last 6 years. He currently works on the Personalization and Relevance team, improving recommenders and search functionality.