PyData Dublin #3


Details
We're excited to announce that the agenda for our third PyData Dublin meetup is now set! We have two great talks lined up, each with a focus on providing insights into industry approaches to data and machine learning architectures. See you at Udemy Dublin (https://www.udemy.com) on Wednesday, 29th November.
AGENDA:
[18:30 - 19:00] Registration, networking, and refreshments.
[19:00 - 19:40] Introducing StatwolfML, a pipeline-oriented data processing and machine learning approach by Krystyna Isakova & Pasquale Boemio.
[19:40 - 20:20] Understanding customer intent at scale; an architecture overview by Humberto Corona & Sergio González Sanz.
[20:20 - 20:30] Wrap-up.
TALK DETAILS:
Title: Introducing StatwolfML, a pipeline-oriented data processing and machine learning approach
Presenter: Krystyna Isakova (https://ie.linkedin.com/in/krystyna-isakova-28044519)
Presenter: Pasquale Boemio (https://www.linkedin.com/in/pboemio)
Abstract: In this talk, we would like to show you StatwolfML, a brand new data-oriented machine learning framework. It combines state-of-the-art technologies and provides a unified, pipeline-oriented interface for data preprocessing, modelling and prediction.
Title: Understanding customer intent at scale; an architecture overview
Presenter: Humberto Corona (https://ie.linkedin.com/in/humberto-corona-86394328)
Presenter: Sergio González Sanz (https://ie.linkedin.com/in/sergio-gonzález-sanz-4b369668)
Abstract: Zalando is a European fashion platform with a yearly revenue of ~3.6 billion euro. We have more than 20 million active customers and more than 200 million visits per month. Our tech department has around 1,700 people across 3 different countries. Operating in Germany is very interesting from a data protection point of view (especially for products like this one).
In this talk we present a technical overview of customer intent, a product that assigns a state (exploring, gathering, comparing or deciding) to each customer at any given point in their customer journey in the Zalando shop. We will introduce the problem of customer intent and briefly present our unsupervised approach to solving it, which uses a Hidden Markov Model. We will then explain the main challenges we faced at each step of building this product, and the lessons we learned from an engineering perspective.
We will introduce our architecture, the reasons behind using PySpark to build our product, and how we made extensive use of Apache Zeppelin notebooks and branch-specific deployments in AWS EMR clusters for early experimentation. We will then show how we rewrote parts of the Python hmmlearn library using PySpark to achieve almost linear scalability. Finally, we will explain our use of AWS Data Pipeline to run daily jobs for both feature creation and scoring of Zalando customers across 6 different countries, and how we support the use of our product by several personalization products in different contexts.
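For readers new to Hidden Markov Models, here is a minimal sketch of how a trained model could assign one of the four intent states to each event in a customer session, using Viterbi decoding. It is not the speakers' implementation: the event names and every probability below are hypothetical values chosen for illustration, and the unsupervised training step mentioned in the abstract (learning these parameters from data) is not shown.

```python
import math

# The four intent states named in the abstract.
STATES = ["exploring", "gathering", "comparing", "deciding"]

# Hypothetical event types; the abstract does not say which features are used.
EVENTS = ["browse_category", "view_product", "add_to_wishlist", "add_to_cart"]

# Hypothetical parameters (assumptions, not real Zalando figures).
START = {s: 0.25 for s in STATES}
# Each state most likely persists (0.4), otherwise moves to any other state (0.2).
TRANS = {a: {b: (0.4 if a == b else 0.2) for b in STATES} for a in STATES}
# Each state has one "typical" event (0.85); the other events are rare (0.05).
EMIT = {
    s: {e: (0.85 if i == j else 0.05) for j, e in enumerate(EVENTS)}
    for i, s in enumerate(STATES)
}


def viterbi(observations):
    """Return the most likely state sequence for a list of observed events."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state after the first event.
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Pick the predecessor state that maximises the path score into s.
            source = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            best[s] = prev[source] + math.log(TRANS[source][s]) + math.log(EMIT[s][obs])
            pointers[s] = source
        backpointers.append(pointers)
    # Walk the backpointers from the best final state to recover the full path.
    path = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


session = ["browse_category", "view_product", "add_to_wishlist", "add_to_cart"]
print(viterbi(session))
```

Working in log space avoids numerical underflow on long sessions. A production system would learn the parameters from data and run at a much larger scale, as the abstract describes.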
