PyData London - 45th Meetup


NOTE: A valid photo ID is required by building security. Please use your full real names when signing up, otherwise you may be refused entry!


As always, there'll be free food & drinks, generously provided by our host, AHL.

We are issuing tickets via a lottery: if you want a chance of a place, sign up for the waitlist! The lottery will be run approximately 1 week before the meetup. Closer to the event, we will re-run the lottery or use the waitlist to fill any spaces that free up.


Main Talks:

Trevor Sidery on "Forecasting at scale using PySpark":

Tesco relies on having accurate forecasts to power all parts of its business. From how much stock to order for a store, to how many staff are needed in the store, to how many delivery vans we need to fulfill customer demand – forecasts are everywhere. Historically, forecasts have been done by multiple teams with differing levels of analytics capabilities. Knowledge from each team was siloed and improvements in one forecast would not help others. We will talk about how we built a single framework for building hundreds of forecast models in parallel using PySpark. We will cover the different forecasting techniques used, how we evaluate performance, and how we developed the framework to be flexible enough to let us test many different forecasting techniques. We will also touch on the deployment of the solution on a large Hadoop cluster and how to put all this into production.

Philip Goddard on "Revolutionise your Machine Learning Workflow with Scikit-Learn Pipelines":

The Scikit-Learn library is one of the cornerstones of the Python stack for data science, providing a clean and consistent API for building machine learning models. However, because of the nuances a modeler will encounter with any data set, maintaining a clean, reproducible workflow can be challenging when faced with various permutations of feature selection and pre-processing before training an algorithm.

In this talk, I will demonstrate the features and advantages of a pipeline approach by using it in the context of a supervised machine learning task, specifically building a model to predict customer churn. I will show how pipelines can be used all the way from data pre-processing and feature selection through to model selection.

By using a pipeline approach, machine learning workflows can become increasingly elegant, modular and reusable. Within the Scikit-Learn implementation, only a small learning curve is required to obtain these advantages.
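
For readers who have not used pipelines before, the idea can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the synthetic data set stands in for real churn data, and the step names ("scale", "select", "model"), the choice of estimator and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for a churn data set
# (assumption: real churn data would need its own cleaning and encoding).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing, feature selection and the model chained into one estimator.
# The step names are arbitrary labels chosen for this sketch.
churn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Model selection: tune parameters of individual steps with the
# "<step name>__<parameter>" naming convention.
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(churn_pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# With scoring="roc_auc", score() reports ROC AUC on the held-out data.
test_auc = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Held-out ROC AUC:", round(test_auc, 3))
```

Because the scaler and feature selector sit inside the pipeline, they are re-fitted on the training portion of each cross-validation fold, which avoids leaking information from the validation data into pre-processing.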


Lightning Talks:

Pavlos Mitsoulis-Ntompos: "Sagify: Train/Deploy your models on AWS in a few simple steps"
A command-line utility to train and deploy ML/DL models on AWS SageMaker in a few simple steps!

Lucija Gregov: "Making an open source contribution to sklearn"



Doors open at 6.30 pm (get there early as you have to sign in via AHL's security), talks start at 7 pm, drinks from 9 pm in the bar. We normally have >200 folks in the room so there are plenty of people to discuss data science questions with!

Please unRSVP in good time if you realize you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members!

Follow @pydatalondon for updates and early announcements. See you on the 5th!