37th meetup



NOTE: a valid photo ID is required



Note: Please use your full real names when signing up, otherwise you may be refused entry!

As always, there'll be free beer and pizza, generously provided by our host AHL.

We are still experimenting with issuing tickets via a lottery: if you want to be in with a chance of a place, sign up for the wait list! The lottery will be run approximately 1 week before the meetup. Closer to the event, we will either re-run the lottery or use the wait list to fill any spaces that free up.


Ed Cannon on How to choose the right social media Influencer

It is often the case that brand managers or PR agencies have a list of influencers that could be suitable for their marketing campaign, but which one(s) to choose? How will I make that decision? Will the influencers appeal to the target audience? Will I get the expected reach? Do they perform in line on the social media channels?

In this talk I will address these issues and introduce a novel approach based on the H-index to calculate how engaged an influencer is. I will show how I have managed to validate influencers at scale and how this work has been distributed as a service.
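For readers who have not met the H-index before, here is a minimal sketch of the classic definition applied to per-post engagement counts. It is only an illustration: the talk's actual scoring method is not described in this abstract, and the function name, the choice of "likes per post" as the input, and the two example influencers are all assumptions made up for this sketch.

```python
def h_index(engagement_counts):
    """Return the largest h such that h posts each have at least h engagements.

    This is the classic H-index definition; using per-post engagement
    counts as the input is an assumption made for illustration.
    """
    ranked = sorted(engagement_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical likes-per-post for two made-up influencers.
steady = [12, 9, 8, 7, 6, 5]   # consistently engaging posts
spiky = [500, 3, 2, 1, 0, 0]   # one viral post, little else

steady_score = h_index(steady)
spiky_score = h_index(spiky)
```

In this toy data the steady account scores higher than the spiky one, because a single viral post cannot raise the H-index on its own.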

Bio: Ed Cannon is a senior data scientist consultant at Capgemini and has been working there for almost 2 years. Prior to Capgemini, Ed headed up the data development effort at a social media analytics start-up. He has worked in the States for almost 4 years at a scientific software house developing cheminformatics software. He holds a PhD from Cambridge in cheminformatics. He has experience in retail, social media and pharmaceutical analytics and has been working in the data science space for over a decade now. In his spare time he enjoys coding in Python and Scala, going to the gym and skiing.


Karim Chine on RosettaHUB: A universal platform for open data science in the cloud

The RosettaHUB platform exposes a universal IDE for data scientists that breaks the silos between data science environments. The IDE makes it possible to interact with containerized hybrid kernels that glue together Python, R, Scala, SQL clients, Java, Matlab, Mathematica, etc. in a single process and allow those environments to share their variables in memory. A collaborative web spreadsheet exposes all the Python and R functions as formulas, maps the Python and R data to cells and can be fully programmed in Python and R. An Excel add-in makes it possible to control containers and Python/R/Scala cloud kernels from within Excel spreadsheets. A user-friendly reactive programming framework (a language-agnostic next-generation Shiny) makes it possible to create reactive data science microservices and interactive web applications.
The presentation will provide an overview and a demo of the platform and will focus on its key empowering features for data scientists and Python users in general.

Bio: Karim Chine is a London-based software architect and entrepreneur and the author of RosettaHUB. Previously, he held positions within academic research laboratories and industrial R&D departments, including Imperial College London, EBI, IBM, and Schlumberger. Karim’s interests include large-scale distributed software design, cloud computing applications in research and education, open source software ecosystems, and open science. Since 2009, he has collaborated with the European Commission as an independent expert for the research e-infrastructure program and for the future and emerging technologies program. He has also served as an evaluator and a reviewer of many of the EU’s flagship projects related to grids, desktop grids, scientific clouds, and science gateways.


Lightning talks:

Oliver Parson (http://hivehome.com) on Energy disaggregation

Wouldn't it be great if your electricity bill told you how much energy each appliance had consumed, rather than just the total cost over all of your appliances? This is the focus of energy disaggregation, which aims to produce this appliance-level breakdown with just a smart meter and machine learning algorithms, rather than the installation of a sensor on every appliance. In this talk I'll cover a range of publicly available data sets and algorithms, and also an open-source Python toolkit to get you started.

Ryan Varley on Feature importances in random forests

Many machine learning algorithms can be described as black boxes, but being able to understand how a model has arrived at its predictions is important both for explaining results to stakeholders and for our own trust in the model. Which features did it use? How much does the output rely on any one feature? Are there any signs of bias in our training data? One way to answer these types of questions is to look at feature importances.

In this talk we discuss some of the different measures of feature importances, their differences and how we use them at GrowthIntel. In particular we focus on random forests, the current scikit-learn implementation, and our pull request to add permutation importances.
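As a taster of the idea behind permutation importances, here is a rough plain-Python sketch: shuffle one feature column at a time and measure how much the model's accuracy drops. It is not the scikit-learn implementation or the pull request mentioned above; the function names, the toy data and the stand-in "model" are all hypothetical and chosen only to keep the example self-contained.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (made up): feature 0 fully determines the label, feature 1 is noise.
rows = [[i % 2, (i * 7) % 5] for i in range(20)]
labels = [row[0] for row in rows]


def toy_model(row):
    """Hypothetical stand-in for a fitted model; it only looks at feature 0."""
    return row[0]


importances = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores feature 1, shuffling that column leaves the accuracy unchanged, while shuffling feature 0 breaks the link between inputs and labels.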



Doors open at 6.30 pm (get there early as you have to sign in via AHL's security), talks start at 7 pm, beers from 9 pm in the bar. We normally have > 200 folks in the room so there are plenty of people to discuss data science questions with!

Please unRSVP in good time if you realise you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members!

Follow @pydatalondon (https://twitter.com/pydatalondon) for updates and early announcements. See you on the 5th!