38th meetup



NOTE: a valid photo ID is required



Note: Please use your full real names when signing up, otherwise you may be refused entry!

As always, there'll be free beer and pizza, generously provided by our host AHL.

We are still experimenting with issuing tickets via a lottery. If you want to be in with a chance of a place, sign up for the wait list! The lottery will be run approx 1 week before the meetup, and nearer the event we will either re-run the lottery or use the wait list to fill any spaces that free up.


Andrew Stretton on Frictionless Data, Frictionless Development: Building a scalable data converter, processor and warehouse with tabulator, tableschema-py and datapackage-pipelines

A common problem in Data Engineering is how to create a platform capable both of importing and exporting tabular data in numerous formats and of maintaining a change history of the data while users update and query it.

Tools like Trifacta (Google Cloud Dataprep) provide a turnkey solution to part of the pipeline but the open source Frictionless Data tools from OKFN can provide a simpler subset of these features tailored to your requirements.

Just as pandas is built around the DataFrame, the Frictionless Data approach uses data packages consisting of a JSON table schema and a data URI. These schemata can be easily generated for any dataset and work well for a number of applications: validating new data with tools like Goodtables or tableschema-py; building a data update interface with tools such as Handsontable; creating declarative data processing pipelines, via datapackage-pipelines, that a front end can easily interact with; pushing data into various databases and repository tools such as CKAN datastore; and extending the schema to allow export to linked data formats such as IIIF.

The talk will cover these use cases and compare them with the approaches taken by other open-source data science / BI tools such as Datashape with ODO from Continuum and Superset from Airbnb. I will aim to demonstrate that lightweight web standards like datapackages speed up the development process.

Bio: Andy is a software engineer at Zegami working on a next-generation visual analytics tool that combines images and data in a unique way. Andy's career adventure started in the lab researching, among other things, OLED materials for Samsung. Via a small fumehood explosion, he found himself temping in data validation for a Word of Mouth marketing agency. As he enjoyed the environment but not the work, he set about learning to code and spent 2009 to 2014 creating various big data social analytics apps on a shoestring. At this point he decided to go back into academia for a couple of years, working on an interesting Cheminformatics project with a whimsical thought that he might meet some great people to start a company. Proving that you should always follow your heart, that whimsical thought became a reality and Zegami was spun out of the University of Oxford in 2016 with Andy as the first hire. Outside of work, Andy enjoys family time and long bike rides.

Slawomir Tulski on Robust extraction of web data with Python

The modern web is an endless source of all kinds of data. Viewing that data only through a web browser is limiting. This is where web scrapers come into play. The ability to programmatically access and extract internet resources opens a broad new range of possibilities for data scientists and data engineers.

This talk aims to explain methodology and technologies which extend beyond simple HTML parsing. You will see that crawling a whole website is often unnecessary, and that you can often get all the data you need without even seeing the page source. You will realize that there are a lot of hidden APIs around the web which you can plug in to, and that you do not need to be afraid of getting what you need from dynamic pages loaded with JavaScript.


Lightning talks:

Miroslav Batchkarov on PyOrbital

Miroslav will introduce https://github.com/skimit/orbital, a tool for distributing and versioning private resources through S3.

Michal Mucha on Reducing runtimes with Cython

Michal will compare performance of Cython vs NumPy vs pandas.



Doors open at 6.30 pm (get there early as you have to sign in via AHL's security), talks start at 7 pm, beers from 9 pm in the bar. We normally have > 200 folks in the room, so there are plenty of people to discuss data science questions with!

Please unRSVP in good time if you realise you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members!

Follow @pydatalondon (https://twitter.com/pydatalondon) for updates and early announcements. See you on the 3rd!