High performance Python / Dask for Data Science

Hosted By
Patrick M. and 2 others
Details

The first talk starts at 7PM. We will take 5-10 questions after each talk.

Ian Ozsvald - Sprinting Pandas

Sometimes our Python Pandas code feels slow, and sometimes we can't fit enough data into RAM. Based on recent updates to the 2nd edition of Ian's High Performance Python book and his public training classes, come and learn how to get more into RAM (reducing your need for other technologies like Spark), how to quickly compile for significant speedups, how to run in parallel, and which libraries you're missing that unlock additional performance benefits. You'll leave with new techniques to make your DataFrames smaller and many ideas for processing your data faster.
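As a rough illustration of "getting more into RAM", here is a minimal sketch of shrinking a DataFrame by choosing narrower dtypes. It is not taken from the talk or the book. The column names, the synthetic data and the `shrink` helper are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data; the columns are hypothetical and only for illustration.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["London", "Leeds", "Bristol"], size=n),
    "count": rng.integers(0, 100, size=n),   # int64 by default
    "price": rng.random(n) * 100,            # float64 by default
})


def shrink(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` that uses narrower dtypes (hypothetical helper)."""
    out = frame.copy()
    # Few distinct strings: store integer codes plus a small lookup table.
    out["city"] = out["city"].astype("category")
    # Small non-negative integers fit in the smallest unsigned type.
    out["count"] = pd.to_numeric(out["count"], downcast="unsigned")
    # float32 halves the storage but loses precision; check that is acceptable.
    out["price"] = out["price"].astype("float32")
    return out


small = shrink(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

How much you save depends on your data: categories help most when a column has few distinct values, and downcasting is only safe when the value range and precision allow it.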

This talk is inspired by Ian's work updating his O'Reilly book High Performance Python to the 2nd edition for 2020. With over 10 years of evolution, the Pandas DataFrame library has gained a huge amount of functionality and is used by millions of Pythonistas - but the most obvious way to solve a task isn't always the fastest or the most RAM-efficient. This talk will help any Pandas user (beginner or beyond) process more data faster, making them more effective at their jobs.

Ian is a Chief Data Scientist and has worked in AI and Data Science, building teams and high-value IP, since 1999. He has published the 2nd edition of his High Performance Python book with O'Reilly, speaks and gives keynote talks internationally, and co-founded the 11,000-member PyDataLondon community, which has delivered 7 years of volunteer-run meetups and conferences to the community.

Carlo Scarioni - Dask for Data Science

At Simply Business we are increasingly developing Data Science projects to realise the promise of better business decisions assisted by ML models. Our Data Scientists and analysts are well versed in Python and its rich array of data-science libraries like Pandas, NumPy, SciPy and scikit-learn, which they use to clean, process and interpret data. These tools work great for relatively small data sets, but as we need to process more data and run more complex processing tasks, the amount of memory available on a single machine becomes a problem, with the scientists constantly hitting out-of-memory errors. This is where Dask fits. Dask offers a programming model as similar as possible to the standard library stack we mentioned (Pandas, NumPy, scikit-learn), but it adds the ability to scale to larger datasets by allowing disk use instead of RAM and parallelisation of work across machines.

Carlo is a Staff Data Engineer at Simply Business, https://sbtech.simplybusiness.co.uk/

London Python
Online event
This event has passed