About us
Submit a talk: https://london.pydata.org/submit-a-talk/
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
The PyData Code of Conduct governs this meetup. To discuss any issues or concerns relating to the code of conduct or the behavior of anyone at a PyData meetup, please contact NumFOCUS Executive Director Leah Silen (+1 512-222-5449; [leah@numfocus.org](mailto:leah@numfocus.org)) or the group organizer.
Upcoming events

PyData London - 107th Meetup
Venue: Riverbank House, 2 Swan Ln, London EC4R 3AD, GB
Please note:
1. 🚨🚨🚨 A valid photo ID is required by building security. 🚨🚨🚨
2. This event follows the NumFOCUS Code of Conduct. Please familiarise yourself with it before attending, and get in touch with the organisers with any questions or concerns.
If your RSVP status says "You're going", you will be able to get in. There is no need to show your RSVP confirmation when signing in.
If you can no longer make it, please unRSVP as soon as possible.
As always, there will be free food and drinks, generously provided by our host, Man Group.
Main Talks
- Robin Linacre - Rapid deduplication and fuzzy matching of large datasets using Splink
This talk introduces Splink, a free and open source Python library for probabilistic record linkage and deduplication at scale.
I’ll explain the core ideas behind the approach, show how it can be used in practice, and reflect on the opportunities of building open source tools from within the public sector.
Data deduplication is a common data quality challenge: it arises whenever multiple records are collected about the same person, organisation, or entity without a shared unique identifier.
This talk will introduce the core ideas behind probabilistic record linkage and show how Splink, a free and open source Python library developed at the UK Ministry of Justice, can be used to link and deduplicate large datasets quickly and accurately.
I’ll cover the practical steps involved in training linkage models, using unsupervised learning, improving performance and accuracy, and visualising results and diagnostics. I’ll also touch on what it has been like to build and maintain open source tooling from within the UK public sector, and the opportunities this creates for solving problems at citizen scale.
The talk is aimed at people with basic experience working with tabular data in Python, and assumes no prior knowledge of record linkage.
- Filip Makraduli - Small Model Inference Infrastructure: Lessons Beyond vLLM
Embedding models are evolving rapidly, from dense-only architectures to hybrid dense+sparse (BGE-M3), multi-vector late interaction (ColBERT), and vision-language models (ColQwen2, SigLIP). Yet while LLM serving infrastructure is well established (vLLM, SGLang), small model deployment still relies on inefficient single-model containers or custom-built systems. Production teams face a dilemma: deploy dozens of single-model containers with 5-10% GPU utilization, or invest months building complex multi-model infrastructure. This talk explains some of the key concepts of embedding inference and how it differs from LLM inference.
- Ian Ozsvald - Build your own LLM, Live, with MicroGPT
Andrej Karpathy's MicroGPT is a single-file, no-dependency mini GPT that you can run live with me. We'll talk about how GPT works, look at the code, and reflect on the Torch-based nanoGPT "big brother", and you'll end the talk with stronger intuitions about how our next-best-token-guessing overlords work. Plus, you'll have a mini GPT running on your laptop in 200 lines of Python.
--------------------------------
Logistics
- Doors open at 6:30 pm (get there early, as you'll need to sign in with building security).
- Talks start at 7:00 pm, with drinks afterwards from 9:00 pm at The Banker (EC4).
We have reduced capacity for this event, but there will be plenty of people to discuss data science questions with.
Please unRSVP in good time if you realise you can't make it. We're limited by building security on the number of attendees, so please free up your place for your fellow community members.




