Entity Resolution – A ML Approach for When df.drop_duplicates() Isn’t Enough


Details
Description:
If you take a pandas tutorial, you might think that every duplicate-data problem can be fixed with a quick call to df.drop_duplicates(). However, real-world datasets are often assembled from a wide range of disparate sources and can contain multiple references to the same entity that are not quite identical. A typo here, a different way of writing an address there, and deduplication quickly becomes a nightmare. Sure, you can write rules to catch some of the most common differences, but the sheer number of possible variations is often beyond the reach of rules alone. Fortunately, a rules-based approach is not your only option…
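To make the problem concrete, here is a minimal sketch with made-up records (the names and addresses are hypothetical, not taken from the talk). It shows that df.drop_duplicates() removes only rows that match exactly, so a typo or an abbreviated address slips through.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 are exact copies; row 2 most likely
# refers to the same person, but with a typo and an abbreviated address.
df = pd.DataFrame(
    {
        "name": ["Jane Smith", "Jane Smith", "Jane Smyth", "Bob Jones"],
        "address": ["12 Main Street", "12 Main Street", "12 Main St.", "4 Oak Avenue"],
    }
)

# drop_duplicates() compares rows for exact equality only.
deduped = df.drop_duplicates()

print(f"rows before: {len(df)}, rows after: {len(deduped)}")
print(deduped)
```

Only the exact copy is dropped; "Jane Smyth" at "12 Main St." survives as if it were a separate entity.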
In this presentation we will walk through:
- The importance of entity resolution (ER) and some of its applications and challenges
- An overview of common steps involved in the ER process
- A deep-dive into a real-world example utilizing the helpful RecordLinkage Python library
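As a rough illustration of what an ER process can look like, the sketch below walks through steps that are commonly described for ER pipelines: normalizing values, blocking to limit the candidate pairs, comparing fields, and classifying pairs as matches. This is an assumption about the general shape of such a pipeline, not the speaker's method, and it deliberately uses only the Python standard library rather than the RecordLinkage API. The records, the helper names (normalize, block_key, candidate_pairs), and the 0.75 threshold are all hypothetical; in an ML approach, a trained classifier would take the place of the fixed threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records for illustration only.
records = [
    {"id": 1, "name": "Jane Smith", "address": "12 Main Street"},
    {"id": 2, "name": "Jane Smyth", "address": "12 Main St."},
    {"id": 3, "name": "Bob Jones", "address": "4 Oak Avenue"},
    {"id": 4, "name": "Jon Smith", "address": "98 Elm Road"},
]


def normalize(text):
    """Step 1, preprocessing: lowercase, drop periods, collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())


def block_key(record):
    """Step 2, blocking: group records by the first letter of the surname."""
    return normalize(record["name"]).split()[-1][0]


def candidate_pairs(rows):
    """Yield only the pairs that share a block, instead of all possible pairs."""
    blocks = {}
    for row in rows:
        blocks.setdefault(block_key(row), []).append(row)
    for group in blocks.values():
        yield from combinations(group, 2)


def similarity(a, b):
    """Step 3, comparison: string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


THRESHOLD = 0.75  # assumed cut-off; a trained classifier could replace it

# Step 4, classification: average the field similarities and apply the cut-off.
matches = []
for left, right in candidate_pairs(records):
    name_score = similarity(left["name"], right["name"])
    address_score = similarity(left["address"], right["address"])
    if (name_score + address_score) / 2 >= THRESHOLD:
        matches.append((left["id"], right["id"]))

print(matches)
```

Blocking keeps record 3 from ever being compared with the others, and the averaged score links records 1 and 2 while keeping "Jon Smith" at a different address separate.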
Speaker:
Emily Lynn is a Consultant in the Data & Analytics practice at Slalom. She is a data scientist who is passionate about addressing seemingly insurmountable problems with intelligent technical solutions, while making sure that every step along the way is understandable. Her training in physics and machine learning at several of the top research facilities in the world taught her to face novel and complex problems confidently and systematically, while her experience in consulting has broadened her perspective and skillset.
Event Details (In-person/Online):
This is a hybrid event. You can attend either in person or online through Zoom.
In-person:
The in-person event will be held at the Slalom office on the 8th floor, suite 850.
We will have people in the lobby to ensure you can enter the building and the office. We encourage you to come to the office.
Parking details:
On-street parking is free after 5 PM on weekdays.
Food and refreshments will be provided for in-person attendees!
Online:
The meeting URL should bypass the prompts to input the meeting ID and password. However, full meeting details are provided below.
Meeting ID: 953 7971 3251
Meeting Password: stlmlds
Join Zoom Meeting with this URL (bypasses ID and password):
https://slalom.zoom.us/j/95379713251?pwd=QitwOWo5Tm1kUWRjK1IvcFFmMGY3UT09
Networking - 6:30 PM to 7:00 PM
Presentation - 7:00 PM to 7:45 PM
Q&A - 7:45 PM to 8:00 PM
Networking - 8:00 PM to 8:30 PM