Entity Resolution – A ML Approach for When df.drop_duplicates() Isn’t Enough

Hosted By
Jonas M.

Details

Description:

If you take a pandas tutorial, you might think that all of your duplicate data problems can be fixed with a quick call to df.drop_duplicates(). However, real-world data often comes from a wide range of disparate sources and can contain multiple references to the same entity that are not quite identical. A typo here, a different way of writing an address there, and deduplication can quickly become a nightmare. Sure, you can try to create rules to catch some of the most common differences, but the sheer number of variations is often beyond the reach of rules alone. Fortunately, a rules-based approach is not your only option…
In this presentation we will walk through:

  • The importance of entity resolution (ER) and some of its applications and challenges
  • An overview of common steps involved in the ER process
  • A deep-dive into a real-world example utilizing the helpful RecordLinkage Python library
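
To make the problem concrete, here is a minimal sketch of why exact-match deduplication (the idea behind df.drop_duplicates()) misses near-duplicates. It is not the RecordLinkage workflow from the talk: it swaps in difflib from the Python standard library, and the records, the similarity function, and the 0.75 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: the first two are exact duplicates; the third is the
# same person with a typo in the name and an abbreviated street name.
records = [
    ("Jane Smith", "12 Oak Street"),
    ("Jane Smith", "12 Oak Street"),
    ("Jane Smyth", "12 Oak St."),
    ("Carlos Diaz", "400 Pine Avenue"),
]

# Exact-match deduplication: only byte-for-byte identical rows collapse.
exact_unique = list(dict.fromkeys(records))


def similarity(a, b):
    """Average character-level similarity across the fields of two records."""
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


THRESHOLD = 0.75  # assumed cutoff; a real project would tune or learn this

# Compare every remaining pair and keep the ones that look like the same entity.
candidate_pairs = [
    (a, b)
    for a, b in combinations(exact_unique, 2)
    if similarity(a, b) >= THRESHOLD
]

print(len(exact_unique), "records survive exact deduplication")
for a, b in candidate_pairs:
    print("possible match:", a, "<->", b)
```

Exact deduplication leaves three records, while the fuzzy comparison flags the Smith/Smyth pair as a likely match. Comparing every pair scales quadratically, which is one reason ER pipelines typically narrow the candidate pairs before scoring them.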

Speaker:

Emily Lynn is a Consultant in the Data & Analytics practice at Slalom. She is a data scientist who is passionate about addressing seemingly insurmountable problems with intelligent technical solutions, while making sure that every step along the way is understandable. Her training in physics and machine learning at several of the top research facilities in the world taught her to face novel and complex problems confidently and systematically, while her experience in consulting has broadened her perspective and skillset.

Event Details (In-person/Online):

This is a hybrid event. You can attend either in person or online through Zoom.

In-person:
The in-person event will be held at the Slalom office on the 8th floor, suite 850.
We will have people in the lobby to ensure you can enter the building and the office. We encourage you to come to the office.

Parking details:
On-street parking is free after 5 PM on weekdays.
Food and refreshments will be provided for the in-person event!

Online:
The meeting URL should bypass the prompts to input the meeting ID and password. However, full meeting details are provided below.

Meeting ID: 953 7971 3251

Meeting Password: stlmlds

Join Zoom Meeting with this URL (bypasses ID and password):
https://slalom.zoom.us/j/95379713251?pwd=QitwOWo5Tm1kUWRjK1IvcFFmMGY3UT09

Networking - 6:30 PM to 7:00 PM
Presentation - 7:00 PM to 7:45 PM
Q&A - 7:45 PM to 8:00 PM
Networking - 8:00 PM to 8:30 PM

COVID-19 safety measures

Event will be indoors
There are no enforced protocols. You're welcome to be as preventive as you wish.
The event host is instituting the above safety measures for this event. Meetup is not responsible for ensuring, and will not independently verify, that these precautions are followed.
St. Louis Machine Learning & Data Science
Slalom Consulting
7800 Forsyth Blvd Ste 850 · Clayton, MO