Chunked Text Processing and Record Linkage

Hosted By
Hannes M. and David J.

Details

This time we will have two speakers from Statistics Netherlands (CBS), the Dutch national statistics office.

  1. Chunked, dplyr based processing for text files
    Edwin de Jonge, https://github.com/edwindj

R is a great tool, but processing large text files of data with it is cumbersome. chunked lets you process large text files with dplyr while loading only part of the data into memory. It builds on the excellent R package LaF. Processing commands are written in dplyr syntax, and chunked (using LaF) takes care of processing the file chunk by chunk, using far less memory than loading it all at once. chunked is useful for selecting columns, mutating columns and filtering rows.
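The chunk-by-chunk idea described above can be sketched in a few lines. This is a language-agnostic illustration written in Python with only the standard library, not chunked's or LaF's actual API; the function name `process_in_chunks`, the column names and the income threshold are all made up for the example.

```python
import csv
import io
from itertools import islice


def process_in_chunks(src, dst, chunk_size=2):
    """Stream rows from src to dst, holding at most chunk_size rows in memory.

    Hypothetical helper for illustration only. Returns the number of chunks read.
    """
    reader = csv.DictReader(src)
    # "Select columns": only name and a derived income column are written out.
    writer = csv.DictWriter(dst, fieldnames=["name", "income_k"], lineterminator="\n")
    writer.writeheader()
    n_chunks = 0
    while True:
        # Pull the next chunk_size rows; an empty list means end of file.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        n_chunks += 1
        for row in chunk:
            income = float(row["income"])
            if income >= 20000:  # "filter rows" (threshold is an assumption)
                # "mutate": express income in thousands
                writer.writerow({"name": row["name"], "income_k": income / 1000})
    return n_chunks


# In-memory stand-ins for a large input file and an output file.
source = io.StringIO("name,age,income\nann,34,30000\nbob,51,15000\ncem,28,42000\n")
target = io.StringIO()
chunks_read = process_in_chunks(source, target)
result = target.getvalue()
```

Because each chunk is written out before the next one is read, memory use depends on the chunk size rather than on the size of the file.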

  2. reclin: a toolkit for record linkage and deduplication
    Jan van der Laan, https://github.com/djvanderlaan

Record linkage, entity resolution, data matching: these are all terms for determining which records belong to the same entity or object. When all records are in the same dataset, this is also called deduplication. When a unique identifier is available and has been registered without errors, this is simple. Often, however, one has to work with name, address and date fields that contain errors such as misspellings. The reclin package provides tools to help with this and implements one of the most widely used methods: probabilistic record linkage. I will try to explain the general methodology of record linkage and show how reclin can be used.
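As a rough illustration of what probabilistic record linkage does, the sketch below scores a pair of records in the Fellegi-Sunter style: each field contributes a positive weight when the two records agree on it and a negative weight when they disagree. It is written in Python and is not reclin's API. The m and u probabilities (the chance that a field agrees for true matches and for non-matches) are invented for the example; in practice they are usually estimated from the data.

```python
import math

# Assumed (hypothetical) probabilities per field:
#   m = P(field agrees | records are the same entity)
#   u = P(field agrees | records are different entities)
M_U = {
    "name": (0.90, 0.05),
    "address": (0.80, 0.10),
    "birth_date": (0.95, 0.01),
}


def match_weight(a, b):
    """Sum of per-field log2 likelihood ratios for a pair of records."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement counts against
    return total


rec_a = {"name": "Jansen", "address": "Dorpsstraat 1", "birth_date": "1980-02-03"}
# Same person with a misspelled name.
rec_b = {"name": "Janssen", "address": "Dorpsstraat 1", "birth_date": "1980-02-03"}
# A different person.
rec_c = {"name": "de Vries", "address": "Kerkweg 7", "birth_date": "1975-11-30"}

weight_same = match_weight(rec_a, rec_b)
weight_diff = match_weight(rec_a, rec_c)
```

With these assumed values, the misspelled pair still gets a positive total weight because address and birth date agree, while the unrelated pair gets a negative one. Pairs above a chosen threshold would be treated as links.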

amst-R-dam
Amsterdam Public Library (OBA)
Oosterdokskade 143 1011 Amsterdam · Amsterdam