Chunked Text Processing and Record Linkage


Details
This time we will have two speakers from Statistics Netherlands (CBS), the Dutch national statistics office.
- chunked: dplyr-based processing for large text files
Edwin de Jonge, https://github.com/edwindj
R is a great tool, but processing large text files with data is cumbersome. chunked helps you process large text files with dplyr while loading only part of the data into memory. It builds on the excellent R package LaF. Processing commands are written in dplyr syntax, and chunked (using LaF) takes care that the file is processed chunk by chunk, using far less memory than loading it all at once. chunked is useful for selecting columns, mutating columns and filtering rows.
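To give a feel for the chunk-by-chunk idea, here is a minimal sketch in plain Python. It is not the chunked or LaF API; the function name `process_in_chunks`, the column names and the sample data are all made up for illustration. It reads a few rows at a time, filters rows, selects columns and derives a new column, so that only one chunk is held in memory at any moment.

```python
import csv
import io
from itertools import islice


def process_in_chunks(src, dst, chunk_size=2):
    """Filter, select and mutate rows while holding at most chunk_size rows in memory."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["name", "income_k"], lineterminator="\n")
    writer.writeheader()
    while True:
        # Pull the next chunk of rows from the input; stop when the file is exhausted.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        # filter: keep adults; select: drop unused columns; mutate: income in thousands.
        processed = [
            {"name": row["name"], "income_k": float(row["income"]) / 1000}
            for row in chunk
            if int(row["age"]) >= 18
        ]
        writer.writerows(processed)


# Hypothetical sample data standing in for a large text file.
src = io.StringIO("name,age,income\nAnna,34,52000\nBen,12,0\nCarl,51,61000\n")
dst = io.StringIO()
process_in_chunks(src, dst)
result = dst.getvalue()
print(result)
```

Because each chunk is written out before the next one is read, memory use depends on the chunk size rather than on the size of the file.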
- reclin: a toolkit for record linkage and deduplication
Jan van der Laan, https://github.com/djvanderlaan
Record linkage, entity resolution and data matching are all terms for determining which records belong to the same entity or object. When all records are located in the same dataset, this is also called deduplication. When a unique identifier is available and registered without errors, this is simple. Often, however, one has to work with name, address and date fields that contain errors such as misspellings. The reclin package provides tools to help with this and implements one of the most widely used methods: probabilistic record linkage. I will try to explain the general methodology of record linkage and show how reclin can be used.
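As a rough illustration of probabilistic record linkage, the sketch below scores every pair of records in the style of the Fellegi-Sunter model: each field adds a positive weight when it agrees and a negative weight when it disagrees. It is plain Python, not the reclin API. The m- and u-probabilities, the threshold and the sample records are assumptions chosen for the example; in practice such probabilities are usually estimated from the data, for example with an EM algorithm.

```python
import math
from itertools import product

# Assumed probabilities (illustrative values, not estimated from data):
# M[f] = P(field f agrees | records are a true match)
# U[f] = P(field f agrees | records are not a match)
M = {"name": 0.9, "city": 0.9, "birth_year": 0.95}
U = {"name": 0.01, "city": 0.2, "birth_year": 0.05}


def weight(a, b):
    """Sum of log-likelihood-ratio weights over all compared fields."""
    total = 0.0
    for field in M:
        if a[field] == b[field]:
            total += math.log2(M[field] / U[field])
        else:
            total += math.log2((1 - M[field]) / (1 - U[field]))
    return total


# Hypothetical records; "de vreis" is a misspelling of "de vries".
left = [
    {"id": 1, "name": "jansen", "city": "utrecht", "birth_year": 1980},
    {"id": 2, "name": "de vries", "city": "leiden", "birth_year": 1975},
]
right = [
    {"id": "a", "name": "jansen", "city": "utrecht", "birth_year": 1980},
    {"id": "b", "name": "de vreis", "city": "leiden", "birth_year": 1975},
    {"id": "c", "name": "bakker", "city": "delft", "birth_year": 1990},
]

THRESHOLD = 0.0  # assumed cut-off: pairs scoring above it are treated as links

# Compare all pairs and keep those whose total weight exceeds the threshold.
links = [
    (a["id"], b["id"])
    for a, b in product(left, right)
    if weight(a, b) > THRESHOLD
]
print(links)
```

The pair with the misspelled name is still linked, because agreement on city and birth year outweighs the disagreement on name. Comparing all pairs grows quickly with dataset size, which is why real linkage tools typically restrict comparisons to candidate pairs first.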
