Data Science Canberra Meeting

Hosted By
Alex A.
Details

Exploiting Redundancy, Recurrency and Parallelism: How to Link Millions of Names and Addresses with Ten Lines of Code in Ten Minutes

Accurate and efficient record linkage is an open challenge of particular relevance to Australian government agencies, which recognise that so-called wicked social problems are best tackled through partnerships founded on large-scale data fusion. Names and addresses are the attributes most commonly used to link data from different government agencies. We present a new method for linking names and addresses between two large datasets, both of which have significant data quality issues. The most common way of dealing with quality issues is to standardise the raw data before linking. If a mistake is made during standardisation, however, it is usually impossible to recover from it and link the records correctly. By contrast, we show how data linkage can be implemented without standardisation. Empirical results show that approximately 91% of the links created between two large datasets from two government agencies were correct, despite significant data quality issues. Besides being accurate, our method is highly efficient: linking two datasets, each containing millions of rows, takes less than ten minutes. It is also easy to maintain, as it can be implemented with ten SQL statements.

Dr Yuhang Zhang - AUSTRAC

Canberra R Users Group
PwC Canberra
28 Sydney Avenue, Forrest · Canberra