Data Science Canberra Meeting


Details
Exploiting Redundancy, Recurrency and Parallelism: How to Link Millions of Names and Addresses with Ten Lines of Code in Ten Minutes
Accurate and efficient record linkage is an open challenge of particular relevance to Australian Government agencies, which recognise that so-called wicked social problems are best tackled by forming partnerships founded on large-scale data fusion. Names and addresses are the most common attributes on which data from different government agencies can be linked. We present a new method to link names and addresses between two large datasets, both of which have significant data quality issues. The most common approach to dealing with quality issues is to standardise raw data prior to linking. If a mistake is made during standardisation, however, it is usually impossible to recover from it and perform the linkage correctly. By contrast, we show how data linkage can be implemented without standardisation. Empirical results show that approximately 91% of the links created between two large datasets from two government agencies were correct, despite significant data quality issues. Besides being accurate, our method is also highly efficient: linking two datasets, each containing millions of rows, takes less than ten minutes. It is also easy to maintain: it can be implemented with ten SQL statements.
Dr Yuhang Zhang - AUSTRAC
