
Missing Data Imputation Using Supervised Machine Learning

Hosted By
Yanchang Z.

Details

This is a joint event with the Statistical Society of Australia (SSA) Canberra Branch and the IEEE ACT Section. The talk will consist of a 45-minute presentation followed by a 15-minute Q&A.

Topic: Missing Data Imputation Using Supervised Machine Learning
Speaker: Marcus Suresh, Department of Industry, Science, Energy and Resources
Date and time: 6-7pm Tuesday 28 July 2020
Venue: Virtual via Zoom
RSVP: https://www.meetup.com/CanberraDataSci/events/266992810/
After you RSVP, please also register for the Zoom meeting under "Online event" in the right-side panel of the event page.

Abstract:
Incomplete data are common and can degrade statistical inference, often affecting evidence-based policy making. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), an Australian Government national data asset. In this paper, motivated by helping BLADE practitioners select and implement advanced imputation methods with a solid understanding of the impact different methods have on data accuracy and reliability, we implement and examine the performance of data imputation techniques based on 12 machine learning algorithms, ranging from linear regression to neural networks. We compare the performance of these algorithms and assess the impact of various settings, including the number of input features and the length of the time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain high imputation performance relative to the benchmark, linear regression, across a range of performance metrics. Among them, we recommend the extra trees regressor for its accuracy and computational efficiency.
Link to paper: https://link.springer.com/chapter/10.1007/978-3-030-35288-2_18
Link to PyImpuyte: https://pypi.org/project/PyImpuyte/
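
Illustration (not from the paper or from PyImpuyte): the core idea in the abstract, treating the feature with missing values as a prediction target and training a supervised model on the rows where it is observed, can be sketched with the benchmark method, linear regression. The sketch below is a minimal, standard-library-only example using a single predictor. The function names, field names ("employees", "turnover") and toy records are hypothetical and are not BLADE data.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance; cannot fit a slope")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_with_regression(rows, target, predictor):
    """Fill None values of `target` with predictions from complete rows."""
    # Train only on rows where the target feature is observed.
    observed = [r for r in rows if r[target] is not None]
    intercept, slope = fit_simple_ols(
        [r[predictor] for r in observed],
        [r[target] for r in observed],
    )
    # Return new records; the input rows are left unchanged.
    return [
        {**r, target: intercept + slope * r[predictor]}
        if r[target] is None
        else dict(r)
        for r in rows
    ]


# Hypothetical toy records: turnover is missing for the third firm.
firms = [
    {"employees": 10, "turnover": 25.0},
    {"employees": 20, "turnover": 45.0},
    {"employees": 30, "turnover": None},
    {"employees": 40, "turnover": 85.0},
]

imputed = impute_with_regression(firms, target="turnover", predictor="employees")
print(imputed[2])
```

Swapping the fitted line for an ensemble regressor, and scoring predictions on values that were deliberately masked, would give the kind of comparison the abstract describes; that step is left out here to keep the sketch dependency-free.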

About the Speaker:
Marcus is a Data Scientist and Economist in the Analysis and Insights Division (AID) at the Department of Industry, Science, Energy and Resources (DISER) and a former Visiting Scientist at CSIRO - Data61. He specialises in applying data science techniques to structured and unstructured data to support the advancement of public policy at DISER.

Marcus has a wealth of experience across several Commonwealth Government agencies. He started his career as an Economist at the Commonwealth Treasury, where he worked on Financial Market and Taxation policy, before joining the Department of Education and Training, where he provided economic advice to support the then Government’s Higher Education Reform Bill. Marcus was seconded to the Behavioural Economics Team of the Australian Government (BETA) in the Department of the Prime Minister and Cabinet, where he co-authored a randomised controlled trial with the ATO investigating the effect of behavioural treatments on compliance with the Deferred GST Scheme.

Marcus is a Master of Data Science candidate at the University of Sydney and holds a Master of Public Policy (Economic Policy) from the ANU and a Bachelor of Economics (Hons) and Commerce from Murdoch University. His research interests are in computer vision and natural language processing.

LinkedIn profile:
https://au.linkedin.com/in/marcus-suresh

Website links:
SSA Canberra: https://statsoc.org.au/event-3897088?CalendarViewType=1&SelectedDate=7/3/2020
Meetup: https://www.meetup.com/CanberraDataSci/events/266992810/

Canberra Data Scientists