Date: 30 May 2026 (Sat)
Time: 15:30 - 18:00 HKT
Coordinator: Ming Tou, Alex Au
Light Refreshments will be available!
This is a workshop on data cleaning in Python. Please bring a laptop with Python 3 installed and enough battery to last the session! The winning team will receive a small prize at the next meetup in recognition of their achievement.
Please fill in the form for admission:
https://forms.gle/MGMGFiMZQGY4CZuP7
Mind the Data Gap
This model will be delayed due to: missing values, outliers, and delimiter congestion.
In this hands-on HKPUG workshop, we are turning a Python data cleaning exercise into a gamified team sprint.
You will work with a synthetic Hong Kong-inspired transport dataset. The data is messy on purpose: missing values, weird outliers, inconsistent categories, URL-encoded station names, confusing delimiters, and fields that look useless until you extract the hidden signal.
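To give a feel for these kinds of fixes, here is a minimal sketch using only the Python standard library. The rows, the field names (`station`, `line`, `delay_min`), and the 120-minute cap are all made up for illustration; the actual workshop dataset and its columns are not described here and may look quite different.

```python
from statistics import median
from urllib.parse import unquote

# Hypothetical rows: the real dataset and its column names are assumptions.
rows = [
    {"station": "Kowloon%20Tong", "line": " east rail ", "delay_min": "4"},
    {"station": "Tsim%20Sha%20Tsui", "line": "Tsuen Wan", "delay_min": ""},
    {"station": "Mong%20Kok", "line": "TSUEN WAN", "delay_min": "900"},
    {"station": "Admiralty", "line": "east rail", "delay_min": "6"},
]


def to_float(text):
    """Return a float, or None when the field is blank or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean(rows, cap=120.0):
    """Decode names, normalise categories, cap outliers, impute missing delays."""
    delays = [to_float(row["delay_min"]) for row in rows]
    # Cap implausible values at an assumed upper bound (120 minutes here).
    capped = [None if d is None else min(d, cap) for d in delays]
    # Fill blanks with the median of the values that are present.
    fill = median(d for d in capped if d is not None)
    cleaned = []
    for row, delay in zip(rows, capped):
        cleaned.append({
            "station": unquote(row["station"]).strip(),   # "Mong%20Kok" -> "Mong Kok"
            "line": row["line"].strip().title(),          # unify case and whitespace
            "delay_min": fill if delay is None else delay,
            "delay_was_missing": delay is None,           # keep the gap as a signal
        })
    return cleaned


cleaned = clean(rows)
```

Keeping a `delay_was_missing` flag is one possible design choice: a blank field can itself carry information, so imputing the value without recording the gap may throw away signal.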
Your mission is simple:
Clean the data. Extract better features. Run the scorer. Beat the model.
You do not need to tune the model.
You need to rescue the data.
How the Challenge Works
Teams will write a Python data transformation script to clean and enrich the dataset.
Our scoring pipeline will then:
- Run your transformation on the training and test data
- Remove any label / target leakage columns
- Train a fixed Gradient Boosting model
- Evaluate the model using proxy metrics such as F1 and ROC-AUC
- Convert the result into a data cleaning score
This is not meant to be a perfect data quality metric. It is an accessible proxy: if your cleaning and feature extraction help the model perform better, your team gets a better score.
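Two of the scoring steps above, dropping leakage columns and computing F1, can be illustrated in plain Python. This is only a sketch of the general idea, not the workshop's scorer: the function names, the `is_delayed` label column, and the sample predictions are hypothetical, and the model training step is left out entirely.

```python
def drop_columns(rows, leakage):
    """Return copies of the rows without the named leakage columns."""
    return [{k: v for k, v in row.items() if k not in leakage} for row in rows]


def f1_score(y_true, y_pred):
    """F1 for binary labels: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: "is_delayed" stands in for a label column that must not
# be visible to the model as a feature.
rows = [
    {"station": "Admiralty", "hour": 8, "is_delayed": 1},
    {"station": "Mong Kok", "hour": 14, "is_delayed": 0},
]
features = drop_columns(rows, leakage={"is_delayed"})

# Made-up labels and predictions, just to exercise the metric.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
```

In this made-up example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is also 2/3.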
Capacity: 60 (teams of 4)
Venue Info:
City University of Hong Kong, Kowloon Tong (Exact Location TBC)
Rundown:
30th May
15:30 – 15:40 Opening Remarks, Team Formation & GitHub Setup
15:40 – 15:55 Briefing: Data Traffic Report, Challenge Rules & Scoring Pipeline
15:55 – 16:20 Demo: Messy Dataset, Data Cleaning Strategy & `transform.py`
16:20 – 17:10 Team Sprint Part 1: Missing Values, Outliers & Category Cleanup
17:10 – 18:00 Team Sprint Part 2: Feature Extraction, Hidden Signals & Scoring Iterations
18:00 – Networking outside City University
13th June
- 23:59 Winning Team Announcement
Audience pre-requisite:
- Hardware: MUST bring a laptop with Python 3 installed and a code editor (VS Code, PyCharm, etc.).
- Skills: Basic to intermediate Python knowledge is recommended (handling JSON, using the `requests` library, and writing basic `if`/`else` statements and `while` loops).
How to join?
- Fill in the form for admission: https://forms.gle/MGMGFiMZQGY4CZuP7
- Click "Attend" on this page