Semi-Supervised and Unsupervised Learning Approaches to Kaggle and Cybersecurity


Details
Tonight it's all about machine learning! Whether you're just getting started as an ML enthusiast or you're an expert with a few Kaggle medals under your belt, you will leave this event with at least a few takeaways.
We’ll be sending a Zoom link to this virtual event closer to the event date.
AGENDA
6pm | Opening Remarks + Announcements
~6:05pm | A Guide to Pseudolabelling: How to get a Kaggle medal with only one model
STANLEY ZHENG, Sir Winston Churchill & Athabasca University
~6:40pm | Topic Modeling for data discovery: A cybersecurity use case
HARINI KANNAN, Capsule8
------
TALK SUMMARY
A Guide to Pseudolabelling: How to get a Kaggle medal with only one model
Pseudolabelling is a semi-supervised learning technique that is popular on Kaggle among competitive data scientists and widely used in papers. Pseudolabelling is behind Amazon Alexa, Airbnb’s in-app message intent, and OpenAI’s state-of-the-art language model GPT-3. It allows unlabelled data to be leveraged as additional, confidently labelled training data, which helps models generalize better.
This presentation is a comprehensive guide to the applications of pseudolabelling. It covers the techniques that top competitive data scientists use alongside pseudolabelling, including augmentations, ensembling, and various training and validation schemes, as well as general tips and tricks. Finally, it analyses how these techniques were applied to the previous ImageNet state of the art and to real-world applications.
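For readers new to the idea, here is a minimal sketch of one pseudolabelling round. It is an illustration only, not the speaker's code: the nearest-centroid classifier, the toy 2-D points, the 0.9 confidence threshold, and every function name are assumptions chosen to keep the example small and dependency-free.

```python
import math

def centroids(points, labels):
    """Fit a toy nearest-centroid model: the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def predict_proba(model, point):
    """Turn distances to each centroid into class probabilities (softmax)."""
    scores = {c: math.exp(-math.dist(point, centre)) for c, centre in model.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def pseudolabel(points, labels, unlabelled, threshold=0.9):
    """Run one round: fit, label confident unlabelled points, then refit."""
    model = centroids(points, labels)
    extra_points, extra_labels = [], []
    for point in unlabelled:
        proba = predict_proba(model, point)
        best = max(proba, key=proba.get)
        # Keep only predictions the model is confident about (assumed threshold).
        if proba[best] >= threshold:
            extra_points.append(point)
            extra_labels.append(best)
    retrained = centroids(points + extra_points, labels + extra_labels)
    return retrained, extra_points, extra_labels

# Hypothetical toy data: two labelled clusters and three unlabelled points.
labelled = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
labels = ["a", "a", "b", "b"]
unlabelled = [(0.5, 1.0), (9.5, 8.0), (5.0, 4.5)]

retrained, extra_points, extra_labels = pseudolabel(labelled, labels, unlabelled)
print(extra_points, extra_labels)
```

The point halfway between the two clusters gets an even split of probability, so it falls below the threshold and is left out of the retraining set. In practice the threshold and the number of rounds are tuning choices, and the model would be a real classifier rather than this toy one.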
ABOUT THE SPEAKER
Stanley Zheng is a machine learning enthusiast and 3x Expert on Kaggle, ranked in the top 1150 in competitions. In his spare time, Stanley enjoys photography and competing in hackathons.
-----
TALK SUMMARY
Topic Modeling for data discovery: A cybersecurity use case
Topic modeling is a useful NLP technique for analyzing and classifying a large corpus of data. It helps cluster unstructured text data into meaningful groups. In cybersecurity, filtering huge data logs to fish out leaked credentials and passwords is a very time-consuming task for red teams. A red team consists of security professionals, typically ethical hackers, who act as adversaries in order to evaluate system security objectively. In this talk, we will go through the basics of LDA (Latent Dirichlet Allocation) topic modeling. Then, applying this technique to a real-world example, we’ll go through a hacker’s system logs and try to filter out useful data, such as passcodes and credentials, that can be used for further security analysis.
ABOUT THE SPEAKER
Harini Kannan is a Data Scientist at Capsule8, a cybersecurity company for Linux enterprise infrastructure, based in NYC. She uses statistical analysis, machine learning, and deep learning techniques to detect various Linux-based exploits and to profile user behavior. Her areas of research include interpretable ML, data privacy, and language modeling. She has spoken at conferences such as ODSC East (Boston), Data Science Salon (NYC), and Black Hat USA.
