Benchmarking LLMs


Details
Join us for the monthly Full Stack Data Science Meetup where we talk about the intersection of Data Science and Full Stack Development!
In this meetup, we'll discuss benchmarking LLMs.
Agenda:
5:30-6:15 - Networking and refreshments
6:15-6:30 - Introductions
6:30-7:15 - Understanding challenges with LLM Evaluations
7:15-8:00 - Assessing GPT's Accuracy and Bias with Social Media and Survey Data
Presentation: Understanding challenges with LLM Evaluations
Description: When a new LLM is released, it is typically published with a table showing its state-of-the-art performance on MMLU or another standard "benchmark". These numbers, however, have many limitations, both in how they are calculated and in how they translate to practical tasks. This talk will cover the main ways LLMs are evaluated, both for performance and for safety, and the key factors to consider when reviewing published statistics. It will end with recommendations for developing your own evaluations, using both traditional metrics and LLM-as-a-Judge approaches.
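(For readers new to the term, an LLM-as-a-Judge evaluation asks a second model to grade another model's output. The sketch below is a minimal illustration of that pattern, not material from the talk; the prompt wording, the gpt-4o-mini judge model, and the 1-5 scale are all assumptions.)

```python
# Minimal LLM-as-a-Judge sketch (illustrative only; prompt, judge model,
# and 1-5 scale are assumptions, not taken from the talk).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully correct). Reply with only the number."""

def judge(question: str, reference: str, candidate: str) -> int:
    # Ask the judge model to score the candidate answer against the reference.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example: score one candidate answer against a reference.
print(judge("What is the capital of France?", "Paris", "It is Paris."))
```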
Speaker: Anastassia Kornilova is the Director of Machine Learning and Founding Engineer at Trustible AI, a platform for Responsible AI Governance. Her projects have included creating a taxonomy of AI Risks and Mitigations and developing a model for analyzing documentation readiness for upcoming AI laws and standards. Previously, she worked on large Machine Learning systems across sectors, with a focus on Legal NLP. She holds a degree in Computer Science from Carnegie Mellon University.
Presentation: Assessing GPT's Accuracy and Bias with Social Media and Survey Data
Description: Traditional text classification methods rely heavily on large amounts of manual data labeling to achieve acceptable accuracy. This manual process is not only costly but also impractical for projects that require quick turnarounds. Trained on more text than a person can read in a lifetime, Large Language Models (LLMs) have the potential to perform zero-shot classification with minimal human effort. However, before employing these models responsibly, it is crucial to understand their accuracy and potential biases.
This project evaluated GPT on two distinct human-labeled benchmarks from NORC: 1) abortion-related social media posts from the General Social Media Archive (GSMA), each labeled as pro-choice or pro-life according to the attitude expressed, and 2) an AP-NORC open-ended survey that asked respondents what problems the government should be working on, with more than 10,000 records manually labeled into 7 major themes (e.g., economy, health, and foreign policy) and 100 subcategories (e.g., COVID-19, gun issues, and abortion).
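(As a loose sketch of what zero-shot labeling of the stance benchmark could look like in code, not NORC's actual pipeline; the prompt, model name, and label strings are assumptions:)

```python
# Sketch of zero-shot stance classification with an LLM (illustrative only;
# the prompt and label set are assumptions, not the project's actual pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["pro-choice", "pro-life"]

def classify_stance(post: str) -> str:
    # Build a zero-shot prompt: no labeled examples, just the task description.
    prompt = (
        "Classify the attitude of the following social media post about abortion "
        f"as one of {LABELS}. Reply with only the label.\n\nPost: {post}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```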
GPT achieved remarkable agreement with human labeling on both benchmarks and outperformed the machine learning classifiers we had previously trained on our own data. However, GPT's performance varied with respondents' political leaning and religion, raising potential concerns about bias.
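(To make "agreement with human labeling" and per-subgroup performance concrete, here is a generic sketch using scikit-learn; the toy data frame and column names are placeholders, not the project's data:)

```python
# Sketch of comparing model labels to human labels, overall and by subgroup
# (toy data and column names are placeholders, not the NORC benchmarks).
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

df = pd.DataFrame({
    "human_label": ["economy", "health", "economy", "foreign policy"],
    "gpt_label":   ["economy", "health", "health",  "foreign policy"],
    "party":       ["D", "R", "R", "D"],
})

# Overall agreement between GPT and human coders.
print("accuracy:", accuracy_score(df["human_label"], df["gpt_label"]))
print("kappa:   ", cohen_kappa_score(df["human_label"], df["gpt_label"]))

# Check whether accuracy differs across respondent subgroups (a simple bias probe).
for group, sub in df.groupby("party"):
    print(group, accuracy_score(sub["human_label"], sub["gpt_label"]))
```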
Speaker: Haoyu Shi is a Sr. Data Analyst at NORC's Social Data Collaboratory. He provides evidence-based insights from social media data using a range of computational methods, including Natural Language Processing (NLP), spatial analysis, image detection, and social network analysis. With the rise of Large Language Models (LLMs), Haoyu is passionate about integrating AI into public opinion research, focusing on innovations such as developing interview bots, classifying open-ended survey responses, and detecting misinformation and disinformation. He holds a Master's degree in Quantitative Methods and Social Analysis from the University of Chicago and a Bachelor's degree in Geographical Information Science (GIS) from the University of California, Santa Barbara.
