LLM Evaluation Mini-Hack - Build Better Benchmarks!

Name: LLM Evaluation Mini-Hack - Build Better Benchmarks!
Start: 2025-06-19T08:30:00-06:00
End: 2025-06-19T10:30:00-06:00
Location: Founder Central by Sweater

Hosted By

Bill M.

Details

We know that LLMs are generally getting better, but how do we know in what ways they're improving?

Join us for a hands-on mini-hackathon to brainstorm and prototype a new custom suite for evaluating and testing LLMs.

Current benchmarks (like MMLU or GSM8K) have become optimization targets—models are “studying for the test” instead of showing real understanding.

Come help RMAIIG build our own open-source LLM benchmark and evaluation suite, and test for the things you think are important!

We’ll explore new approaches to evaluation:

Context retention
Reasoning and grounding
Trust and subjectivity
Adversarial behavior and failure modes
Creativity
Front Range and Colorado-specific knowledge and skills (bicycle route optimization between brewpubs, hike planning, ski condition regression analysis)
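
To make the categories above concrete, here is a rough sketch of what a single test case and a tiny scoring loop might look like. It is only an illustration, not the framework we'll use at the event: the names `EvalCase`, `run_suite`, and `echo_model` are made up for this example, the model is a stand-in function rather than a real LLM call, and substring matching is a deliberately crude scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One benchmark item (hypothetical structure, for illustration only)."""
    category: str
    prompt: str
    expected_substring: str


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case through the model and return the pass rate per category."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for case in cases:
        answer = model(case.prompt)
        # Crude scoring: case-insensitive substring match against the expected answer.
        ok = case.expected_substring.lower() in answer.lower()
        totals[case.category] = totals.get(case.category, 0) + 1
        passes[case.category] = passes.get(case.category, 0) + int(ok)
    return {category: passes[category] / totals[category] for category in totals}


def echo_model(prompt: str) -> str:
    """Stand-in 'model' that just repeats the prompt (assumption: swap in a real LLM call)."""
    return prompt


CASES = [
    EvalCase(
        "context retention",
        "My bike is red. I rode it to three brewpubs today. What color is my bike?",
        "red",
    ),
    EvalCase(
        "colorado knowledge",
        "Which city is home to the University of Colorado's main campus?",
        "Boulder",
    ),
]

if __name__ == "__main__":
    for category, rate in run_suite(echo_model, CASES).items():
        print(f"{category}: {rate:.0%}")
```

Notice that the echo stand-in "passes" the context retention case simply because the answer already appears in the prompt. That is exactly the kind of scoring weakness worth hunting for when we sketch out better tests.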

Bring your laptop, favorite tools, and ideas.

We'll start out with a framework and split into small groups to sketch out and test ideas.

Start shaping a smarter, community-driven benchmark. And also, I'll bring the coffee!
