LLM Evaluation Mini-Hack - Build Better Benchmarks!

Hosted By
Bill M.
Details

We know that LLMs are getting better in general, but how do we know in what ways they're getting better?

Join us for a hands-on mini-hackathon to brainstorm and prototype a new custom suite for evaluating and testing LLMs.

Current benchmarks (like MMLU or GSM8K) have become optimization targets—models are “studying for the test” instead of showing real understanding.

Come and help RMAIIG build our own open-source LLM Benchmark and evaluation suite, and test for the things you think are important!

We’ll explore new approaches to evaluation:

  • Context retention
  • Reasoning and grounding
  • Trust and subjectivity
  • Adversarial behavior and failure modes
  • Creativity
  • Front Range and Colorado-specific knowledge and skills (bicycle route optimization between brewpubs, hike planning, ski condition regression analysis)
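
To give a feel for what a single entry in such a suite might look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `EvalCase` structure, the `run_suite` function, the substring-match scoring rule, and the stub `echo_model` are hypothetical and stand in for whatever framework the group actually starts from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One hypothetical benchmark item."""
    category: str  # e.g. "context_retention"
    prompt: str    # text sent to the model
    expected: str  # substring the answer must contain to pass


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Return the fraction of passed cases for each category."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        answer = model(case.prompt)
        total[case.category] = total.get(case.category, 0) + 1
        # Assumed scoring rule: case-insensitive substring match.
        if case.expected.lower() in answer.lower():
            passed[case.category] = passed.get(case.category, 0) + 1
    return {cat: passed.get(cat, 0) / n for cat, n in total.items()}


def echo_model(prompt: str) -> str:
    # Stub that repeats the prompt; swap in a real LLM call here.
    return prompt


CASES = [
    EvalCase("context_retention", "My bike is red. What color is my bike?", "red"),
    EvalCase("reasoning", "What is 17 + 25?", "42"),
]

scores = run_suite(echo_model, CASES)
print(scores)
```

The stub passes the context-retention case only because the answer already appears in the prompt, which is a small reminder that the scoring rule matters as much as the test cases.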

Bring your laptop, favorite tools, and ideas.

We'll start out with a framework and split into small groups to sketch out and test ideas.

Start shaping a smarter, community-driven benchmark. And also, I'll bring the coffee!

AI/ML Engineering (AIE an RMAIIG Subgroup)
Founder Central by Sweater
2000 Central Ave #100 · Boulder, CO
FREE