How to run benchmarks for LLMs at scale


Abstract
With the rapid growth of large language models (LLMs), benchmarking has become a critical but often misunderstood step in model evaluation and deployment. Traditional leaderboards offer limited insight into how models perform under real-world constraints. In this talk, we'll explore how to design and run LLM benchmarks at scale across multiple models, hardware platforms, and load configurations. We'll cover how to build reproducible, scalable benchmarking pipelines that surface meaningful trade-offs between latency, throughput, cost, and accuracy. As part of this session, we'll also dive into Red Hat's newly launched Third-Party Validated Models program and share how we conducted large-scale benchmarking to support it.
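
To make the idea of sweeping models and load configurations concrete, here is a minimal sketch of what such a benchmarking loop might look like. It assumes an OpenAI-compatible completions endpoint (for example, a locally served model), and the endpoint URL, model names, concurrency levels, and use of the httpx client are all illustrative assumptions, not details from the talk or the Validated Models program.

```python
"""Minimal sketch: sweep models and concurrency levels against an
OpenAI-compatible completions endpoint, recording latency and throughput.
All endpoint/model/concurrency values below are placeholder assumptions."""
import asyncio
import statistics
import time

import httpx  # assumed async HTTP client; any equivalent works

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local server
MODELS = ["model-a", "model-b"]                    # placeholder model names
CONCURRENCY_LEVELS = [1, 8, 32]                    # load configurations to sweep
PROMPT = "Summarize the benefits of reproducible benchmarking."


async def one_request(client: httpx.AsyncClient, model: str) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(
        ENDPOINT,
        json={"model": model, "prompt": PROMPT, "max_tokens": 128},
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


async def run_level(model: str, concurrency: int, total: int = 64) -> dict:
    """Fire `total` requests under a bounded concurrency and aggregate metrics."""
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:

        async def bounded() -> float:
            async with sem:
                return await one_request(client, model)

        wall_start = time.perf_counter()
        latencies = await asyncio.gather(*[bounded() for _ in range(total)])
        wall = time.perf_counter() - wall_start

    return {
        "model": model,
        "concurrency": concurrency,
        "p50_latency_s": statistics.median(latencies),
        "throughput_rps": total / wall,
    }


async def main() -> None:
    # Sweep every (model, load) combination and print one result row per run.
    for model in MODELS:
        for level in CONCURRENCY_LEVELS:
            print(await run_level(model, level))


if __name__ == "__main__":
    asyncio.run(main())
```

A production pipeline would add warm-up runs, token-level throughput, cost and accuracy tracking, and result storage for reproducibility, but the core pattern of sweeping configurations and aggregating per-run metrics stays the same.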
About the Speaker
Roy is an AI and HPC expert with over a decade of experience in building advanced AI systems. He recently joined Red Hat through the acquisition of Jounce, where he served as the CEO. Roy is a Talpiot alumnus, holding a PhD in computer science and a GMBA.