Part 2: How Do You Evaluate Agents? | Evaluating AI Agents with Arize AI

Details
Evaluating large language models (LLMs) can be a daunting task, and agentic systems compound the difficulty considerably. In this second part of our community series with Arize AI, we will explore why traditional LLM evaluation metrics fall short when applied to agents and introduce modern evaluation techniques built for this new paradigm.
From code-based evaluations to LLM-driven assessments, human feedback, and benchmarking your metrics, this session will equip you with the necessary tools and practices to assess agent behavior effectively. You will also get hands-on experience with Arize Phoenix and learn how to run your own LLM evaluations using both ground truth data and LLMs.
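To make "ground truth" concrete ahead of the session, here is a minimal sketch of a code-based evaluation that scores an agent's tool choices against human-labeled expectations. The dataframe columns and toy rows are illustrative assumptions, not a dataset from the workshop.

```python
# Minimal code-based evaluation: compare agent outputs to ground truth labels.
# The columns ("question", "agent_tool_call", "expected_tool_call") and rows
# are illustrative assumptions, not the workshop dataset.
import pandas as pd

runs = pd.DataFrame(
    {
        "question": ["What is 2 + 2?", "What's the weather in Paris?"],
        "agent_tool_call": ["calculator", "web_search"],
        "expected_tool_call": ["calculator", "weather_api"],
    }
)

# A code-based eval is a deterministic check: here, exact match between the
# tool the agent chose and the tool a human labeled as correct.
runs["tool_call_correct"] = runs["agent_tool_call"] == runs["expected_tool_call"]

accuracy = runs["tool_call_correct"].mean()
print(f"Tool selection accuracy: {accuracy:.0%}")  # 50% for this toy data
```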
What We Will Cover:
- Why standard metrics like BLEU, ROUGE, or even hallucination detection aren’t sufficient for evaluating agents.
- Core evaluation methods for agents: code-based evaluations, LLM-driven assessments, human feedback and labeling, and ground truth comparisons.
- How to write high-quality LLM evaluations that align with real-world tasks and expected outcomes.
- Building and benchmarking LLM evaluations against ground truth data to validate their effectiveness (a benchmarking sketch follows this list).
- Best practices for capturing telemetry and instrumenting evaluations at scale.
- How OpenInference standards (where applicable) can improve interoperability and consistency across systems.
- Hands-on Exercise: Judge a sample agent run using both code-based and LLM-based evaluations with Arize Phoenix (a minimal LLM-judge sketch follows below).
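The code-based half of this exercise is sketched after the introduction above; below is a minimal sketch of the LLM-based half, an LLM-as-judge evaluation run with Phoenix's evals library. It assumes the `phoenix.evals` API (`llm_classify`, `OpenAIModel`), an installed `arize-phoenix-evals` package, and an `OPENAI_API_KEY` in the environment; the judge prompt, rails, model name, and dataframe columns are illustrative, and parameter names can differ slightly between Phoenix versions.

```python
# LLM-as-judge sketch with Arize Phoenix evals (illustrative, see caveats above).
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Judge prompt: the {question} and {agent_tool_call} placeholders are filled
# from the dataframe columns of the same names.
TOOL_CHOICE_TEMPLATE = """
You are evaluating whether an AI agent chose a reasonable tool for a user request.

[User request]: {question}
[Tool the agent called]: {agent_tool_call}

Respond with a single word: "correct" or "incorrect".
"""

runs = pd.DataFrame(
    {
        "question": ["What is 2 + 2?", "What's the weather in Paris?"],
        "agent_tool_call": ["calculator", "web_search"],
    }
)

# llm_classify runs the judge once per row and snaps its answer onto the rails.
results = llm_classify(
    dataframe=runs,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=TOOL_CHOICE_TEMPLATE,
    rails=["correct", "incorrect"],
    provide_explanation=True,  # keep the judge's reasoning for later review
)
print(results[["label", "explanation"]])
```

Snapping the judge's free-form answer onto a fixed set of rails keeps the output machine-readable, which is what makes the benchmarking step below possible.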
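Before trusting a judge like the one above, benchmark it against hand-labeled ground truth, which is what the "building and benchmarking" bullet refers to. A minimal sketch follows; the two label series are hypothetical stand-ins for the judge's output and a human-labeled golden dataset.

```python
# Benchmarking an LLM judge against human ground truth labels.
# `judge_labels` and `human_labels` are illustrative stand-ins for the
# results of llm_classify above and a hand-labeled golden dataset.
import pandas as pd

judge_labels = pd.Series(["correct", "incorrect", "correct", "correct"])
human_labels = pd.Series(["correct", "incorrect", "incorrect", "correct"])

agreement = (judge_labels == human_labels).mean()

# Treat "incorrect" as the positive class to see how well the judge
# catches real failures (recall) without flagging good runs (precision).
true_pos = ((judge_labels == "incorrect") & (human_labels == "incorrect")).sum()
precision = true_pos / (judge_labels == "incorrect").sum()
recall = true_pos / (human_labels == "incorrect").sum()

print(f"agreement={agreement:.0%}, precision={precision:.0%}, recall={recall:.0%}")
```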
Ready for Part 3 of the series? Find it here!

Every Wednesday until May 28, 2025