Evaluating & Shipping Production-Ready AI Agents

Demos are easy. Production is where AI agents break.
Join Kwasi Ankomah, Lead AI Architect at SambaNova Systems (15+ years of cross-industry experience), for the finale of our agentic AI series: a deep dive into the evaluation discipline that separates fragile demos from production-ready AI agents.

What you'll learn:
🔹 Why traditional software testing fails for non-deterministic, multi-step agents — and what actually works instead
🔹 The 4 evaluator types every production agent needs: rule-based, LLM-as-a-judge, trajectory, and recovery-from-failure
🔹 How to combine evaluators into a scorecard + regression gate that runs in CI on every prompt, model, tool, or architecture change
🔹 The state-of-the-art LangSmith workflow — datasets, experiments, and trace-based evaluation that catches failure/retry loops inside subagents
🔹 The open-source alternative: LangFuse + OpenTelemetry for full observability
🔹 Online evaluation and pass^k reliability — how to score live traffic and build real statistical confidence in agent performance
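To make the scorecard and regression gate idea concrete, here is a minimal Python sketch. It is not the session's implementation: the `AgentRun` record, the two evaluators, the baseline scores, and the tolerance are all hypothetical names and values chosen for illustration. It shows one rule-based evaluator and one simple trajectory evaluator averaged into a scorecard, then compared against a stored baseline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: final output plus ordered tool calls."""
    output: str
    tool_calls: list[str]


# An evaluator maps a run to a score in [0, 1].
Evaluator = Callable[[AgentRun], float]


def non_empty_output(run: AgentRun) -> float:
    """Rule-based check: the agent produced some final answer."""
    return 1.0 if run.output.strip() else 0.0


def expected_trajectory(expected: list[str]) -> Evaluator:
    """Trajectory check: the tool calls match an expected sequence exactly."""
    def evaluate(run: AgentRun) -> float:
        return 1.0 if run.tool_calls == expected else 0.0
    return evaluate


def scorecard(runs: list[AgentRun], evaluators: dict[str, Evaluator]) -> dict[str, float]:
    """Average each evaluator's score over all runs."""
    return {
        name: sum(evaluate(run) for run in runs) / len(runs)
        for name, evaluate in evaluators.items()
    }


def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the metrics that dropped more than `tolerance` below the baseline."""
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]


runs = [
    AgentRun(output="Paris", tool_calls=["search", "answer"]),
    AgentRun(output="", tool_calls=["search"]),
]
evaluators = {
    "non_empty": non_empty_output,
    "trajectory": expected_trajectory(["search", "answer"]),
}
card = scorecard(runs, evaluators)
failed = regression_gate(card, baseline={"non_empty": 0.9, "trajectory": 0.5})
print(card, failed)
```

In a CI job, a non-empty `failed` list would typically translate into a non-zero exit code so that the prompt, model, tool, or architecture change is blocked until the regression is understood.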

Why does this matter?
Multi-agent systems are non-deterministic — same input, different intermediate steps, different tool calls, different outputs. That's what makes them powerful, and exactly why traditional QA can't keep up. Without structured evaluation, every prompt tweak or model swap is a gamble on production stability.
This session is built live, on a real system, using actual traces and telemetry — not slides.
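Because the same input can produce different runs, reliability has to be measured over repeated trials of the same task. The announcement does not define pass^k, so the sketch below assumes one common definition (used, for example, in the τ-bench benchmark): the probability that all k independent trials of a task succeed, estimated from c successes in n recorded trials as C(c, k) / C(n, k). The function names and sample results are illustrative only.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k independent trials of one task all succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), where n is the number
    of recorded trials and c the number of successful ones.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("successes must be between 0 and the number of trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over a set of tasks."""
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: each task was run four times.
results = {
    "refund_request": [True, True, True, True],
    "flight_change": [True, False, True, False],
}
for k in (1, 2, 4):
    print(k, round(mean_pass_hat_k(results, k), 3))
```

With k = 1 this reduces to the ordinary success rate. Larger k penalizes tasks that only succeed some of the time, which is why the score for an unreliable agent drops quickly as k grows.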

Who should come?

  • AI/ML engineers working with LangGraph, CrewAI, or similar frameworks
  • Data scientists and architects running production LLM systems
  • Technical leads evaluating agent observability and CI tooling

Familiarity with supervisor-subagent patterns helps but isn't required (catch up via Session 5 of this series).

Bring your questions — if you've ever shipped an agent and wondered if it'll hold up, this one's for you.
