Evaluating & Shipping Production-Ready AI Agents
Demos are easy. Production is where AI agents break.
Join Kwasi Ankomah, Lead AI Architect at SambaNova Systems (15+ years of cross-industry experience), for the finale of our agentic AI series: a deep dive into the evaluation discipline that separates fragile demos from production-ready AI agents.
What you'll learn:
🔹 Why traditional software testing fails for non-deterministic, multi-step agents — and what actually works instead
🔹 The 4 evaluator types every production agent needs: rule-based, LLM-as-a-judge, trajectory, and recovery-from-failure
🔹 How to combine evaluators into a scorecard + regression gate that runs in CI on every prompt, model, tool, or architecture change
🔹 The state-of-the-art LangSmith workflow — datasets, experiments, and trace-based evaluation that catches failure/retry loops inside subagents
🔹 The open-source alternative: Langfuse + OpenTelemetry for full observability
🔹 Online evaluation and pass^k reliability — how to score live traffic and build real statistical confidence in agent performance
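For readers new to pass^k, here is a minimal sketch of one common way to estimate it. It assumes the usual definition (the probability that k independent runs of the same task all succeed) and the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n recorded runs. The session may compute it differently; the function names and the sample task results below are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k independent runs of one task ALL succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # Fraction of k-sized subsets of the recorded runs that contain only
    # successes. comb(c, k) is 0 when k > c, so one failure too many gives 0.0.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(outcomes: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the dataset."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    scores = [pass_hat_k(len(runs), sum(runs), k) for runs in outcomes.values()]
    return sum(scores) / len(scores)


# Hypothetical results: each task was run 4 times; True means the run passed.
outcomes = {
    "refund_request": [True, True, True, True],
    "flight_rebooking": [True, True, False, True],
}

print(mean_pass_hat_k(outcomes, k=1))  # 0.875
print(mean_pass_hat_k(outcomes, k=4))  # 0.5
```

The gap between the two numbers is the point: an agent that looks 87.5% reliable on a single attempt drops to 50% when every one of four attempts has to succeed.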
Why does this matter?
Multi-agent systems are non-deterministic — same input, different intermediate steps, different tool calls, different outputs. That's what makes them powerful, and exactly why traditional QA can't keep up. Without structured evaluation, every prompt tweak or model swap is a gamble on production stability.
This session is built live, on a real system, using actual traces and telemetry — not slides.
Who should come?
- AI/ML engineers working with LangGraph, CrewAI, or similar frameworks
- Data scientists and architects running production LLM systems
- Technical leads evaluating agent observability and CI tooling
Familiarity with supervisor-subagent patterns helps but isn't required (catch up via Session 5 of this series).
Bring your questions. If you've ever shipped an agent and wondered whether it'll hold up, this one's for you.