Why Speech-to-Speech APIs Fail When Voice AI Needs to Evaluate
Google shipped Gemini Live. OpenAI launched the Realtime API. The pitch is seductive: stream audio in, get audio back. One WebSocket, sub-second latency. But what happens when your AI needs to evaluate a human, not just chat with them?
Join Niraj Kothawade as he walks through the architecture decisions behind MasterPrep AI — a voice AI platform that interviews and assesses candidates in real time for enterprise hiring — and why he rejected speech-to-speech in favour of a server-side orchestration pipeline. Niraj will cover how state machines detect candidate behavioural patterns in real time, why LLMs are unreliable at enforcing hard limits, and how the pipeline enables capabilities like AI plagiarism detection that speech-to-speech makes impossible.
In this talk, you will learn:
- Why speech-to-speech APIs break down when voice AI needs to evaluate, not just converse
- How server-side state machines detect behavioural patterns like Solution Traps and Logic Gaps in real time
- The cost reality of audio tokens vs text tokens at scale
- What the pipeline unlocks that speech-to-speech can't — structured feedback, AI plagiarism detection, and deterministic control
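The core argument is that deterministic checks belong outside the model. As a rough illustration (the event names and the specific "Solution Trap" rule below are invented for this sketch, not taken from MasterPrep's implementation), a server-side detector can be a small state machine that an LLM cannot talk its way around:

```python
from enum import Enum, auto

class Phase(Enum):
    # Hypothetical interview phases, for illustration only.
    CLARIFYING = auto()
    SOLVING = auto()
    FLAGGED = auto()

class SolutionTrapDetector:
    """Toy sketch: flag a candidate who jumps straight to coding
    without asking any clarifying questions (a 'Solution Trap')."""

    def __init__(self) -> None:
        self.phase = Phase.CLARIFYING
        self.clarifying_questions = 0

    def on_transcript_event(self, event: str) -> Phase:
        if event == "question_asked" and self.phase is Phase.CLARIFYING:
            self.clarifying_questions += 1
        elif event == "code_started" and self.phase is Phase.CLARIFYING:
            # Hard, deterministic rule enforced server-side:
            # zero clarifying questions before coding trips the trap.
            if self.clarifying_questions == 0:
                self.phase = Phase.FLAGGED
            else:
                self.phase = Phase.SOLVING
        return self.phase

detector = SolutionTrapDetector()
detector.on_transcript_event("code_started")
print(detector.phase)  # Phase.FLAGGED
```

Because the rule lives in ordinary server code rather than a prompt, it fires the same way every time — which is the kind of deterministic control a pure speech-to-speech loop does not expose.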
About Niraj
Niraj Kothawade is a product leader and founder of MasterPrep AI, a voice AI platform for candidate interviews and assessment. He has 15+ years of experience building products at scale across companies including Deputy, Flipkart, and Yahoo. Find him on LinkedIn and X.
