Engineering voice agents: latency, quality, and scale — Rishabh Bhargava, Together AI

Channel: AI Engineer

Speaker: Rishabh Bhargava, Together AI
Duration: 24:35

Actionable Insights

Budget latency as an end-to-end conversation SLO, not a model benchmark. Start with a spreadsheet or trace dashboard that splits every turn into STT finalization, LLM time-to-first-token, tool/classifier/guardrail calls, TTS time-to-first-audio, network hops, and client playback buffering. Bhargava’s practical reference points are roughly: human turn-taking gaps around 300 ms, noticeable lag above ~500 ms, STT finalization near 100 ms P90 in optimized systems, LLM TTFT around 200–300 ms, and TTS real-time factor below 1 (audio synthesized faster than it plays back). Evaluate on P50/P90/P99 “user stops speaking → first useful audio heard,” not isolated engine latency. Caution: if you add safety classifiers or routing models, give each a hard per-turn budget before it enters production.
Build the first production version as a cascaded pipeline unless your use case can tolerate weaker tool fidelity. The talk’s reference architecture is: media/orchestrator → streaming STT → LLM/tool policy → TTS → media back to user. Use an orchestrator such as LiveKit Agents or Pipecat, then swap STT/LLM/TTS providers behind consistent interfaces. First experiment: instrument a single call flow with synthetic tasks like “change reservation,” “refund order,” and “book appointment,” then score transcript accuracy, tool-call validity, interruption handling, and first-audio latency. Expected benefit: debuggability and component-level evals. Caution: cascading loses prosody/emotion when speech becomes text, so keep full audio logs where consent/compliance permits.
Treat STT as a business-critical extraction step, not a commodity transcript. Create domain keyword tests for names, medicines, SKU/order IDs, addresses, and multilingual accents; plain aggregate WER can hide the failures that break the workflow. Bhargava cites ~6% WER as state-of-the-art on open benchmarks and warns that STT mistakes propagate through the LLM and TTS. Try streaming-native ASR where possible; NVIDIA’s Nemotron Speech Streaming and NeMo cache-aware streaming examples are relevant to the architecture he describes. Evaluate with per-slot accuracy, end-of-utterance latency, and false turn-end rate.
Keep the voice LLM small enough for TTFT, then recover task quality with routing, fine-tuning, and evals. Bhargava’s rule of thumb is that 8B–30B models often fit the voice latency envelope better than larger models, but tool calling must remain reliable. Build a tool-call eval suite that checks JSON/schema validity, correct tool choice, correct arguments, and whether the requested action is actually possible. If the small model fails domain-specific calls, fine-tune or distill for that workflow before jumping to a much larger hosted model. Caution: a model that streams a plausible conversational filler can still make an irreversible bad tool call; gate external side effects separately.
Co-locate orchestrator, STT, LLM, TTS, and guardrails when shaving tens of milliseconds matters. The talk’s concrete example is reducing the network latency to the model from ~75 ms to ~5 ms, which produced roughly a 30% latency reduction in an already optimized setup. Practical first step: deploy one region-local stack and compare it against a cross-region vendor mix using the same scripted calls. Include network latency in traces; do not report only provider engine latency. Caution: data residency, failover, and capacity reservations may matter more than the last 20 ms for regulated or global deployments.
Design guardrails before speech is spoken. In the Q&A, a participant notes that once the agent has spoken an unauthorized discount offer aloud, the damage is hard to undo. Put classifiers/routing before the main LLM where they reduce ambiguity, and response validators before TTS where they can still block unsafe or unauthorized content. Consider a “thinker/talker” pattern: a fast small model keeps the conversational turn alive (“let me check that”) while a larger or more guarded model performs the heavy reasoning/tool step. Evaluate with adversarial prompts, policy-boundary cases, and “spoken before blocked” incidents.
Pilot speech-to-speech models, but keep pipeline observability as the production baseline. OpenAI’s Realtime API now supports speech-to-speech agents, tool calls, and interruption handling; NVIDIA has also released speech/voice models in this direction. These systems can preserve tone, hesitation, backchannels, and barge-ins better than text pipelines. Run them as an A/B prototype for conversational naturalness and interruption handling, but require full-call evals, transcripts for auditability, and tool-call correctness thresholds before replacing a pipeline stack.
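The tool-call eval suite mentioned in the insights above (JSON/schema validity, correct tool choice, correct arguments) can be sketched as a small scoring function. This is a minimal illustration, not the method described in the talk: the tool registry, the tool names, and the `{"name": ..., "arguments": ...}` output shape are all assumptions made for the example.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "change_reservation": {"reservation_id": str, "new_time": str},
    "refund_order": {"order_id": str, "amount": float},
}

def score_tool_call(raw_output: str, expected_tool: str, expected_args: dict) -> dict:
    """Score one model output against one labeled eval case."""
    result = {"valid_json": False, "valid_schema": False,
              "correct_tool": False, "correct_args": False}
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return result
    if not isinstance(call, dict):
        return result
    result["valid_json"] = True

    name, args = call.get("name"), call.get("arguments")
    schema = TOOLS.get(name)
    if schema is None or not isinstance(args, dict):
        return result
    # Schema check: exactly the required keys, each with the expected type.
    result["valid_schema"] = (
        set(args) == set(schema)
        and all(isinstance(args[k], t) for k, t in schema.items())
    )
    result["correct_tool"] = name == expected_tool
    result["correct_args"] = result["correct_tool"] and args == expected_args
    return result

example = score_tool_call(
    '{"name": "refund_order", "arguments": {"order_id": "A1", "amount": 20.0}}',
    expected_tool="refund_order",
    expected_args={"order_id": "A1", "amount": 20.0},
)
print(example)
```

The fourth check from the insight, whether the requested action is actually possible, depends on backend state (for example, whether the order exists and is refundable), so it is left out of this sketch.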

Core thesis

Voice agents are no longer primarily a research demo problem; they are a systems engineering problem where latency, intelligence, naturalness, reliability, observability, and scale all have to be solved simultaneously. The current production-favored answer is a cascaded STT → LLM/tools → TTS pipeline, while the future likely shifts toward speech-to-speech models once instruction following, tool use, evals, and auditability mature.

Big ideas / key insights

Latency is cumulative. A voice agent can have individually fast models and still feel slow because each model hop, network round trip, classifier, guardrail, and playback buffer adds up.
Turn detection is still a hard product problem. False end-of-turn detection causes agents to talk over users; overly conservative detection makes agents feel sluggish.
The LLM is usually the latency and cost center. Bhargava frames LLM latency/cost as the largest share, followed by TTS and then STT, which is why model size selection matters.
Production voice systems grow extra components. Real deployments add classifiers, routers, guardrails, domain tools, and sometimes a thinker/talker split; each addition improves control but taxes latency.
Speech-to-speech is promising but not yet a universal replacement. Its advantage is preserving paralinguistic information and enabling full-duplex interaction; its weaknesses remain task fidelity and observability at production grade.
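The point that latency is cumulative can be made concrete with a few lines of arithmetic. The sketch below sums per-stage spans into a per-turn latency and reports nearest-rank percentiles. The stage names and all numbers are illustrative assumptions, not measurements from the talk, and the stages are treated as strictly sequential, which real streaming pipelines only approximate.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def turn_latency_ms(spans):
    """End-to-end latency of one turn: the sum of its sequential stage spans."""
    return sum(spans.values())

# Illustrative per-turn spans in milliseconds (hypothetical values).
turns = [
    {"stt_final": 100, "llm_ttft": 250, "guardrail": 40,
     "tts_first_audio": 120, "network": 75, "playback": 50},
    {"stt_final": 90, "llm_ttft": 220, "guardrail": 35,
     "tts_first_audio": 110, "network": 75, "playback": 50},
    {"stt_final": 120, "llm_ttft": 300, "guardrail": 60,
     "tts_first_audio": 150, "network": 75, "playback": 50},
]

totals = [turn_latency_ms(t) for t in turns]
print(totals)                  # [635, 580, 755]
print(percentile(totals, 50))  # 635
print(percentile(totals, 90))  # 755
```

In this made-up data every stage is individually within the reference points quoted earlier, yet every turn lands above 500 ms, which is the effect the first big idea describes.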

Best timestamped moments with interpretation

2:16–3:49 — Why voice agents are hard. Bhargava frames voice as a natural interface but immediately grounds it in four constraints: real-time response, enough intelligence for ambiguous workflows, pleasant/natural voice, and reliability at 100–10,000 concurrent calls.
4:20–5:22 — The dominant production architecture. The media stream enters an orchestrator, then STT, LLM/tool layer, and TTS. This is the clearest engineering map for teams building today.
5:22–8:24 — STT quality and streaming-native ASR. The important point is not “use Whisper” but “batch-trained models force workaround engineering.” Streaming-native encoders with limited lookahead and cached activations better match live voice.
8:24–9:27 — LLM TTFT and model-size tradeoff. The 200–300 ms TTFT goal pushes teams toward smaller models, which then increases the importance of fine-tuning and tool-call evals.
11:00–14:05 — Co-location and global deployment. The 75 ms → 5 ms network example is a useful reminder that infrastructure placement can be as important as model choice.
14:37–16:07 — Speech-to-speech future. The strongest argument for speech-to-speech is not simplicity alone; it is preserving tone, emotion, hesitation, interruption, and backchannels that text pipelines discard.
19:13–21:45 — Guardrails and thinker/talker pattern. The Q&A is practically valuable: guardrails inserted after generation can be too late if TTS has already spoken. A small conversational model plus larger guarded model is one plausible architecture.
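One way to picture the thinker/talker pattern with a validator placed before TTS is the asyncio sketch below. It is a toy under stated assumptions: `talker`, `thinker`, `validate`, the blocked-phrase list, and the fallback sentence are hypothetical stand-ins, and `speak` is a plain callback standing in for a TTS call. The talk does not prescribe this implementation.

```python
import asyncio

# Hypothetical policy: the agent may not offer discounts.
BLOCKED_PHRASES = ("discount",)

async def talker() -> str:
    """Stand-in for a fast small model: returns a filler right away."""
    return "Let me check that for you."

async def thinker(user_text: str) -> str:
    """Stand-in for a slower guarded model; the sleep simulates reasoning/tool time."""
    await asyncio.sleep(0.05)
    if "cheaper" in user_text:
        return "I can give you a 50% discount."
    return "Your appointment is moved to 3 pm."

def validate(text: str) -> bool:
    """Pre-TTS validator: True only if the text is safe to speak."""
    return not any(phrase in text.lower() for phrase in BLOCKED_PHRASES)

async def handle_turn(user_text: str, speak) -> None:
    heavy = asyncio.create_task(thinker(user_text))  # start the heavy step immediately
    speak(await talker())                            # filler keeps the turn alive
    answer = await heavy
    if validate(answer):                             # gate runs before anything is spoken
        speak(answer)
    else:
        speak("I'm not able to help with that, but I can connect you to an agent.")

spoken = []
asyncio.run(handle_turn("Can I get it cheaper?", spoken.append))
print(spoken)
```

The design choice worth noting is that the validator sits between the thinker's output and `speak`, so a blocked answer never reaches the audio path; counting how often the fallback branch fires gives a rough "blocked before speech" metric.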

Practical workflow

Define one narrow voice workflow and write 30–100 scripted call scenarios with success criteria.
Build a cascaded pipeline with explicit trace spans: VAD/turn detection, STT, LLM, tools, guardrails, TTS, network, playback.
Set launch gates: end-to-end first-audio P90, tool-call schema validity, task completion rate, false interruption rate, barge-in recovery, and domain keyword accuracy.
Add guardrails only with a latency budget and a “blocked before speech” metric.
Run region/co-location experiments before changing model families.
Prototype speech-to-speech separately and compare full-call outcomes, not just subjective naturalness.
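The launch gates in the workflow above can be expressed as a simple threshold check run against each eval batch. This is a sketch only: the metric names mirror the list above, but every threshold value is a hypothetical placeholder to be replaced with numbers from your own baseline.

```python
# Hypothetical thresholds: ("max", x) means the metric must not exceed x,
# ("min", x) means it must not fall below x.
GATES = {
    "first_audio_p90_ms": ("max", 800),
    "tool_schema_validity": ("min", 0.99),
    "task_completion_rate": ("min", 0.90),
    "false_interruption_rate": ("max", 0.05),
    "barge_in_recovery_rate": ("min", 0.95),
    "domain_keyword_accuracy": ("min", 0.97),
}

def failed_gates(metrics: dict) -> list:
    """Return the names of gates that are missing or out of bounds."""
    failures = []
    for name, (kind, limit) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)  # an unmeasured gate counts as a failure
        elif kind == "max" and value > limit:
            failures.append(name)
        elif kind == "min" and value < limit:
            failures.append(name)
    return failures

run = {
    "first_audio_p90_ms": 720,
    "tool_schema_validity": 0.995,
    "task_completion_rate": 0.88,
    "false_interruption_rate": 0.03,
    "barge_in_recovery_rate": 0.96,
}
print(failed_gates(run))  # ['task_completion_rate', 'domain_keyword_accuracy']
```

Treating a missing metric as a failure is a deliberate choice here: it keeps a gate from passing silently because nobody measured it.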

Comment insights

No comments were extracted for this video, so there is no audience-derived evidence to synthesize. The only “comment-like” practitioner input comes from the live Q&A: attendees pressed on voice-to-function-call evals, co-location mechanics, guardrail placement, and observability for speech-to-speech systems. Those questions reinforce that production builders are worried less about demo quality and more about measurable tool correctness, network placement, policy enforcement before TTS, and audit logs.

Verdict

Bottom line: agree with the talk’s production bias toward cascaded voice-agent pipelines, with medium-high confidence. The talk is strongest when it treats voice agents as latency-budgeted distributed systems rather than demos. The main overclaim is the implied universality of exact timing/model-size heuristics; real thresholds depend on domain, fillers, tool latency, and user expectations. The practical takeaway is to launch with a measured pipeline, prove tool correctness and guardrail timing, then evaluate speech-to-speech as an optimization path.

Deep research on the main claims

Claim 1: Human-like voice agents need sub-second, often ~300–500 ms, response timing.

Support: Turn-taking research in conversation analysis generally finds human gaps between turns are short, often around a few hundred milliseconds; recent voice-agent engineering guides such as Hamming AI’s 2026 latency guide also summarize a 200–300 ms conversational target. Bhargava’s warning that 1–2 second delays cause abandonment is directionally consistent with voice UX practice, although exact abandonment thresholds vary by domain.
Contradiction / nuance: Some task-oriented systems can tolerate longer delays if they use appropriate fillers (“let me check”) or the user expects backend work. A strict 300 ms target for every complete answer can be unrealistic when external tools or complex reasoning are required.
Verdict: Agree, high confidence. The practical takeaway is to optimize perceived turn latency and first useful audio, not necessarily complete answer latency.

Claim 2: Cascaded STT → LLM → TTS is the dominant production architecture today.

Support: OpenAI’s own voice-agent documentation distinguishes model/orchestration choices, while frameworks such as LiveKit Agents and Pipecat are built around pluggable realtime media, STT, LLM, and TTS components. Together AI’s 2026 post on realtime voice agents similarly describes dedicated STT, LLM, and TTS endpoints with sub-500 ms end-to-end goals.
Contradiction / nuance: OpenAI Realtime and similar systems show that speech-to-speech architectures are already usable, especially for natural conversation and interruption handling. “Dominant” is hard to prove from public data, and some vendors hide cascades behind one API.
Verdict: Agree with caveat, medium-high confidence. Pipelines remain the safest default for debuggable production agents, but speech-to-speech is increasingly viable for selected workloads.

Claim 3: Streaming-native ASR is better suited to live agents than batch-oriented Whisper-style processing.

Support: Whisper was originally designed around fixed audio windows, making low-latency streaming require chunking/stitching engineering. NVIDIA’s Nemotron Speech Streaming model page and NeMo streaming examples describe cache-aware streaming ASR, matching Bhargava’s point about lookahead and cached activations.
Contradiction / nuance: Whisper-derived systems can still be engineered into usable streaming products, and quality, language coverage, deployment constraints, and cost may outweigh architectural purity.
Verdict: Agree, high confidence. For new low-latency voice-agent work, evaluate streaming-native ASR first, but benchmark domain accuracy before replacing a proven Whisper stack.
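Benchmarking domain accuracy, as the verdict above recommends, can start with a per-slot keyword check rather than aggregate WER. The sketch below assumes you have labeled cases of the form (slot type, expected value, STT transcript). The function names, the normalization rule, and the sample cases are all hypothetical, and whole-token matching is a deliberately naive scoring rule.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())

def slot_accuracy(cases):
    """cases: iterable of (slot_type, expected_value, transcript).

    Returns, per slot type, the fraction of cases in which the expected value
    appears in the transcript as a whole-token match.
    """
    hits, totals = {}, {}
    for slot, expected, transcript in cases:
        totals[slot] = totals.get(slot, 0) + 1
        # Pad with spaces so "ab 1234" does not match inside "cab 12345".
        if f" {normalize(expected)} " in f" {normalize(transcript)} ":
            hits[slot] = hits.get(slot, 0) + 1
    return {slot: hits.get(slot, 0) / totals[slot] for slot in totals}

cases = [
    ("medicine", "metoprolol", "I take metoprolol twice a day"),
    ("medicine", "metformin", "I take met forming daily"),
    ("order_id", "AB-1234", "my order is AB 1234"),
]
print(slot_accuracy(cases))  # {'medicine': 0.5, 'order_id': 1.0}
```

A report like this shows which slot types fail even when overall WER looks acceptable, which is the failure mode the STT insight warns about.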

Claim 4: 8B–30B LLMs are a practical size range for voice agents because of TTFT constraints.

Support: Inference literature and vendor benchmarks consistently show that larger models generally increase latency/cost, while smaller models can reach lower TTFT with sufficient serving optimization. Bhargava’s range aligns with common production tradeoffs for fast tool-using agents.
Contradiction / nuance: Hosted frontier realtime models, speculative decoding, distillation, and specialized inference stacks can shift this range. Some workflows require stronger reasoning and may accept fillers or async responses rather than forcing all reasoning into a 300 ms path.
Verdict: Mixed, medium confidence. Use 8B–30B as a starting heuristic, not a law. The real gate is measured TTFT plus tool-call success on your workload.

Claim 5: Co-location can materially reduce end-to-end latency.

Support: Network round-trip time is bounded by distance and routing, and Bhargava’s London-to-US example is technically plausible. Together AI’s public realtime voice-agent materials emphasize dedicated, pre-warmed endpoints; broader low-latency systems practice supports regional placement and avoiding cross-ocean hops.
Contradiction / nuance: If model engine latency or tool latency dominates, co-location may provide only marginal benefit. Multi-region deployment adds operational and compliance complexity.
Verdict: Agree, high confidence. Co-locate once component latencies are already optimized or when global users experience avoidable cross-region hops.
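The nuance in this claim, that co-location helps most when the rest of the stack is already fast, follows from simple arithmetic. The sketch below uses the talk's 75 ms → 5 ms hop figures, but the baseline latencies and the single-hop assumption are hypothetical; the talk does not state them.

```python
def colocation_saving(baseline_ms: float, hop_before_ms: float,
                      hop_after_ms: float, hops: int = 1):
    """Return (new_latency_ms, fractional_reduction) when `hops` sequential
    network hops each drop from hop_before_ms to hop_after_ms."""
    saved = hops * (hop_before_ms - hop_after_ms)
    if saved > baseline_ms:
        raise ValueError("saving exceeds baseline; check the inputs")
    return baseline_ms - saved, saved / baseline_ms

# Hypothetical optimized stack: a 70 ms saving is about 30% of a 233 ms baseline.
print(colocation_saving(233, 75, 5))   # new latency 163 ms, reduction of about 0.30

# Hypothetical slower stack: the same saving is only 7% of a 1000 ms baseline.
print(colocation_saving(1000, 75, 5))  # new latency 930 ms, reduction of 0.07
```

Under these assumptions, a reduction of roughly 30% from one hop is only plausible when the baseline is already in the low hundreds of milliseconds, which is consistent with the talk's framing of an already optimized setup.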

Claim 6: Speech-to-speech models are the likely next generation, but tool calling and instruction following remain blockers.

Support: OpenAI’s Realtime API documentation explicitly supports speech-to-speech agents, tool calls, live translation, and interruptions, indicating the direction of travel. The architectural advantage is real: speech-to-speech can preserve tone and enable full-duplex interaction.
Contradiction / nuance: Public docs show tool calling exists; the stronger claim is about reliability in complex production workflows, which is workload-dependent and less publicly benchmarked. Some teams may already be successful with speech-to-speech for constrained domains.
Verdict: Agree with caveat, medium confidence. Treat speech-to-speech as a strategic prototype path, but require task-specific evals and auditability before broad production migration.

Screen-level insights

0:14 frame — sponsor/title slide. The visible slide lists “PLATINUM SPONSORS” with Braintrust, WorkOS, and OpenAI logos. It does not add technical evidence, but it places the talk in an AI engineering conference context rather than a product webinar.
0:44 frame — Together AI company slide. The slide claims Together AI has 1M+ developers, 300+ employees, 50 PhDs, 7 professors, and $500M+ funding, with a logo wall including Cursor, ElevenLabs, Mistral, Decagon, Cartesia, DeepMind, Writer, Cognition, Zoom, DuckDuckGo, and others. This supports why the speaker frames the advice around inference at scale and production customers.
20:14 frame — process/architecture slide during guardrail Q&A. The image is blurry, but it shows a left-to-right sequence of boxes/modules with a highlighted rightmost stage. Nearby transcript discusses adding classifiers/guardrails before or after the main LLM and the risk of invoking TTS before policy checks. The visual matters because the answer is architectural: every extra box is both a control point and a latency cost.

My read / why it matters

This is a useful production talk because it resists the temptation to frame voice agents as “just add speech I/O to an LLM.” The core engineering lesson is that voice agents are distributed realtime systems with ML components inside them. The teams that win will likely have strong evals, tracing, region strategy, and policy gates—not just a better demo voice.

The most important caution is that naturalness can mask unreliability. A pleasant voice that calls the wrong tool, speaks before a guardrail fires, or misunderstands a drug name is worse than a slower but auditable system. Start with the boring pipeline, measure every turn, and only collapse the pipeline into a speech-to-speech model when your evals prove that the simpler architecture is also safer and more correct.

Verification notes

Source/evidence audit: Main transcript claims were checked against the extracted transcript, especially latency targets, pipeline components, co-location, speech-to-speech tradeoffs, Q&A eval guidance, and guardrail placement. External checks included Together AI realtime/speech posts, OpenAI Realtime/voice-agent docs, NVIDIA Nemotron/NeMo streaming ASR references, LiveKit/Pipecat framework docs, and voice latency/turn-taking research summaries.
Transcript/comment/frame fidelity audit: No extracted comments were available; the analysis states that explicitly. Screen insights were limited to visible content from the three extracted frames and did not infer unreadable slide text beyond noting a blurry pipeline diagram.
Hallucination/overclaim audit: Claims such as “dominant architecture,” “8B–30B,” and “speech-to-speech blockers” are presented as practical heuristics or caveated verdicts rather than universal facts. Vendor performance claims are treated as sources to evaluate, not independent guarantees.
Actionable Insights audit: The top section includes concrete first steps, tools/frameworks, metrics, evaluation criteria, cautions, and links where available. Residual uncertainty remains around exact production adoption rates and workload-specific latency tolerance, which require direct benchmarking in the target environment.