Transcript: Engineering voice agents: latency, quality, and scale — Rishabh Bhargava, Together AI

AI Engineer, 24m 35s

Source video ID: N7b1PJc7SFc

Transcript

0:14 — All right. Well, folks, thanks for being here. Excited to chat more about how to engineer voice agents: high quality, low latency, at scale. First, a little bit about me. My name is Rishabh. I work at a company called Together AI, where I lead the voice AI team. Prior to Together, I was the co-founder and CEO of a company called Refuel, which was acquired by Together last year. But generally I’ve been building AI and machine learning infrastructure for about a decade. For folks who maybe don’t know about Together, Together is building the AI native cloud. What that really means is
0:44 — for companies that are looking to train models and need access to reliable compute, or that want to do inference at scale, we’re probably a very good fit for you. We work with a million plus developers. We work closely with hundreds of companies and are very proud to be working with companies like Cursor and Decagon. Okay, here is the agenda for today. We’re going to first talk about why we’re talking about voice, although the previous speaker alluded to a lot of interesting things that he’s doing with voice. But why does voice really matter? What does it actually take to build voice agents at scale?
1:15 — What are the challenges? We’ll talk about this pipeline architecture, which is becoming the dominant way to build agents in production today. We’ll deep dive a little bit. We’ll look at all the components. We’ll look at the system and trade-offs. And then finally, we’ll chat about what might be the next generation of building voice agents in the coming months and years. Okay, so starting with why voice matters. Well, there’s billions of phone calls a year that are still handled by humans. I am pretty sure all of us have the experience of calling customer support,
1:46 — asking about the status of our order, looking to change the reservation. We’ve also probably had the experience of calling a doctor’s office because we got to book an appointment for ourself or a loved one. And pretty much everybody here probably has had the experience of being on hold. Now, it would be amazing for AI agents to be able to handle some of these calls for us. But really one of the more exciting directions is, frankly, voice is just this brand new interface to interact with systems and computers. Look, humans, we learned how to talk before we
2:16 — learned how to read, right? So, this comes very naturally to us. And obviously, we’re seeing this with ChatGPT’s advanced voice mode. And we’re seeing this with folks who are starting to talk directly to Cursor, talk to Claude Code in order to get their work done. And this is, frankly, just the beginning. And one of the exciting pieces of 2026 is building these rich, high-quality conversations. This is not the domain of science fiction or research anymore. This is primarily an engineering problem today. Now, why is it hard? Well, there’s a few
2:48 — things that you got to solve first. First, building voice AI and building voice agents, this has to be real time. You know, when humans are having a conversation, we respond to each other’s cues in something like 300 milliseconds. And so, if you’re talking to an AI and it’s taking more than 500 milliseconds to respond, you’ll start to notice. If it takes a second, if it takes 2 seconds, people will just hang up. So, you got to get latency down. The next thing that matters is you want it to be a reasonably smart call. You
3:19 — want to get the work, the job done. And so, for real-world complex workflows, you know, the instructions are complicated. There’s a lot of ambiguity. You have to be good at tool calling because that’s the way you give agents access to the real world. So, you have a baseline level of intelligence that you got to meet. The third piece that you got to solve for is the voice has to be natural enough. It has to sound pleasant enough. And this means a lot of different things. It means, you know, can it talk to you in your own language with the right accent potentially?
3:49 — Can it pronounce your name? Can it deliver the right emotion that is needed for a particular situation? A lot of things fall into that bucket. And finally, you could stitch together a nice demo with one person calling. But what happens when you’re doing 100 calls, 1,000 calls, 10,000 calls concurrently? Reliability really starts to matter. And this is an “and” problem. You have to solve every single one of them at the same time, or you’re going to be in a bit of trouble.
4:20 — So, at least today, the dominant way of building these voice agents is this pipeline architecture, or cascading architecture, which attempts to solve all of the problems that I outlined earlier. Now, there’s a few boxes here, but conceptually it’s relatively simple to understand: audio chunks from an end user are streamed in, potentially to an agent orchestrator, something like Pipecat or LiveKit or something homegrown. And then essentially, this audio is fed into a speech-to-text system that
4:50 — converts it into text. That’s fed into an LLM that then decides: do I do a tool call? What is the output? It produces text that is then fed into a text-to-speech model, which starts to produce audio chunks that are then streamed back to the end user. So, that’s a rough architecture. Let’s look at each of the components that matter here, one by one. The first is speech to text, very much the ears of your agent. Of the performance metrics that matter here, the first is quality, word error
5:22 — rate. Now, depending on use case, the numbers might look different, but state-of-the-art models are typically at around 6% word error rate on open benchmarks. What that really means is that when you compare the transcript produced by your model with the reference transcript, 6% of words have an error in them. Now, you can imagine why this might matter, right? Because if you don’t get the transcript right, you don’t get somebody’s name right, you don’t get the name of, let’s say, a drug right, essentially, there’s no way to fix this. Your LLM will carry forward the mistake. The TTS model will
5:53 — carry forward the mistake. So, you have to get it right for the important keywords. And then the second metric that often matters, which is latency driven, is time to complete a transcript. The way to understand this is: when somebody completes an utterance and stops speaking, how many milliseconds does it take for you to complete the transcript and have it ready for the LLM? As an example, for some of the models that we run on Together, we consistently get a P90 of around 100 milliseconds, which is pretty fast. Aside from just raw performance, there’s
6:24 — a few other capabilities that matter. Turn detection: very important, still a somewhat unsolved problem, frankly. That could be a 20-minute talk in itself. But really, the best way to understand this is you’ve got people who are talking. Maybe they pause for a second. But do you actually know? Does that pause mean their turn has ended? Are they going to continue talking? Because really, the last thing that you want here is for the agent to start sending audio back and talking at this person even though their turn hasn’t ended. We don’t enjoy this
6:54 — in human conversations, and we will certainly not enjoy this in AI conversations. Depending on who your customers are, language matters, and so being able to do this for a wide variety of languages and getting it right there is important. And the final piece that I’ll mention, which is somewhat new, is that we’re also starting to see model architectures that are streaming native. A little sidebar, we won’t spend too much time on this, but there’s an architectural evolution for speech-to-text models in progress, which is going from batch
7:24 — models to streaming models. Whisper is the canonical model; it came out a few years ago. It was trained on 30-second audio clips. 30 seconds is way too much. You can’t wait 30 seconds to start to do transcription. So, people have had to build all sorts of complicated logic around models like Whisper to do chunking, pad it with silences, and then make multiple calls and stitch that together to produce the final transcript in streaming mode. But recently, and this is a fairly new model from the Nvidia team,
7:54 — instead, you have the encoder of the model have two interesting characteristics. The first is it’s trained with different amounts of look-ahead time. So, it only looks at perhaps 80 milliseconds or maybe up to a second of audio instead of 30 seconds. And it’s also able to cache these activations so that as you kind of make small steps in audio frames, you’re actually only doing the heavy computation once. So, again, just kind of stepping out, but, you know, it’s an interesting direction that we’re seeing to be able to handle streaming
8:24 — conversations for these voice agent use cases. Okay, so that’s speech to text. Jumping into the next part of the pipeline, which is LLMs, very much the brains of your agent. The performance metric that matters here first and foremost is streaming latency. And so time to first token (TTFT) is the metric here. A rough rule of thumb is that it’s usually pretty good if you can get to 300 milliseconds of TTFT because you want to start producing tokens and start feeding them into the TTS model as fast as possible.
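
The two headline metrics covered so far, word error rate for speech to text and time to first token for the LLM, can each be sketched in a few lines of Python. This is a minimal illustration rather than how any particular benchmark or provider measures them: the function names, the lowercase-and-split normalization, and the fake token stream are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Iterator


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def time_to_first_token(
    stream: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> float:
    """Seconds from starting to read the stream until the first token arrives."""
    start = clock()
    for _ in stream:
        return clock() - start
    raise ValueError("stream produced no tokens")


def fake_llm_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(delay_s)  # simulated network plus prefill time
    yield "Sure,"
    yield " I can help."


if __name__ == "__main__":
    # One substitution plus one insertion against five reference words.
    print(word_error_rate(
        "refill my metformin prescription please",
        "refill my met forming prescription please",
    ))  # 0.4
    print(f"TTFT: {time_to_first_token(fake_llm_stream(0.05)) * 1000:.0f} ms")
```

The example transcript shows why an aggregate number can hide the problem described above: only two word-level edits occur, but both land on the drug name, which is the keyword the rest of the pipeline needs.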
8:57 — That number, 200 to 300 milliseconds, has implications for what models you can use. And so, a good size model typically ends up being in this 8 to 30 billion range. If you go any bigger, you’ll burn through your latency budgets. If you go too small, that has implications for the intelligence of the model and, frankly, the tool calling that is needed, which are both pretty critical if you want to build a voice agent that does meaningful stuff in the world. Okay, text to speech. This is very much,
9:27 — you know, the voice of your agent. There’s a few interesting things on performance and capabilities. On performance, again, the trend continues. What is the time to first audio (TTFA), right? As you get a transcript, how long does it take to produce the first audio chunk that can start to be streamed back? And aside from TTFA, what does the real-time factor look like? Real-time factor, and this is generally the case for most TTS models, means how much audio you can produce in a certain number of seconds of processing
9:57 — time. So, if you can produce 10 seconds of audio in 5 seconds, your RTF is 0.5. And so, you typically want that to be less than one so that you’re not buffering. Quality is one of the hard ones with TTS because there are some objective measures, but frankly nothing quite beats listening to audio samples for the models and voices that you care about and getting a feel for whether this is the right experience that you want your end customers to have. Some of the other capabilities, you
10:28 — know, it’s naturalness across a number of different voices, being able to pronounce things exactly right, whether it’s customer names or product names, and being able to have some amount of control over emotions. And so, you might see TTS models that allow you to add these little tags which say this is happy or angry or sad. It’s a start, but these models are getting pretty good at emotional control. And of course, coverage over languages continues to matter. Okay, so those are the main components,
11:00 — but just to zoom out a little bit. All of these components are part of this larger architecture, which is multiple models being orchestrated. And so, there’s a few things that we should always keep in mind. First, there’s a latency and cost budget across these models that we’re thinking about. A rough rubric is that the LLM is going to take up a majority of it, followed by TTS, followed by speech to text, both from a latency and a cost perspective. And so, again, just rough rules
11:30 — to think about. And one piece that we didn’t mention, and we’ll come back to this in a second, is that so far a lot of the numbers that we’re looking at are just engine latency: how much time does it take the model to produce an output? But actually, when you’re calling models that might be sitting in different data centers, there’s network latency as well, and that starts to have an impact. But again, we’ll come to that in one more slide. Auto scaling is also somewhat interesting and tricky to get right for agent systems. Of course, you want
12:01 — to be doing auto scaling to scale up as demand goes up and to scale down seamlessly, potentially at night or on weekends. With scaling up, what we’ve typically seen is that people are much more aggressive, because the last thing you want is requests being slowed down or backed up. So, you typically might auto scale earlier than you would with somewhat more asynchronous systems. And scaling down is also tricky because you might actually have these stateful long-lived connections to your models. And so, you can’t just arbitrarily kill
12:32 — a pod. You might want to wait for conversations to fully finish. So, some interesting nuances with auto scaling. And finally, global deployments are important because you want your system to be as close to your end users as possible to shave off latency. And of course, if you’re building in Europe or in places where data residency matters, you want to make sure that you have the ability to deploy wherever you absolutely need. I know I referenced co-location.
13:04 — But here’s one way to understand this problem. The chart on the left-hand side is very much a fairly optimized system where you’re doing a pretty good job with your speech to text, text to speech, and LLM models, where the engine latency is in exactly the right ballpark. You’re doing 100 to 200 milliseconds of time to first token or audio. But you might actually end up having your models sufficiently far away from your agent orchestrator that it’s taking 75 milliseconds of
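
The scale-down nuance mentioned here, not killing a pod while conversations are still in flight, can be sketched as a drain step. This is a minimal asyncio illustration under assumptions: `VoiceWorker`, `accept`, `drain`, and `fake_call` are hypothetical names, and a real deployment would trigger the drain from whatever shutdown signal its orchestrator sends, with a grace period long enough for a typical call.

```python
import asyncio
from typing import Any, Coroutine


class VoiceWorker:
    """Hypothetical worker holding long-lived, stateful call sessions."""

    def __init__(self) -> None:
        self.draining = False
        self._active: set[asyncio.Task] = set()

    def accept(self, call: Coroutine[Any, Any, None]) -> bool:
        """Start a call unless draining; False tells the router to go elsewhere."""
        if self.draining:
            call.close()  # discard the unstarted coroutine cleanly
            return False
        task = asyncio.create_task(call)
        self._active.add(task)
        task.add_done_callback(self._active.discard)
        return True

    async def drain(self, timeout_s: float) -> int:
        """Stop taking calls, wait for in-flight ones, return how many were cut off."""
        self.draining = True
        if not self._active:
            return 0
        _, pending = await asyncio.wait(self._active, timeout=timeout_s)
        for task in pending:
            task.cancel()  # grace period expired; these calls are dropped
        return len(pending)


async def fake_call(seconds: float) -> None:
    """Stand-in for a conversation that lasts `seconds`."""
    await asyncio.sleep(seconds)


async def demo() -> tuple[bool, bool, int]:
    worker = VoiceWorker()
    accepted = worker.accept(fake_call(0.05))
    drain_task = asyncio.create_task(worker.drain(timeout_s=1.0))
    await asyncio.sleep(0)  # let drain() start and set the draining flag
    rejected = worker.accept(fake_call(0.05))
    cut_off = await drain_task
    return accepted, rejected, cut_off


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice is that the worker refuses new calls as soon as draining begins but lets existing calls run to completion, cancelling them only when the timeout expires.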
13:35 — network latency. 75 milliseconds is really not that much. I think US West to Europe would certainly be 75 milliseconds, but depending on networking it can be much higher as well. And so, an interesting direction that we’re seeing folks go is: how can you co-locate all your models, and potentially your agent orchestrator, to either be in the same data center or be very, very close to each other? How can you get them literally in the same building? Because that drop from 75
14:05 — milliseconds to five basically gets you a 30% reduction in an already fairly optimized voice agent setup, right? So, especially with real-time systems, it’s just pretty important to have fairly deep observability, and every 10 milliseconds matters. Okay. So, hopefully that’s an interesting picture of this pipeline architecture. But that’s not the only way people do it. One of the other directions that is becoming interesting is a pure
14:37 — speech to speech model. And so, instead of having speech to text followed by LLM followed by text to speech, where you’re coordinating and orchestrating across a number of different models, it’s just way simpler if you could have a pure speech to speech model that still is responsible for function calling, still handles all of the complicated instructions, but with just a single model doing it. And of course, for folks who’ve played around with OpenAI’s Realtime API, they have a single model behind the scenes. Nvidia recently launched a model called voice chat,
15:07 — again, very similar ideas. The reason why most of these models are not used in production too much is because they still have trouble with instruction following and tool calling. So, the real-world experience often looks like you’ll try them, and then you spend a lot of time just prompt engineering and hoping to fix issues, and eventually move to a pipeline architecture. But as these models get better, which I’m confident they will, it has some pretty incredible benefits
15:37 — because suddenly you don’t lose anything about the nuances of speech when that speech is getting converted into text. So, the model will natively understand what was the tone, what was the emotion, was the user hesitant? That stuff will still remain with the model to make the next decision. And this type of model also allows for more full-duplex, human-like communication, which basically means that the model can start producing audio while it’s still receiving audio. And this means that, as a
16:07 — customer is talking to this model, you can back channel. You can say “I see” or “uh-huh”, similar to what a human conversation would look like. And these models become much better at handling interruptions and barge-ins, which, again, with a pipeline architecture you have to do a lot more complicated engineering work around. So, hopefully this points in the direction of what might be coming in the future. But frankly, we have a lot of engineering work ahead for all of these voice interfaces that still have to be built.
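
The co-location claim from the pipeline discussion, that dropping network latency from 75 milliseconds to 5 gives roughly a 30% reduction, can be checked with simple arithmetic. The sketch below rests on assumptions not stated in the talk: each of the three models has 150 ms of engine latency (the midpoint of the 100 to 200 ms range mentioned), and each model call pays the network latency once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    engine_ms: float  # time for the model to emit its first token or audio chunk


def turn_latency_ms(stages: list[Stage], network_ms_per_hop: float) -> float:
    """Engine latency of every stage plus one network hop per stage."""
    return sum(stage.engine_ms + network_ms_per_hop for stage in stages)


# Assumed engine latencies; the talk only gives a 100 to 200 ms range.
PIPELINE = [
    Stage("speech-to-text", 150),
    Stage("llm", 150),
    Stage("text-to-speech", 150),
]

remote_ms = turn_latency_ms(PIPELINE, network_ms_per_hop=75)
colocated_ms = turn_latency_ms(PIPELINE, network_ms_per_hop=5)
reduction = (remote_ms - colocated_ms) / remote_ms

if __name__ == "__main__":
    print(f"remote:    {remote_ms:.0f} ms")     # 675 ms
    print(f"colocated: {colocated_ms:.0f} ms")  # 465 ms
    print(f"reduction: {reduction:.0%}")        # 31%
```

Under these assumptions the network accounts for 225 of 675 ms before co-location and 15 of 465 ms after, which lands close to the 30% figure quoted; different engine latencies or hop counts would shift the number.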
16:39 — If you want to learn more about what Together does, here are a couple of links. We are hiring. We also have a booth, G1, downstairs. Happy to chat more and happy to answer questions. For voice to function calling use cases, what sort of evals do you use? And then, what score do you need to get on those evals in order to have something that’s good enough for production?
17:09 — The classic answer is “it depends”, but in terms of evals, there’s of course component by component evals. So, if we’re talking about the pipeline architecture, and assuming that speech to text is good, text to speech is good, and the only thing that we’re caring about is function calling or tool calling evals on the LLM, then it’s very similar to how one might do evals for tool calling for LLMs broadly, which is: was the tool call correct? Was the output actually
17:39 — possible? There’s a bunch of those. One would imagine that you’d want the tool call structure to at least be very close to 100%, and then the correctness, again, depends a little bit on what the use case eventually demands. One of the other things that we’re seeing, especially to make models better at tool calling, and because we have to stay within that LLM budget where the models have to be relatively small, is that customers fine-tune smaller LLMs with their use case
18:10 — specific data so that they can get tool calling quality to go up while remaining a model that is relatively small. Yeah. When you mentioned co-location, what I understand from that is you’re decreasing the network latency. Yeah, it’s literally because the machines are closer to each other. So, does that mean, like, if I’m using cloud providers or... Good question. So, for example,
18:42 — let’s say you’re building a voice agent here in London, right? You have servers here, but you’re using perhaps OpenAI’s models for the LLM. And now, odds are that OpenAI’s servers might be somewhere in the US. So, literally the network hop has to be from your server here all the way there and back. Instead, if you’re able to run, let’s say, an open source model in the data center where you’re running your voice agent, now it’s basically going to be inside
19:13 — the data center rather than going, let’s say, over the Atlantic. So, that’s one way to think about it, and distance literally has that big of an impact, aside from other networking related concerns. Thank you. Yes. One more question. So, when trying this voice pipeline system, we had to introduce some guardrailing. So, let’s say we have a classifier model in between which just checks that the model’s not offering an 8% discount
19:44 — which it’s not authorized to do. How does it fit into the pipeline, and how do you do that without compromising the latency and the experience? Yeah, it’s a great question. So, the question is: what if we have other models, like guardrail models or other classifiers, in the mix? How does this fit into this architecture? You’re absolutely right. This is the most basic reference architecture that one might have. But in many production settings, there’s actually multiple models that might be in the mix. The guardrail or the
20:14 — classifier, we definitely see people start to introduce that right before the main LLM as well, because maybe you want to check whether this is something that goes to an LLM that handles refunds versus something that handles order tracking, perhaps. And so you might have a classifier there. Guardrails at the end of the LLM generation, before you produce a response, also make sense. And so often this ends up growing as the scope of what you’re hoping to achieve grows. And it puts real pressure on
20:44 — latency concerns and so forth. So, no easy answers except that it becomes one more or two more components to think about. Have very clear guidelines and SLAs on how much budget you can really associate with them, and then independently scale them as needed. But unfortunately, no easy answers. Yes, especially because when you have things like the edge already... Yeah. An answer which was already not what it should be, but the classifier, the guardrails, were something
21:15 — Yeah. I should have lied. You can’t take back things that are spoken. You might have to do something like, “Hey, sorry, I shouldn’t have said that. I need to revoke that,” or something. So, this is really a problem I see there. Yeah, I think that’s spot on. Catching all of those before the TTS model gets invoked is certainly important. One of the other patterns that we’ve seen is this thinker-talker pattern, where you might have a small LLM that is handling all of the conversation. And so as it gets text from speech to text,
21:45 — it produces a response, and the response might look like “let me think about it” or “let me get back to you”. And then it basically issues one tool call to a much bigger model that has better instructions, has all the tools associated, maybe more guardrails. And then it produces a much cleaner response that you’re much more comfortable with. And so that gets fed into the TTS model to produce a response. But in some ways this is the beauty of
22:16 — all of these architectures: we’ll only add components. And so it pushes more on reliability, on having detailed observability on every single component. Yes. I was just curious about the upcoming voice-to-voice conversations coming out. And I know that this is upcoming, but at the end of the day all of the
22:46 — surrounding infrastructure that we have in the conversational systems now, and all of yesterday’s pipeline approach: how do you do evals in the product, and observability? Is there a need to transcribe everything? Yeah, very good question. So, how do observability, logging, and evals change? In some ways, some parts of it can still remain the same. So, sometimes what you might have is, aside from pure speech to speech, you might actually
23:16 — have a transcription model that is running so that you can at least see the transcription alongside, at the same time as the audio is being generated. So, that gives you some amount of auditability in terms of what audio is coming in and what audio is being produced. But yes, some evals are going to change. There’s no real concept of text to speech anymore. There’s no concept of pure text to text here anymore. And so the evals become much more full-duplex conversation evals, which is a much longer conversation
23:47 — and then evals and metrics that are focused on that entire conversation. All right. But what I mean is: are those models, by nature, able to output the characteristics of the conversation in a way that you can evaluate? Is that native to the pipeline or is that something on top? Typically a lot of the evals would happen on top of the base inference API. I think we’re running out of time, but okay. Thank you, everybody, for
24:18 — being here.