Transcript: Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents — Michael Hablich, Google

AI Engineer · 22:38 · Added Jun 7, 11:51 am GMT+8

Source video ID: _B4Pv9ttFgY

Transcript

0:07 — [music] » Let’s get started in the interest of time, right? So, hi. Welcome. Let’s talk about building agent interfaces today. So, let me start with a question first. Who in here is already using MCP servers or CLI tools with your, uh, agent? Okay, everybody? That is unsurprising, to be honest.
0:37 — Um, who in here has already built MCP servers and deployed them for effect? Okay, it’s approximately half of the people. Well, today I’m going to share four engineering lessons from the Chrome, uh, DevTools team on how we built Chrome DevTools for agents and how we deployed it for effect. Quick context setting. Chrome DevTools for humans is used by millions of web developers on a daily basis to debug web pages. It’s directly built into Chrome and developers use it to
1:08 — debug web pages: find errors, audit them, profile their performance, and so on and so on. Uh, right. So, now let’s talk a little bit more about Chrome DevTools for agents. So, this is a purpose-built Chrome DevTools, but for agents. How surprising. Um, yeah, let me briefly show you how it works. So, you can see Gemini CLI on the left side, and the prompt is being entered, and now, uh, Gemini CLI has the MCP server configured, and that opens Chrome
1:38 — on the right side, and then does debugging. I think yes, this is about performance tracing. So, it does a performance trace, analyzes the trace that comes back or the performance insight, acts on it, and then makes the web page faster, uh validates that it’s actually faster afterwards, and it’s done. You should be done now. It’s nearly done. Sorry, it’s the video. Whatever. What I wanted to tell you is like, yeah, this is going to work in any MCP client and uh agent harness that is
2:08 — MCP capable. Doesn’t really matter. That was Gemini CLI; it also works in Claude Code, Codex, OpenClaw, doesn’t matter. If you want to have more information, go to this QR code because that QR code is going to, uh, bring you to a web page, and that’s going to tell you everything about how you can install it and configure it for your agent harness. Question again. Who in here has already tried it out? Okay, so 10%? Thank you.
2:38 — I love you. I also love everybody else, but I love the others more. So, I was rude. I didn’t introduce myself. Uh, my name is Michael Hablich. I am the product manager for Chrome developer tools, uh, at Google. And I’m also a guest lecturer at the university nearby where I’m living. I have 20 years of experience in tech: developer, tester, QA engineer, project manager, product manager, program manager, and so on.
3:08 — If you have questions afterwards, please talk with me in the hallway track or, uh, connect with me over LinkedIn. Uh, the QR code will bring you to my LinkedIn page. Both are fine. Uh, please do that. I would be super interested to talk with you about MCP servers, browser automation tools, and stuff like that. Okay, but now enough advertisement. Let’s move on. We shipped Chrome DevTools for agents because we saw that coding agents were flying blind. So, like 1 and 1/2 years ago, they were very good with generating
3:39 — code, but they were not able to validate what they actually were doing, right? And that just sucked. But we assumed they are going to be fine if we throw a lot of data at them. Because I mean, they are machines, right? The thing is, we were wrong. So, this is the head of a trace file. A trace file has all the data about the performance profile. And this is a file with multiple megabytes of data, and this is like
4:10 — 50,000 lines of JSON, and we did throw that against a common agent harness at that point, like 1 and 1/2 years ago, something like that. And unsurprisingly, this is too much data for an agent, for a model, to actually reason about, and it blew through the context window. And if you have seen Matt’s talk about the dumb zone, you are moving the agent into the dumb zone at that point. So, I thought, okay, we built it wrong. That’s not going to work. We need to do something else. So, in that case, for example, what we did, our performance tracing endpoints,
4:41 — it can also return that for post-processing, uh, with other tools, but what it’s really doing is it’s returning markdown now and semantic summaries. Like, this is an example of such a semantic summary, which just gives you information about typical performance metrics like largest contentful paint, INP, and so on and so on. I’m not going to bore you with all the, uh, performance metrics. And we are going to talk about them anyway because they are a very good example of how this is, uh, working. Well, essentially, we didn’t force the
5:12 — agent to read the entire book, the trace, but instead we just pointed it at the right sentence, and this is the semantic summary. That works quite well. In the end, uh, or in the beginning, agents are a different user class. So, that’s when it kind of clicked for me, like, ah yeah, they are kind of like a separate user segment. So, how do we reason about that? The thing is, agents and humans, they share the intent, they share the goal, right? In our case, for example,
5:43 — both want to identify errors in a page and want to fix those errors. But they think differently. They have different cognitive bottlenecks, more or less. For humans, it’s a lot about visual complexity. So, humans are typically very visual, uh, creatures, and we need layout, we need color in order to find a signal. And this might not be the best example for uh
8:09 — efficiency. Uh sounds very happy, and maybe it is. I don’t know. Um What is it about? So, effectiveness is about does the agent complete the entire user journey? Is the functional intent actually fulfilled? Yes or no. And then there’s efficiency, which unsurprisingly is about token cost, tool calls, duration. In the end, what tokens per successful outcome tell
8:40 — you is the fuel efficiency of your interface, right? And there’s a caveat because there’s always a caveat. Fuel efficiency is relatively worthless if you can’t reach your destination. So, that’s why it’s called tokens per successful outcome and not tokens per outcome. So, make sure that you actually also measure effectiveness, right? And there’s one more caveat. And this is
9:10 — you can’t measure that globally. I mean, you can do that, and maybe you want to do that because it’s a nice metric, but it’s going to be tremendously different between different user journeys and task classes. So, don’t compare them globally, compare them within the user journey that you’re measuring. Uh, and what I mean with that is like, for example, in Chrome DevTools, we have the user journey of web scraping, right? So, an agent going to a website and extracting information. That’s relatively cheap.
9:41 — But there are also user journeys that are more intricate, like debugging a website, finding out why the responsive layout is not working. That thing is going to use more tokens, but that is fine because it’s a much more intricate, uh, and more interactive session that’s happening. Okay, what does this look like in real life? Uh, this is what it looks like in practice, for an internal project that we’re working on, and you see a lot of neon bars, which is great.
10:11 — I like neon bars. But, I am not going to worry about details. The important thing is the neon bars on the left side: the longer the bar, the more effective the tool. Right? The shorter the bar, the less effective the tool for the particular use case. So, each of those bars on the left side is about a use case. Uh, which means the smaller bars are probably the ones that we should be focusing our work on next: how to improve that, how to improve the tokens per successful outcome there.
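A minimal sketch of how tokens per successful outcome could be computed per user journey rather than globally. The `Run` record, the journey labels, and the choice to count tokens from failed runs in the numerator are assumptions for illustration, not the team's actual measurement setup.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    journey: str     # hypothetical journey label, e.g. "web_scraping"
    tokens: int      # total tokens the agent spent in this run
    succeeded: bool  # did the agent complete the whole journey?


def tokens_per_successful_outcome(runs):
    """Return {journey: (success_rate, tokens_per_success or None)}."""
    tokens = defaultdict(int)
    successes = defaultdict(int)
    totals = defaultdict(int)
    for run in runs:
        tokens[run.journey] += run.tokens  # failed runs still cost tokens
        totals[run.journey] += 1
        successes[run.journey] += int(run.succeeded)
    report = {}
    for journey, total in totals.items():
        rate = successes[journey] / total
        # No successes means no meaningful "fuel efficiency" for this journey.
        cost = tokens[journey] / successes[journey] if successes[journey] else None
        report[journey] = (rate, cost)
    return report


runs = [
    Run("web_scraping", 4_000, True),
    Run("web_scraping", 6_000, True),
    Run("debug_layout", 30_000, True),
    Run("debug_layout", 50_000, False),
]
report = tokens_per_successful_outcome(runs)
```

Reporting the success rate next to the cost keeps effectiveness visible, and keying the report by journey avoids comparing a cheap scraping task against an intricate debugging session.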
10:42 — Yeah. Uh, as you might have already guessed, measuring tokens per successful outcome is not straightforward. The thing is, even an imperfect measurement is better than simply making gut-driven decisions. And with that, at least you can make data-informed decisions. Right. Ian, sorry. Uh, I had my audio on. In DevTools for Agents, we’re
11:12 — addressing, uh, token burn from three different angles. First, there’s tool categorization. So, very straightforward, we hide niche tools behind command line parameters. Like, for example, uh, we have tools for Chrome extension debugging. And not everybody is developing Chrome extensions. So, why add it to the default context menu? There’s no point in doing that. Then there’s a slim mode. And this one is fun. So, what slim mode
11:43 — is doing is, like, uh, pushing tool categorization to its limits. It’s only exposing, I think, three different tools: select page, navigate page, and evaluate script. And this is great for your context window, but there’s a trade-off. I’m going to talk a lot about trade-offs today. Um, there’s a trade-off because the fewer tools you expose, the fewer tools are also at the disposal of your agent harness, which means your agent might do extra turns to achieve the same goal. It might not
12:14 — actually have the right tools at its disposal to actually do something, like for example, getting network requests. You can’t do that with evaluate script. And stuff like that. Yeah, and there’s also a CLI interface. Uh, sorry, there’s a command line interface that we’re offering. Uh, you have seen the previous talk maybe about a code model and all that stuff. We also support that. So, it is an MCP server, yes, but there’s also, uh, a command line interface for the same thing, giving you nearly the same functionality. What it enables: you can have your agent chain commands
12:45 — together to do post-processing. Like in this example, I don’t think you can see it. Um, the accessibility tree is extracted with a grep, uh, command, and then the result, the ID of the control, is being piped into a click command. And that is, of course, saving a lot of tokens because the model doesn’t need to process all the tokens. The token post-processing is happening on your computer. Right. Um, efficiency is useless if your agent gets
13:16 — stuck. So, that brings us to error recovery. And yeah, because every time your agent encounters an error, it’s going to cost you tokens because it needs to retry, it needs to understand what is happening and stuff like that. And that just sucks. Oh, yeah. Error recovery is a spectrum, and let’s talk a little bit about that, what we are doing here. So, first, of course,
13:46 — you should add, uh, useful error messages. That sounds obvious. Uh, for a lot of tools, it isn’t. And it was also not obvious for all the tools that we actually offered, so we also did, uh, a few iterations on them to actually make the error messages good. Like here, for example, for a particular tool: “Unable to navigate back in the selected page. History entry to navigate, uh, was not found.” We actually added the last sentence, and that enabled the agent to self-heal, which is super useful because then the agent doesn’t need a human to actually fix the
14:17 — problems, but the agent can, uh, self-fix the problems. Then there are proactive detours. So, beneath each of the agents, there’s a model, and the model is being trained on certain data. And sometimes there are things where you want to counteract the training data. And that’s what you can do with proactive detours. Like in this example, um, we detour the agent for performance profiling to our start performance trace tool and not to the Lighthouse audit.
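A rough sketch of the two ideas so far, an error message that explains why the call failed and a proactive detour toward the preferred tool. The handler names, the tool name in `TRACE_TOOL`, and the exact message wording are placeholders for illustration, not the actual Chrome DevTools for agents implementation.

```python
# Placeholder tool name (assumption), not taken from a real tool list.
TRACE_TOOL = "start_performance_trace"


def navigate_back(history, index):
    """Hypothetical tool handler: return (ok, message)."""
    if index > 0:
        return True, f"Navigated back to {history[index - 1]}."
    # Say what failed, then why, so the agent can choose another action
    # instead of retrying the same call.
    return False, (
        "Unable to navigate back in the selected page. "
        "No history entry to navigate to was found."
    )


def run_audit(categories):
    """Hypothetical audit tool that detours performance requests."""
    if "performance" in categories:
        # Proactive detour: steer the agent to the preferred tool.
        return f"For performance profiling, call {TRACE_TOOL} instead of this audit."
    return f"Audit finished for: {', '.join(sorted(categories))}."


ok, message = navigate_back(["https://example.com"], 0)
detour = run_audit({"performance", "accessibility"})
```

In both cases the tool response itself carries the next step, so the agent can recover without a human stepping in.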
14:49 — And there’s diagnostic playbooks. Uh, so, we also offer, uh, skills, of course. And we have a skill that’s called, uh, troubleshooting, and we see a lot of people have problems setting up the Chrome DevTools MCP server correctly, and that troubleshooting skill is then going to kick in and help the human and the agent to fix the setup issues. Again, enabling self-healing of the agent. And all of this increases the resilience of your product, uh, of the agent
15:20 — harness that you’re building. And this is nice and helps you with handling the mistakes. And now, let’s talk about discoverability, which is about actually preventing them. So, our initial design had one monolithic tool called debug webpage. So, we only had one tool, debug webpage. And another agent could send a prompt there and tell it, “Hey, debug this webpage. There’s some responsive layout that’s not working.” And it was neat from an engineering perspective, but it didn’t really work.
15:51 — Uh, so, we decomposed it into 25 different tools. And did we solve the problem? Of course we didn’t, because we traded that problem for another one: agents now had 25 tools at their disposal. How are you going to find out which one to use when? Well, let’s talk about that. According to this paper here, uh, 97% of MCP tool descriptions have quality smells. And this matters because
16:21 — the schema is the UI for the agent. So, let’s make the UI better. And fixing this is a trade-off, as I said. It’s always a trade-off. Because, of course, you can make the descriptions better, and that’s going to increase your context window size. So, you probably don’t want to have that. Or maybe you want. And also, smaller models in particular are not that good with more descriptions because they get biased toward using tools they shouldn’t be using in the first place. There’s a trade-off space. Uh, read the paper. Super interesting.
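One way to make this trade-off visible is to estimate what each tool description costs in context and to flag tools with no description at all. This is a minimal sketch: the tool names and descriptions are made up, the four-characters-per-token figure is a crude heuristic, and the checks are not the quality smells defined in the paper.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return len(text) // 4


def description_budget(tools):
    """Return (tool name, estimated description tokens), largest first."""
    costs = {name: estimate_tokens(desc) for name, desc in tools.items()}
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tool list, for illustration only.
tools = {
    "navigate_page": "Navigates the selected page to a URL.",
    "trace_performance": (
        "Use to find front-end performance issues and "
        "Core Web Vitals (LCP, INP, CLS)."
    ),
    "mystery_tool": "",
}

budget = description_budget(tools)
missing = [name for name, desc in tools.items() if not desc.strip()]
total_tokens = sum(cost for _, cost in budget)
```

Tracking `total_tokens` over time shows what richer descriptions add to every request, while `missing` lists tools the agent can only guess about.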
16:52 — Uh, yeah. But, there are a few things you actually should be doing that are relatively uncontroversial, and that is, um, define purpose. Clearly explain what the tool’s core function is. Is this working? Ah, yeah. Here it is. Come on. Yes. Ah, it’s a... Okay. Clearly explain the tool’s core function, and there’s usage guidelines, like provide clear activation criteria. And what does it look like in, uh, Chrome
17:23 — DevTools for Agents, for example? And, uh, again, the performance start trace tool. What we have in there as a description is that it’s used to find, uh, front-end performance issues and Core Web Vitals: LCP, INP, CLS. Why is this relevant? LCP, INP, CLS are web performance metrics, and an agent is able to make the connection, “Oh, I’m going to use that tool if I need to, uh, improve page load, for example.” We are far from finished optimizing that because models and agent harnesses also
17:53 — keep on, uh, changing all the time, so it’s kind of like an endless quest for the minimum viable description. Yeah, but it is what it is. You can supercharge all of that with skills. As I said, we also have skills, and that is great, uh, in particular if you have more intricate workflows. But again, there’s a trade-off. They are not a free lunch. If you pile in too many skills, uh, you’re going to shift the problem and run into the same problem again. Uh,
18:23 — agents are going to call your skills even if they shouldn’t be calling them. Uh, your context window size is going to increase and all that stuff. The trade-off is shifting, it’s not disappearing. Okay, now we have optimized for cost, recovery, and discovery. Let’s now talk a little bit about trust because you don’t want to have a backdoor into the system. Chrome DevTools for Agents has a feature called autoconnect, and it’s kind of like it lets you, the human, using your coding agent, like Claude Code, share the screen with the agent, like,
18:54 — “Hey, I’m stuck here. Uh, please help me debug that and fix the problem that I’m seeing here.” Amazing feature. I really like it. Um, and users, of course, requested the feature, “Hey, why do I need to click allow all the time? I don’t want to do that. Please remember my choice.” And in traditional user experience design, that would have been a clear win, right? Because it’s just fixing the friction that you want to remove. In a world where you are delegating away work to agents and automating away
19:26 — agents, you need to think about trust boundaries. And so, that’s why we actually designed it, uh, so that the friction is actually by design, because we didn’t want to have that. And why? Let’s talk about that. There’s a blog post from, uh, Simon Willison about the lethal trifecta. QR code. You should read it, it’s great. Uh, I’m not going to talk more about that. And utilizing that, there are three, uh, at least three tiers that I’m thinking
19:56 — about in browsing agents, more or less. You have tier one and that is the local development environment. In the local development environment, you have the human in the loop and the human wants to grant access to the default Chrome profile, to the data that you already have access to, uh, to the agent in a time-bound manner. And then you have, uh, tier two, and tier two is, uh, agents running in a continuous integration environment. So, it’s controlled environments, but they’re separated away. At that point, you should be using data separation things like containers, of course, but also other things like
20:28 — separate Chrome profiles and stuff like that. Um, if you want to connect to them, we also have a mechanism for that and that is called, uh, remote debugging port. And third, there’s agents with full internet access and that is YOLO mode essentially because every webpage out there is able to do some prompt injection attacks on your agent. So, make sure that you do the same things as in tier two, but in tier three, uh, also make sure that they have a domain allow list and prompt injection
20:59 — mitigations, all that stuff together. Going back to the lethal trifecta, that’s what we mostly reason about, uh, in tier one, and that’s where all those three things are coming together. So, that’s why we actually say no, the human actually needs to consent every time. Key point being: a local agent, tier one, and a browsing agent fleet, tier three, your research agents maybe, might share a tool like Chrome DevTools for agents, but they shouldn’t share anything else about the security model that you’re having,
21:29 — right? Okay. Let me wrap it up. User experience is evolving to incorporate agent experience. An agent is just another type of user, segment of user, also with non-functional requirements. Efficiency, discoverability, security, stability, and so on and so on and so on. I shared four takeaways from Chrome DevTools for agents, um, from while we were implementing that. That is: measure fuel efficiency of the interface with
22:00 — tokens per successful outcome. Turn errors into recovery playbooks. Audit descriptions for intent and never compromise trust for convenience. Agents are our next users. Let’s help them help us. And with that, I wish you a nice remaining conference. » [applause] [music]
22:32 — [music]