Transcript: Deep Dive: How to Monitor AI Agents in Production
Source video ID: 5yXLZTIqBsU
Transcript
- 0:00 — Welcome to our webinar. We will we will let people trickle in, but today we will be talking about production monitoring for agents. Um so this is the second one that we’ve done recently or at least that I’ve done recently. Um the first one was more focused on some of the testing and evals and debugging of agents. And now in this one we’re going to be talking pretty heavily about monitoring agents in production, which we think is a bit of a different
- 0:30 — uh story and different different kind of like part of of the process. Um mild mild housekeeping as uh people roll in. Uh this is this is being recorded and and so we’ll put it up later. Um there is uh a chat uh where um you can ask uh questions and chat with folks, but more importantly, there is a Q&A section. So if you go at least on my on on my screen, it’s over on this like more side
- 1:01 — of the of the thing. And if you click there, you can see kind of like the the open um the open questions there. Um cool. So hopefully everyone can share my or can see my can see my slides. By the way, my name my name’s Harrison, co-founder CEO of of LangChain. Um uh in addition to the open source stuff, which we’re known for, we do a lot of work on observability and evals and monitoring agents in production. And the reason we spend so much time on on
- 1:32 — agents in general is that they’re really unpredictable and they’re and they’re hard to get working. And we talked a bunch about kind of like debugging and evaluating them in the previous webinar. Um and then uh and and and then this one’s really focused on uh what happens post kind of like launching them to a bunch of users. And one of the things that we’ve seen as we’ve worked with a bunch of companies is that you don’t really know what your agent will do until it’s in production. Um and so so what what do I mean by that
- 2:03 — and and why is that like interesting? So traditional software is generally um predictable as as as far as things go or at least compared to agents. So it’s deterministic. You you run you run you run code in the same order every time. Um the code is is the source of truth. You know what will happen by kind of like looking at the code. You can debug things by identifying failed functions in a stack trace. And then if you and then if you take those same things and and and bring them local and run it locally, it should it should run the same. And
- 2:34 — you monitor with kind of like traditional APM tools. And compare that to agents. Um where there are few sources of of of difference between agents. Um so so agents are non-deterministic. They’re also not very robust at all. So in addition to being non-deterministic, if you have if you have if you change the the words um slightly, you get pretty drastically different results as well. Um code uh code tells you what the system prompt is and the tools are, but it doesn’t tell you what the agent actually
- 3:04 — does or what it did. And so that’s why we like to say like traces are the source of truth. So so traces really tell you what what’s going on. Um debugging um debugging for agents, what does this look like? Well, you’re debugging reasoning. So now you need to understand the decision context at each step. What uh what what what decision was made? How is that made? It was generally made because there was certain context that went into the LLM. And so what context went went into the LLM? Um and and and in order to do this, you
- 3:35 — need to track threads of conversations. You need to track the inputs and outputs of tools. You need to track the reasoning context. And then from all of that data, then you can start to do things like product analytics and and surfacing errors and things like that. And so this is one big way that we think uh traditional software is is is different from agents. Um and a a second big way is that agents have an infinite input space. So you know, if if uh if you chat with a chatbot, you can say anything to it. And
- 4:06 — and and that sequence of words is passed into the into the agent and passed in as tokens. And so there’s this um there’s a theoretically like infinite number of of different combinations of tokens that you can pass into an agent. And if you compare that with software and and clicking around, when you click around, there’s generally like a pretty fixed set of actions that you can take. If you have three buttons on the screen, you can click those three buttons. If if there’s four, you can click four. And so like agents have a much more uh larger space of of of inputs into into uh into
- 4:40 — them. And and so that makes them, you know, you you you if you combine these two things, that makes it really hard to know what’s going on in agents. And that’s why production monitoring for agents is is is different and should be treated differently than production monitoring for for for software applications. Um and so instead of just instead of just monitoring system health and by the way, you you should absolutely still monitor system health. You should still you should still use uh a legacy tool like Datadog or something like that to monitor kind of like what’s going on
- 5:11 — with the CPU and what’s going on uh with with RAM and stuff like that. But um you you also want to monitor the interactions themselves. Um so you want to you want to monitor what the agent is doing. And in order to do that, you need to capture these kind of like prompt response pairs. Um so for for each uh LLM call, you want to capture kind of like the full prompt that went in. So the system prompt and the human messages and everything. And then you want to capture the response. And then you want to capture the multi-turn context as well. So a lot of agents have a human in
- 5:42 — the loop in some form. So even chat is very human in the loop. You message it, you get a response back, you have some back and forth. And so you want to capture these these backs and forths so that you can understand what the context was. And then as the agent runs in the background for a lot of steps, you want to capture the trajectory, which is basically the sequence of tools that it that it called and in what order, and then any intermediate steps as well. Um one one thing about agents is they often, you know, agents often do open-ended and ambiguous uh work.
- 6:15 — Um so uh like they write essays, they write code, things like that. Often times, in order to judge that work, you need human judgment. Like what what is a good essay? What is a good summarization? Um those are those are, you know, those are pretty is a joke funny? Those are pretty kind of like human things. Um and so a big part of how we think about LangSmith is bringing human judgment to these traces to to these actions and help giving some evaluation there. The issue is that human judgment doesn’t really scale at production uh volume. So
- 6:46 — if you have a big massive traces, looking at every single one is not super feasible. Um one of the ways uh there’s two ways that we’ve uh decided to tackle this. One of the ways is around annotation queues. So annotation queues still use humans to to kind of like look at things and label things. Um but they’re set up kind of like dedicated views and you can route specific traces there. So you might only choose to look at ones that get a thumbs down or something like that. Um or some other signal that there’s something kind of like uh anomalous about about these
- 7:16 — traces. Um and then we try to make this as easy as possible. So you can define these review rubrics uh that evaluators should follow as they’re labeling things. Um you can have multiple people review them. So you can say, “Hey, each trace needs needs two people to review or something like that.” Um and then and then you can also turn these traces into evaluation data. So one of the things we’ll be talking about a bunch uh next month and the month after is this like data flywheel you can get going. And so I won’t spend too much time on that today. But but this this flywheel of
- 7:46 — trace to evaluation data is really important. The other way that we think about scaling human judgment is by using LLMs to approximate it. Um so LLMs uh you know, have have some measure of of of intelligence. Um and and and if you get them calibrated with what you as a human think is funny or think is concise, then then that’s great. Um and so one of the things that we have in LangSmith is we call online evaluators. These use LLMs as a judge. They run automatically on production traffic, either on all runs
- 8:17 — or a sampled subset. Um because you just have like the the the execution, you don’t have a ground truth. So you can’t test for correctness, but you can test for kind of like coherence or tone or things like that. Um safety and compliance is another thing you can test for. So whether it contains sensitive information or or violates any policies or things like that. Um if you have a specific output format that you want it to follow, um you can also test and assert that it that it follows that. Is it is it two paragraphs as a response?
- 8:48 — Or did I only want it to be one sentence? Um and then you can also use LLMs not for like these scoring. And and we’re thinking by the way, and I’d love suggestions from folks. Um evaluators um of like we we call these online evaluators. I think evaluators is actually a terrible name for this because they do way more than than evaluators. So what I mean by that is you can use them to classify like the types of questions that people are asking. So we have a chat LangChain instance. Are people asking about LangGraph? Or are they asking about deep agents? Or are they asking about LangSmith? We you know, we we don’t
- 9:19 — know. We we use LLMs as a judge to kind of like classify these things on these three dimensions. So I I I think evaluators is actually kind of a bad name for this and we’re looking for a better name. So if people if people have suggestions, would love to hear that. That being said, um LLM-based evaluations are are not, you know, a a walk in the park park. Um the what one thing even before you get into some of these is just like aligning them with what you want in the first place. You know, these LLMs don’t always do what you want them to do. And they
- 9:50 — don’t always know what you want them to do either. And so one of the things that we have in LangSmith is is evals as a way to basically align these evaluators to your preferences. Um so we help you build evaluators in in in some sense. Um even after that, there’s still a bunch of challenges that you’ll run into. Um so evaluators uh evaluators take time. And so this is why people run them kind of like as online e-vals and not in the loop. So if you have kind of like a chatbot with a user, um you could
- 10:20 — instead of running this in the background. So so maybe just clarify like in LangSmith, what happens is you run your agent comes in and then you run an LLM as a judge evaluator over it. An alternative thing you could do, which we call like in the loop e-vals, is like as the agent is running, you could run uh you could run an evaluator there and then based on that, you could either return it to the user if it’s fine or or pass it back for another loop. Now, these evaluators oftentimes add latency and that’s why it’s this tradeoff of hey, should should I run it in the loop um or should I run it offline? And
- 10:50 — depending on the situation, you’ll you’ll choose different things. Like if if if you’re fine with your users waiting a little bit longer and you need to have, you know, accuracy or uh coherence or something like that, then what you’ll do is you’ll just put it in the loop. Um if if if it’s a higher priority to get something back fast and this isn’t like a uh you know, absolutely critical thing that you’re testing for, then like online evaluators make a bunch of sense. Um cost, like these these LLMs can cost a bunch. We recommend using smaller
- 11:20 — models. Um specifically uh we uh well, so to help with this as well, we also recommend just doing kind of like classification. Like is this is this is this cohesive or is this not cohesive? Rather than trying to rank uh uh cohesion on a scale of 1 to 10. The reason for that is that models are just better at that and that lets you use a cheaper model, which can drive down cost. But even then, we recommend kind of like sampling like a percentage of traffic. Um and then I talked about this a bit before, but accuracy of these
- 11:51 — evaluators is a big thing. And then even if you get it working one day, you know, the next week, the inputs might have drifted. And so and so you might need to recalibrate it. And so this like calibration of evaluators is is an ongoing thing. So, um what are some of the tools for production observability? So one of the things that we’ve built in LangSmith is Insights Agent. Um and so Insights Agent is useful for a few reasons. It’s it’s really useful for
- 12:21 — figuring out almost like the the unknowns of of your agent traffic. And there are generally like a lot of unknowns in in your agent traffic because there’s so many different things that could come as input and there’s so many different trajectories that the agent could run through. So there so there’s a bunch of these like very open-ended things that you might want to just discover things about. You maybe don’t have pointed questions, but it’s like what are people using my my agent for? So that’s what we call like usage patterns. Um and so here you can actually see, I think this is is from a coding agent um
- 12:53 — uh a a LangChain coding agent. Like what are people about this coding agent to do? Developer experience or platform things or indexing or front end or security. And so we can see a breakdown by topic here. Um error modes is another one. So where is the agent making making mistakes? Um and and that’s kind of related to like edge cases, which is like a kind of a combination of the two, which is like what are unexpected queries that users are sending that you didn’t account for. So this is useful. The way that this works um under the hood is we is we basically run uh a smaller LLM on on on
- 13:26 — each of the data points to extract a summary along with some relevant info depending on what you’re looking to um get out of it. And then we do a bunch of clustering on top of it to come up with these clusters. And so actually, if you clicked into this cluster, there would actually be subclusters. So we cluster things at two levels. I mentioned this already, but like online evaluations are pretty crucial for continuously monitoring the quality of of your agents. Um and so in order to set these up, you basically need to decide which traces to evaluate. Do you evaluate all traces? Do you randomly
- 13:56 — sample traces? Do you evaluate traces that took longer than 10 seconds? Do you evaluate traces that got a thumbs down on them? So you need to you need to kind of like decide that. Um and then what to evaluate. Um so do you evaluate for quality? Do you evaluate for safety? Do you check format? Uh or is or is there some kind of like other criteria that you look for? Um when to alert. And so this is this is the last thing as well. One of the things in the platform that we have is alerts. And so alerts are basically thresholds that trigger notifications
- 14:26 — when things drop below a a a certain point. Um one of the one of the questions we get asked is like okay, so like there’s these philosophical differences around why agent uh observability is different than traditional observability. But like what are the practical differences? Like why can’t I use a traditional kind of like APM tool? So there’s three main reasons that we see. One is the one is just due to the nature of the data that you’re capturing. So like agent traces are
- 14:56 — large multimodal payloads. Um and in order to ingest them and query them efficiently, uh a different type of database and different type of database ingestion patterns are needed. And so we spend a lot of time on that. Um and so this is this is one place where uh where where where other tools will kind of like fall down a little bit. The second is connectivity. Um so so building agents is really iterative. You can you can take these you can take these traces and use them to turn into e-vals as I mentioned
- 15:26 — before. We’ll be doing a really big uh push on this in the next uh two months. But um at the at a really high level, like building agents is is very iterative. And so you need it you need these traces to be connected to other parts of the app, whether that’s the playground or the prompt hub or the e-vals or the annotation queue or things like that. And so these everything that I mentioned is pretty like agent specific. They don’t exist in in other general purpose tools. And so that connectivity is the second piece. Um and then the third is the users. Um
- 15:56 — so uh the this whole like workflow that I’m describing is is really uh meant to be kind of like cross-functional spanning like product and engineering and and and domain experts that are in the platform working to improve improve the agent. And that’s a very different persona than than infra teams. Um and and so the platform in terms of like the user-friendliness of it and what you can do there and the onboarding experience all needs to be different as well. Um and so these three reasons are
- 16:26 — are why we see people choosing kind of like a specific LLM or agent observability tool as opposed to a a standard observability one. Cool. So um yeah, a lot of a lot of what we’ve said around kind of like online e-vals and even insights has really been um uh uh focused on this idea of monitoring things in production and like what what is the difference between this and kind of like traditional testing? So another And by traditional testing, I mean like
- 16:56 — offline e-vals, so building a data set and running and running the agent over it and then scoring it, which is a pretty common practice that we see people doing. So so these are all ways that that um you and and by the way, you you should absolutely do that type of offline e-val. Um we did a webinar on this on this last month or a month and a half ago. Um and and uh that that’s a key piece of of what it takes to build confidence that your agents are functioning in production. Um but there’s other things that you should do as well. You should monitor the quality of these agents as
- 17:26 — they actually are in production. Um so great, test them before, but monitor them real as well. Sometimes the production traffic drifts from from your uh offline e-val data sets. Um you should you can tag uh what what what’s happening in production. We talked about this, but e-vals are a bad name, but you can tag these um you you can tag these traces with specific things. You can evaluate trajectories as well. So like uh typical testing kind of like has an expected answer and then you just compare the output to the expected
- 17:56 — answer, but it’s also really interesting to think about what the what the process to get that answer was. Um was how many tools did it call? What path did it go down? And so this is what we call trajectory evaluation. Um and then security and safety is one of the things that we’re thinking a bunch about. Um so uh do do these outputs or inputs contain kind of like sensitive information or policy violations or things like that? In this vein of monitoring a lot of a lot of this uh data, um another thing
- 18:26 — that we’ve had in LangSmith for a while, but becomes even more powerful as you go into production, is dashboards and alert. Um so like by default, we set up dashboards for the number of LLM calls, the tokens, the cost, the the tool calls including kind of like latency and errors and costs as well cuz you can track that. Um any feedback scores and um those those I think those may be the main things. Um you can also set up kind of like custom dashboards if you have domain specific insights that you want
- 18:56 — to track. Um so here it looks like we set up a a custom Here’s the screenshot of setting up a custom dashboard where I have the support assistant agent I want to track kind of like this specific kind of like metric with this specific filter. Um and then I and then I can have a dashboard that that has exactly that. Um you can alert when when thresholds are kind of like crossed either via webhooks or via PagerDuty integration that we have. And then the really cool thing that you can do is you can click into specific spikes and see the traces associated with that. So here, you know, there’s a spike. I click in and I can see the five traces
- 19:26 — that caused that spike. Um this uh this is uh oops, sorry. Um this is what I was kind of talking about a little bit before when there’s this kind of like um connectedness of the platform. So production monitoring should feed the improvement loop. Um you should have production traces uh where you capture everything that that kind of like uh happens in production. And and and and in behind the scenes
- 19:56 — there’s there’s failures. There’s educations. You these go into annotation queues where you can review and label them to make sure that you’re flagging the right things or to give the ground truth or to give feedback on what should have happened. Data sets can then be built from these from these annotation queues. And so now you have these data sets which contain basically guidelines for what what what should happen. You then you then test you make some changes to your agent and and you and and you run experiments to see to see if that fixed it. You you then push a change to to production
- 20:27 — deploy it and then you run some online evals to make sure that the the change kind of like you know actually took effect as you were as you were expecting and hoping. And and and then you know there’s and then there’ll be more things that pop up in this flywheel kind of like repeats itself. I mentioned this before a little bit when we talked about persona but just to reiterate it like agent observability is for more than just engineers. So if you’re listening to this or if you’re in this audience and and and you’re not an engineer but you have opinions on how agents should be built or how they
- 20:58 — should behave like great. There’s almost certainly kind of like getting agent aligning aligning your agents to work well is really hard and there’s a lot of different things that make up that process. So like you know AI engineers product managers subject matter experts data scientists » [snorts] » they all skills from all of those disciplines are are really relevant. Really like you know AI engineers I’d say are like a hybrid between engineers and product managers and data scientists and subways.
- 21:28 — Like understanding how the agent’s being used is like a very product manager thing. Understanding like the whether responses are good or not like that’s a domain expert. Data scientists are often great at kind of like analyzing patterns or building evaluations or running evaluations. And then and then often times like making sure like the the prompt engineering or the context engineering now is is done by kind of like the AI / ML engineers but but in order to do this you you do need kind of like background
- 22:00 — or some knowledge of all of these things. Um Oh sorry just to just to repeat some of the open challenges that I would love feedback from people here if if they have kind of like ideas on on how to on how to handle but like some of the things we’ve seen you need evaluators that work well. This is this is the big thing. How how do you get LLMs as the judges that actually work? We we have some stuff with the line evals there. This is a big focus for us. How does this how does this cost scale? If you’re running over millions of
- 22:31 — traces what does that look like? We’re very excited about open source models and cheap fast models for this reason. And then how do you do this in a secure and compliant way? LangSmith is is SOC 2 compliant. We can be HIPAA compliant. We we also run in self-hosted environments for some of our larger enterprise customers. And and this is I want to I want to end on this slide because this is a hint of where we’ll go in the next month or two but
- 23:01 — like we’ve we talked about kind of like debugging and testing before you go to production. We just talked about kind of like production traces but production traces aren’t the end. They’re going to be the start of this continuous improvement loop and and we’ve already hinted at some of the stuff that go into this but this is where we’re we’re really going to kind of like make sure drill down make sure this works and and add more things on over the next few months. So I want to end there and I think we got about like 20 minutes for questions. I saw a few coming into the into the chat and into the QA.
- 23:32 — So I’m going to start in the QA which again for people who don’t know if you look down on the bottom QA is all the way on the right. There’s eight things there now. So I’ll add some stuff there and I’ll try to get to them. Um What is the most common mistake you see teams making when they first set up production monitoring for agents? Um I mean the first the first mistake is not setting up production monitoring for agents so or like relying on like
- 24:02 — um traditional software things for that. I think like yes again you should absolutely know like like straight up errors and things like that. So so maybe one mistake would be like just focusing on like errors. So like if you just focus on errors you’re missing a lot of like um places where the agent didn’t like straight up error but produced a response that made the user frustrated or something like that. And so I think probably yeah the one mistake would be just like not not setting up
- 24:32 — kind of like the observability the specific agent observability for this. Um another that that’s that’s like the like if I was like so like what would I do? First I would set up that um not not actually looking at the traces is another is another thing as well. Again that doesn’t scale a lot. Um but having having a way like if a customer reports an error having a way to find that trace in LangSmith and then going
- 25:03 — to look at that manually is super super valuable. And so if you don’t have a way to associate traces with customer errors or if you’re not spending the time to like look at that trace in detail those those would maybe be two other mistakes. Um » [clears throat] » Your chat.linkedin.com is pretty robust when it comes to controlling the infinite input. When I ask it non-LinkedIn question it steers me back to LinkedIn pretty effectively. How do you design guardrails for agents and make sure you’re you’re you’re looking
- 25:33 — to cover everything. Everyone’s using Chipotle’s AI as Claude Code nowadays. Yeah good question. I mean so there’s there’s two specific things that we do in chat.linkedin.com and it’s open source so you can check it out. One is just prompting. You know we prompt the agent to not respond to kind of like LinkedIn questions. We I think we have some tests for this and evals for this internally. But the the other thing that we do is we actually have a we do have an explicit guardrail. So we do have an explicit guardrail that checks whether the the thing is LinkedIn related or not and if it’s not LinkedIn related it just short
- 26:03 — circuits. And so we do actually split these out and and have an explicit guardrail for that. Um Can we gather traces and evals pre-formatted to improve the overall system? EG export traces to use as context for model improvement. Yes. Yes this is a really good question. I’ll I’ll try to drop it in the chat but or maybe I’ll just share my screen. Um We did launch
- 26:33 — LangSmith CLI which is used for pulling traces down. And so you can give this to Claude Code. You can give this to Codex Cursor. It has a bunch of different things that you can do. So you can you can query traces you can query individual runs query threads manage data sets manage evaluators look at experiments. And so you can install it with this. Um you can install it with go. This is this like
- 27:03 — this helps really speed up that feedback loop because coding agents are so good now. This is this is something that we’re investing a bunch in. Um Sometimes I sometimes wish LangSmith could evaluate my own usage of LangSmith. Any considerations around making LangSmith eagerly suggest ways one might not be using it. Yeah this is a good question. There’s there’s a lot of things in LangSmith. Like we have a line evals we have we have few shot evaluators things like this. We are thinking about ways to more proactively suggest things.
- 27:34 — Right now the interim step that we took is we added Polly which is a chatbot into LangSmith that you can ask questions to and it’s hooked up to our docs so it should know things and it can also look at like what’s in your LangSmith. But we do want to be more proactive about suggesting things when you when you when we think that you’re doing something that could benefit from another part. For online evals in agentic systems with branching trajectories what’s the best strategy to use a single global LLM as a judge that understands all possible paths and expectations or do you dynamically select specialized judges with scoped expectations based on the
- 28:04 — observed trajectory? If so how to execute this judge dynamic selection? Yeah generally we see people setting up they have like static evaluators and then they just define filters that those run on. So for example like if you if if you if an agent has a tool that’s like you know uh ask an astronaut or something like that and you want to evaluate whether it’s asking the astronaut like reasonable questions or whether it’s
- 28:34 — always it’s always you know should is it always asking a teacher before it asks an astronaut? I’m making up a silly example but like what you can do is you can you can write an LLM as a judge that that checks for that. And then you just set up a filter and you can we have really good filtering in LangSmith. So you can set up a filter where it where it only runs over traces where the ask astronaut tool is called or something like that. Obviously that’s a bit biased because what if like what what if you’re looking for kind of like false negatives where it should have
- 29:04 — called the tool but didn’t. So so it gets a little bit tricky but that would be the way to do it. You set you set up a specific evaluator and it’s always that evaluator but then you set up a filter so it only runs over a subset of the places where that’s actually useful. Is there any option to align LLM judges to submit feedback collected in an automated fashion? Not yet. This is this is something we’re working on where if where if you just have people label stuff we will we will kind of like automatically optimize the eval prompt for you. That is coming soon.
- 29:34 — Should evals be written at the start of building AI applications? Evals are a continuous thing. They should be written all throughout building an AI applications. I think it’s really helpful to come up with a base some some evals at the start at the very least. Like, what do you expect people to ask and and how do you want the agent to answer? That can often help be kind of like a product spec for the agent as well. Um, but those e-vals need to continue to grow as you build the agent and you learn more things about how your users use it and stuff like that.
- 30:04 — What do you think the best way to implement offline e-vals for agents that involve complex abstract, e.g., with complex business logic, blah blah blah, human loop interactions? Is open claw the only way? Um, I’m not 100% sure about the open claw tie-in. I haven’t seen people use open claw for e-vals. Um, generally, uh, generally what we see for a bunch of like human in the loop interactions is basically, um, the most common thing is just having a set of a set of a set of different
- 30:34 — inputs that you’ll send in one after another. So, you’ll have like a list of five inputs and maybe the first one is like, “Hello.” And then the second one is like, “Um, what’s my order?” And then and then the third one’s like, “Can you refund that order?” Or something like that. And this is like simulating a conversation. For the most part, we don’t see people doing simulations. We see people hardcoding things. Um, this is this is this like multi-turn stuff is definitely the hardest thing to to to get right. And so, we mostly see people starting with just like a single trace e-val first.
- 31:04 — Is there a way to recursively feed traces that violated policy back to agents without human input? Um, for the for the way, uh, for the sake of like improving the agent. I mean, what you could do is you could you could, you could you could set up a basically cron job that runs once a day, pulls down traces from LangSmith, filters for things that that violated policy, and then and then uses that to update the agent’s code or updates the agent’s memory or something like that. And this
- 31:35 — gets a bit This is like the North Star of like continual learning or something like that. And this is this is this is a direction that we’re we’re looking in. That would probably be the way to do it. I think it’s still, um, we don’t see a ton of people doing that like totally automated in the loop yet because it is just a little bit it’s tricky to get right. What’s the road map for the upcoming new features? Um, we I don’t want to give too much away, but we have Interrupt, which is our user conference, um, in May, May 13th and 14th in San Francisco. We’ll be launching a bunch of
- 32:05 — new features there. A lot of them are in this vein of kind of like this flywheel of improvement and how can we help there? Um, for when you want to chain the human messages and system messages into a graph, would you recommend logging per agent input output or per chained input output? Which way is better for continuous improvement? Um, I’m not 100% sure I understand this question. For which way is better for
- 32:35 — continuous improvement? For continuous improvement, you you really want to have just like this this list of messages of of of of what happened. Um, so, as as long as you have those in in some format, that that should be enough to be able to tell, “Hey, what was the input? What were the steps that were taken that make up like the tool messages?” And what was and what was the output? And then what was the human input in the next turn? So, as long as you’re logging those messages in whatever format, that that would be the best way to do it. Um, from my point of view, the event was
- 33:05 — great. It would be nice to have some hands-on practice in the future as well. Um, good feedback noted. Yeah, we can we can take a note of that. Um, we we do have LangChain Academy, which is our online thing. Um, we will also be running workshops at Interrupts. Um, but we can also try to do some webinars that are more hands-on. Um, that’s uh that’s actually really good feedback. Um, I have a question about agent context management. How can we improve that because I don’t feel like we have much
- 33:35 — control between complex agents, what to pass? Are there any features coming regarding this like deep agent to manage context and intelligently point info for some node and giving exact knowledge the node needs? Um, I I think the trend for this is just is is giving the the agent more and more control to manage its own context, to be honest. So, that would be So, I think file systems are an example of this. And so, deep agents, which you mentioned, have have file systems as as as basically a way to manage context. They can read context from there. They can write context from there.
- 34:06 — Um, uh, when and then when talking to sub-agents, they can use the file system to to pass files around. And and so, I think a lot of the stuff I’m thinking about deep agents is in this vein. And a lot of the trends of harnesses in general of agent harnesses in general are like giving the agent more control to to manage its own context. Have you ever tried something along the lines of agent as an evaluator instead of LLM as an evaluator? Yes. Um, yes, this is this is useful for when you want to evaluate, um, like if if an agent creates a bunch of
- 34:37 — files and you want to evaluate the files that the agent created. Um, that being said, we don’t see this a ton in practice. Um, but uh and the place where we do see people doing it is often times actually doing it as like a critiquer in the loop as the agent’s running. Um, so, we’ll call like a check tool and this check tool will will be an agent itself. Um, we don’t see a ton of of it in kind of like online monitoring. Um, although maybe we’ll start to see more.
- 35:07 — In terms of state, memories, and long-term storage of history, is there any recommendation or best practices of how to append or improve based on an output evaluated information with dealing with improving agent system prompt? Um, the at a high level, like in order to prove the agent system prompt, what you probably want to do is take a lot of the feedback, take a lot of the trajectories, use some LLM to suggest changes to that prompt, just like you’d suggest changes to a file. So, we have Fleet, which is our no-code thing that comes with like built-in memory. Um, and
- 35:38 — uh and and that’s how we do it under the hood. We just treat memories as as files and then and then uh then it just updates it that way. How do you design an evaluator that covers many different use cases like it like the deep agent? A single prompt for analyzing the whole trace often times doesn’t work and it’s hard to give it human input on how it should act in a rubric. Especially for online e-vals, seems like having a single prompt is a must since we don’t know the rubric in advance. Are there ways you have explored on defining a set of golden standard rubrics and the online
- 36:08 — evaluator automatically pulling in rubrics to aid its evaluation? I think often times we see people evaluating for like very specific things. Um, so, like is is is the user frustrated? Is like a very specific thing that you can evaluate kind of like the conversation on. Um, does uh does the does the agent call um uh search tools too many times in parallel or something like that? Um, I’m making stuff up a little bit, but like yeah, deep agents by themselves are are
- 36:38 — deep agents are very broad, but like most people use deep agents to build kind of like a specific vertical kind of like agent application. And then there’s usually like specific focused kind of like evaluators for for that application. Um, yeah. LLM as a judge with open models for cost at scale? Absolutely. I think this is one of the places where open models can be really really exciting. Um, and we’re doing a lot around open models right now. When would you recommend putting the LLM as a judge into the original agent’s
- 37:08 — runtime loop, um, rather than using it afterwards? Um, it it’s pretty product specific. If you if you can afford the latency, um, if you can afford the cost because at that point you can’t really downsample. You’re doing it at at all points. Um, and um and and and basically and like if the if the ROI of the latency and if the ROI of the of the cost are high enough, and that generally means that you’re testing for like really important things. Um, and and the users like find paying more
- 37:38 — money or find waiting a little bit longer. So, it’s mostly kind of like product considerations. How have you seen e-vals play into compliance in heavily regulated enterprises who have a lot of safety and audit concerns? Any examples? Yeah, I mean, um, uh, there’s specific kind of like safety and specific compliance things that you can test for. But more generally, like people just want to have confidence that the agents are doing what they’re supposed to be doing. And that’s where like well-designed e-vals can really convey like the set of like what the agent does and is able to do and what it
- 38:09 — should be able to do. And that’s a really good mechanism for communicating with others is to like show them the e-vals and show them the scores. Is there a schema {slash} docs or a skill for LangSmith traces? Trying to automate CD with an agent analyzing traces every 24 hours improving improvements. Struggling. Traces up any ended up being long as a deep agent. Yes. Um, yes, we’re we’re thinking a bunch about this. There are skills for LangSmith. Um, LangSmith skills on GitHub. Let me share my screen.
- 38:42 — So, there are a bunch of skills here, um, for for using LangSmith skills. Uh, and we will add um for this specific thing that you’ll you’re talking about, we’re we’re we’re working on a skill specifically for that as well. So, you can probably use some of them these here to get started, but we’re adding some that are actually really specifically targeted at what you’re asking about. Any thoughts on a LangChain hackathon? We sponsor a bunch. I don’t know if we’ve done any kind of like first-party ones. Maybe we should do one for the for the
- 39:12 — for the user conference we’re doing. That would actually be a good time. Um, we thought about it last year, but didn’t do it, but maybe this year. Do you think there’s a need to log and evaluate LLM as a judges to log and evaluate LLM as judges {slash} harnesses? Um, you should Anything that has an LLM in it, you should always be logging and evaluating it. So, yes. Are there recommended tools or ways to set up data sets to handle tool calls that are not item potent, right? Uh, yeah. Yeah, this is a really good question. Um, so, we have we have like pi test and
- 39:43 — just integrations um, and so a lot of the a lot of the a lot of the test for more complex agentic things we see being run with those test runners. Um, and and basically what you can do there is you can you can you can mock out a lot of these things. Um, so you can mock out calls to a database. You can also like what what what harbor does which is a code execution thing is they basically give it um, a fresh environment every time. They run it in a docker image. It’s got its own file system. So if you write files
- 40:13 — it doesn’t it doesn’t kind of like contaminate anything. Um, those would probably be the two suggestions. Like use mocks or or run in some sort of like sandboxed environment basically. Do you think we could implement deep agents for vertical AI use cases for implementing context graphs? Um, deep deep agents absolutely can be used for vertical AI. There was just a launch yesterday of Moda a design agent very vertical agent very focused on design. Um, and and we’re doing some more case
- 40:43 — study and we did a case study with them. Um, let me try to find it on our blog. If you if you go to our if you go to our LangChain blog, let me share my screen. Um, if you go to our LangChain blog, it’s one of the most recent articles that we wrote how Moda built production production grade AI design agents with deep agents. So I’d check that out. I think I think absolutely they can be used. Given the increasing use of sub agents and agent teams also run patterns, what is the best way to trace the link and evaluate the collaboration between agents UX wise? They are not currently
- 41:14 — easy to navigate on any platform. Yeah, good question. Where where we’re revamping our basically trace view right now to hopefully make this make this better. Um, if if there are specific kind of like pain points or specific really nested traces that you want us to take a look at and make sure they render nicely. Feel free feel free to get in touch on on Twitter. That’s probably easiest. Shoot me a message. We’re revamping this right now. Does LangSmith support evaluation metrics? Absolutely. We let you bring
- 41:44 — your own. We have some we have some built-in ones. We also integrate with like Ragas and other things like that. Can you talk about sandbox? I can’t wrap my head around when to use it. Um, that’s that’s probably a a little bit of a different question than this webinar, but yeah, sandboxes are great for running untrusted code. So if you’ve got an agent writing code, you you want it to do it in its own kind of like isolated environment. And that agent could be writing code either because it it wants to produce code or because it’s
- 42:14 — writing code to accomplish some other mission whose mission isn’t code at the end of the day, but code is a way to get there. So like if I want to like look at all my emails, I could write some code to like call the Gmail API and iterate through them or something like that. Um, and so we just launched LangSmith sandboxes in in private preview. You can you can try it you can try it out there. Um, there’s a few other ones, but I do need to go. So want to thank everyone for attending. Hopefully this was helpful. Um, a lot of good questions around the
- 42:44 — continual loop and of improvement and also um, the the viewing of these long complex agent traces. Those are both both both these are both things we’re actively working on. So so if people want to reach out or get in touch, um, would love to get your feedback on some of the initial stuff that we’re launching there. Thank you all for coming and hopefully see you at the next one.