Transcript: Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

AI Engineer · 2:04:18 · Added May 19, 2:40 am GMT+8

Source video ID: Xfl50508LZM

Transcript

0:14 — Hi everybody. My name is Laurie Voss. I am head of developer experience at Arize AI. In a former life, I co-founded npm Inc., so some of you may remember me from when I used to talk incessantly about JavaScript. Now I talk incessantly about AI. What I think about mostly is how to test AI systems, how to make AI systems that actually work. We’ve got a good long stretch of time today, so we’re going to cover a lot of ground. We’re going to start with the fundamentals: what
0:44 — evals are, why you need them, and why agents make evaluation harder than a simple LLM call. Then we’re going to set up tracing, which is how you capture the raw data that you need to run evals in the first place. We’re also going to run a simple AI agent with the Claude Agent SDK and look at the traces that it produces. Once we’ve got the data, we’re going to do something that most of the tutorials skip. We’re going to
1:14 — actually look at the data. We’re going to read our traces, categorize what went wrong, and figure out what to measure before we write a single eval. Then we’re going to write three kinds of evals. We’re going to write a simple code eval: deterministic, cheap, easy to run. We’re going to use some of Arize’s built-in evals. And then we’re going to use LLM evals, where an LLM
1:46 — judges the semantic content of the output and judges whether things have succeeded or failed in a more flexible, non-deterministic way. We are going to use the built-in evals, and we’re also going to write a custom eval from scratch. We’ll also test whether our judges are judging correctly, a process called meta-evaluation. Then we’ll finish with datasets and experiments. This is how you iterate on your agent. This is how you use evals, not just to tell you whether things went wrong, but to improve your
2:17 — agent. And at the end we’re going to take a brief look at what comes next after you’ve got this stuff down: practical frameworks you can use after you leave this room, the impact hierarchy, the data flywheel, and a quick tour of techniques like pairwise evaluation and reliability scoring. This is the slide that I left up while you were waiting. Hands up if you intend to actually attempt to code along. Wow, impressive. Also, just impressive
2:48 — how many people are in this room. This is the last session of the day and it’s evals, and you know that it’s about evals? It’s not the most exciting topic. I’m very proud you’re here. This workshop is about evaluating an agent, though, so it’s not going to be about building an agent. I’m kind of assuming you’ve already built an agent and you’re having problems with it going off the rails radically. So the learner notebook that you’re going to get has an agent already built for you. We are not going to spend any time on how you build an agent and what makes an agent good or bad. We are going to be
3:18 — evaluating this agent that I’ve already written. We’re also going to be using Claude both to power the agent and to power the evals. I picked Claude because everybody seems to have switched to Claude in the last couple of months. I’m hoping you have a Claude account already so you don’t have to sign up for an API key right now, but if you do, now is the time. If you want to use OpenAI or Gemini with our stuff, you can: it is completely open source, it is open standards, and it will
3:49 — work with those as well, but for simplicity I have not included any examples of how to use OpenAI or Gemini in this stuff. You will also need a free Phoenix Cloud account. Phoenix is open source and will run on your laptop, but Phoenix Cloud doesn’t involve installing anything, so it’s the easier way to get up and running faster. Phoenix Cloud is where our log data will be sent, so you’ll need a Phoenix API key to get permission to send it there.
4:21 — Hopefully you’ve got the links on these slides, because I’m going to advance them in a second. Really you only need the one to the notebook, because the notebook has the links to everything else in it. Who is at this point completely confused and needs some help? Because I’ve built in some time for that. Feel free to ask questions, and don’t worry about it. I’ve built in time for asking things like, “I don’t know what this is, I’m accidentally on the wrong product.” Speaking of which, Arize
4:51 — has two products. One is called Arize AX and one is called Arize Phoenix. We are working with Arize Phoenix today. If you go to our homepage, there’s a big button on it that says sign up, and it will sign you up for AX, which is the enterprise product, and you don’t want to do that. If you’re completely confused about where your API key goes and what your host name might be, it is because you’ve signed up for the wrong product. So make sure you are signing up for Phoenix. So with that stuff out of the way, or at least while you’re frantically still installing things, let us
5:22 — cover the basics. What is an eval? Who in the room, show of hands, feels like they already know what an eval is? Like, really knows. Okay, great. You’re in the right room. One of the things that bugs me about evals as a field is that we use a whole bunch of jargon that comes from ML land that AI engineers do not need to know and do not understand, and it makes the field needlessly opaque. So you can think of evals as being
5:54 — tests, because that is what they are. And the things that power our tests are log data, and the log data we call traces. Just as logs record what your server did at runtime, traces record what your AI did. Every agent call, every tool call, every LLM invocation, all of the inputs and outputs from your AI application are recorded as traces. And the building blocks of traces are called spans. Each span represents
6:25 — one step in the execution. So an LLM call is a span, a tool call is a span, and a full agent turn is a span that contains other spans inside of it. It’s this nested data structure that can be easily imagined as JSON, because it is JSON a lot of the time. Each span records its input and its output, but also a bunch of metadata: things like timing and token counts, stuff that you need to understand what went on. And that is your mental model today:
6:55 — you are writing tests for a brave new world of applications that are very hard to test. So it’s not rocket science. Why do we need them? The reason we need AI evals is the vibes problem. A lot of people build an AI feature and test it by running a few queries and sort of going, “Does this look right?” Then you ship it and it fails on inputs that you didn’t test. It fails on edge cases. It fails on people being adversarial and putting bad stuff in there that you weren’t expecting.
7:25 — And it fails most often on people asking questions that are dumber than you were expecting. The primary way that agents fail is people using vocabulary that you weren’t expecting, hitting an agent that is expecting a bunch of nouns and verbs that those people don’t know. The usual fix, unit tests, doesn’t work here. The reason is that the same prompt will produce different text on every single run, but those different outputs might all be correct. There’s a huge
7:57 — space of potential correct outputs, so you can’t just use basic string matching, like most unit tests do, to make sure that your tests are passing. So a lot of teams fall back on human review. They watch it, they run it, they ship it. That doesn’t scale, because it doesn’t catch regressions and, most importantly, it doesn’t run in CI. So what you can’t do without evals is change your system prompt to fix a tone issue, because the tone might get better but suddenly the bot might be hallucinating product features, and
8:28 — without evals you won’t catch that until a user reports it. But you can do something called a faithfulness eval, which will tell you whether or not your bot is using its source material correctly, and we’re going to show you how to build one of those today. And every prompt change potentially affects every kind of input that users send. This is another way in which
8:59 — AI agents are different: if you change one thing about your prompt, it doesn’t just change the thing you were trying to change. It changes everything that the agent does, because the agent absorbs the prompt as a block and will do unpredictable things as you make even minor changes in wording. You don’t want to fix one thing and break another, and evals give you a way to test that everything is working the way that it is supposed to, including all of the things that it used to do. You also can’t
9:29 — switch models without evals. This is very important, because the big model labs come out with a new model every couple of months, and the models are meaningfully different: they’re not just better, they’re different. So a prompt that worked for Sonnet 4.5 does not work for Sonnet 4.6. If you don’t have evals, that’s a very expensive and cumbersome process where you have to test everything that you tested previously again to see if it works. Whereas if you have a suite of evals, you can just run a regression and know whether or not it is safe to upgrade.
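The "run a regression and know whether it is safe to upgrade" idea can be sketched roughly as follows. This is only an illustration, not the workshop's code: it assumes each eval case has already been reduced to a pass/fail boolean per model, and the function names and case ids are hypothetical.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Return the eval cases that passed on the baseline model but not on the candidate."""
    return sorted(
        case
        for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )


def safe_to_upgrade(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """An upgrade is considered safe only if nothing that used to pass now fails."""
    return not regressions(baseline, candidate)


# Hypothetical pass/fail results for the same eval suite on two model versions.
baseline = {"tone": True, "faithfulness": True, "json_format": False}
candidate = {"tone": True, "faithfulness": False, "json_format": True}

print(regressions(baseline, candidate))
print(safe_to_upgrade(baseline, candidate))
```

Note that a case the candidate newly passes (`json_format` here) does not offset a regression; the check only looks for behavior that was lost.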
10:01 — And this is not theoretical. Real teams that are shipping real AI agents that people use all the time, Descript, Bolt, Anthropic itself with Claude Code, this is how they do it. They all followed the same pattern. They started by shipping fast and checking things with vibes, discovered that it didn’t scale, and moved on to formal evals. So that’s the arc that we’re going to follow today. Like I mentioned, there are two types of evals. There are code evals, which are
10:31 — deterministic functions you write yourself. They’re just basic Python or TypeScript. They run in milliseconds, and because they are just code, they cost basically nothing to run. You can do a lot with code evals. If the output of your agent is supposed to be JSON, you can test whether it is JSON. You can test whether it is under 500 tokens. You can check whether it mentions the thing that you asked about. The big advantage of code evals is that they’re super fast, super cheap, and totally reproducible. So, if you
11:02 — have something simple that you need to test, you don’t need to throw an LLM eval at it every single time. You can get quite far with things that look basically like old-style unit tests, although there is a lot of subtlety to how to put them together that we’re going to cover today. The downside of code evals is that, like unit tests, they’re going to be brittle. When your output becomes complex or extremely non-deterministic, you’re going to find that you need to move to the other type of evals, which is LLM as a judge.
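The three code-eval checks mentioned above (is it JSON, is it short enough, does it mention the topic) might look something like this. It is a minimal sketch with hypothetical function names; the length check counts whitespace-separated words as a stand-in for tokens, since real token counts depend on the model's tokenizer.

```python
import json


def is_valid_json(output: str) -> bool:
    """Deterministic check: does the agent output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def under_word_limit(output: str, max_words: int = 500) -> bool:
    """Length check using words as a rough proxy for tokens (an assumption)."""
    return len(output.split()) <= max_words


def mentions_topic(output: str, topic: str) -> bool:
    """Case-insensitive substring check that the output mentions what was asked about."""
    return topic.lower() in output.lower()


print(is_valid_json('{"ticker": "TSLA"}'))
print(under_word_limit("Tesla revenue grew year over year."))
print(mentions_topic("Tesla revenue grew year over year.", "tesla"))
```

The substring check also shows the brittleness described above: an output that says "the company" instead of "Tesla" would fail even if it is a perfectly good answer.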
11:36 — With LLM as a judge, you use a second LLM to judge the output of the first LLM. Usually you use an LLM that is more powerful than the one that you put into production. You grade your outputs against a rubric, rubric being another one of those terms that the ML engineers snuck in there. A rubric is just a set of rules. It’s just a prompt, basically, that defines all of the things that you need to test. And because LLM evals are
12:07 — powered by an LLM, they are a lot more flexible. You can ask, was this response factually accurate? You can ask, did it stay faithful to the source material? You can ask, is the tone right for a customer interaction? There’s no kind of unit test that you could write that would check for tone, but an LLM is very good at that kind of stuff. The strength of LLM judges is that they understand meaning, not just basic strings. But LLM as a judge has trade-offs. The biggest weakness is that they are expensive,
12:38 — both in terms of time and actual money. And they are non-deterministic themselves, which means that your LLM as a judge can itself be wrong. You have to prompt your LLM to be a judge, and that prompt is just as complicated as the prompt that you were putting into your application itself. So you have to spend some time making sure that your LLM as a judge is aligned: that it is actually judging what you intended it to judge, and that it does that in a way that you believe is correct. There’s also a third type of
13:08 — eval, which is human evaluation. Human evaluation is the gold standard. It is great. Humans are the best at judging whether things are good or bad according to other humans. The problem is that humans don’t scale. You need this stuff to run in CI; you need it to run thousands of times a day. Even if you work at Meta or Google and you can hire the population of a small country to do your evals for you, there is no way that it’s going to be cost-effective to have humans doing your evals for you. So what we use
13:38 — humans to do is to build what we call (you’re going to hear about this later) a golden dataset of known good answers that you can then judge your evals against. A fun fact about human annotators is that, even if you are using humans to judge the output of your AI application, they’re going to get it wrong about 50% of the time, because if you hire somebody to do nothing but test your code, they’re going to get tired, and fatigue makes them miss things roughly 50% of the time. But code
14:10 — evals, human evals, and LLM as a judge are not competing approaches. They are complementary approaches. A real eval suite is going to use all three of these at the same time. So the question becomes: when do you use which? Use code evals when the answer is deterministic. Format validation, length limits, forbidden phrases, required fields: all of that kind of stuff you can check very deterministically. Then use LLM judges when you need semantic understanding. Like I said, a correctness eval asks, did it answer the question accurately? A
14:41 — faithfulness eval asks, did it stick to its source material, and only its source material, when answering that question? And you have to keep humans in the loop for failure modes you haven’t seen before. You have to make sure that your evals are being run in a way that humans believe is real and believe is correct, because LLM judges can be wrong. All of that is true of any application that uses an LLM for anything, but agents make it even harder, because agents have cascading failures.
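To make the LLM-as-a-judge idea concrete, here is a rough sketch of a faithfulness judge. It is not Phoenix's built-in evaluator: the rubric wording, the `judge_faithfulness` name, and the `call_llm` parameter are all assumptions for illustration. The judge model is passed in as a plain function so that the sketch runs without any API key; in practice it would wrap a real model call.

```python
from typing import Callable, Dict

# A rubric is just a prompt. This wording is illustrative, not a tested rubric.
RUBRIC = """You are grading an answer for faithfulness to a source document.
An answer is faithful if every claim it makes is supported by the source.

Source: {source}
Answer: {answer}

Reply with exactly one word, faithful or unfaithful, on the first line,
then a one-sentence explanation on the second line."""


def judge_faithfulness(source: str, answer: str, call_llm: Callable[[str], str]) -> Dict[str, object]:
    """Ask a judge model to grade the answer, then parse its reply into a result."""
    reply = call_llm(RUBRIC.format(source=source, answer=answer)).strip()
    first_line, _, rest = reply.partition("\n")
    label = first_line.strip().lower()
    if label not in ("faithful", "unfaithful"):
        label = "unparseable"  # judges are non-deterministic and can ignore the format
    return {
        "label": label,
        "score": 1.0 if label == "faithful" else 0.0,
        "explanation": rest.strip(),
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for a real judge model so the sketch is runnable offline."""
    return "unfaithful\nThe answer cites a product feature that is not in the source."


result = judge_faithfulness("The plan includes email support.", "It includes phone support.", fake_llm)
print(result)
```

The `unparseable` branch is there because the judge's own output is model-generated and cannot be assumed to follow the requested format.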
15:13 — In an agent, you have not just one thing that you are testing, but a series of things. You have an agent that can take any number of paths, and an early misstep on one of those paths can lead the agent in radically incorrect directions. So if you have an agent that is making tool calls, you have to test whether the agent picked the right tool, whether it sent the right
15:43 — parameters to that tool, and whether it correctly understood the output of that tool. And that’s just adding one tool call, whereas an agent can run multiple tool calls in a single session, and each of those tool calls relies on the output of the previous tool call. So it gets very, very complicated very, very quickly, which is why you generally need to throw an LLM at the problem. And then there are multi-agent systems. If you haven’t made things complicated enough, you get additional levels of
16:13 — complexity, because then you have to test: did my routing LLM choose the right sub-agent before anything started happening? Did my sub-agent correctly understand what it was being asked to do? Did it pass the information back correctly? Did it stop when it was supposed to? All of this stuff builds up and cascades. An example would be: you ask your agent to write a report on Tesla. The first agent that does the research
16:43 — is like, “Oh, I guess you must mean Nikola Tesla.” It sends you a whole bunch of information about the 18th century inventor, and then you output an investment case that is based entirely on whether Nikola Tesla was a good guy, and that gets forwarded to your boss, and nobody notices because the agent was just doing this all autonomously. That is the cascading failure that we are trying to avoid. But agents can also do the opposite. They can get things right in a way that you weren’t expecting. So one of
17:14 — the hazards of writing evals is writing evals that are too prescriptive. You don’t want to write an eval that says, “I’m expecting it to call tool A and then tool B and then make decision C and then get the answer,” because the agent might find a way to be more clever than that. This happens a lot in production. The agent will find a loophole, or it will find a faster way of doing what it’s doing. Especially if you’ve just upgraded your model, it will get better at doing things that it was doing before. It will do them in fewer steps. So your evals will break if they are too
17:45 — prescriptive. And that’s a hard problem. There’s another way to categorize evals into two groups, which is capability evals versus regression evals. A capability eval is giving your agent a hill to climb. You’ve given it something that you know it’s bad at, something that it’s going to mostly fail at, and you are giving it a loop where it can get
18:15 — progressively better at passing this capability eval. Once it’s hit 100% on a capability eval, that capability eval becomes a regression eval. You build that into your test suite and you make sure that it can always do the things that it used to be able to do, and you give it a new capability eval where it climbs a new hill and adds new functionality. So in the life of your eval suite, you’re going to be constantly turning capability evals into regression evals. This is what an eval result looks like.
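The slide shown at this point is not captured in the transcript. As a stand-in, here is a minimal sketch of an eval result with the three parts the speaker goes on to describe: a score, a human-readable label, and an optional explanation. The class name, field names, and example values are assumptions for illustration, not Phoenix's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalResult:
    score: float                       # numeric result, e.g. 1.0 for pass and 0.0 for fail
    label: str                         # human-readable label
    explanation: Optional[str] = None  # typically filled in only by LLM judges


# A code eval produces a score and a label but no explanation.
code_result = EvalResult(score=1.0, label="valid_json")

# An LLM judge adds an explanation of why it thinks the output failed.
judge_result = EvalResult(
    score=0.0,
    label="incorrect",
    explanation="Recommendations were given, but no costs, so the budget part of the request was not addressed.",
)

print(json.dumps(asdict(judge_result), indent=2))
```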
18:47 — It’s very simple. You can imagine a JSON object with this stuff in it, because that’s what it is. You get a score and a label that is human-readable. And LLM judges add a third thing: they add an explanation. Code evals don’t produce explanations, because they are just code. You didn’t build an explanation into it. But LLM evals are extremely helpful in that they don’t just say that something was wrong. They say why they think it was wrong. As we’re going to see, this is extremely helpful when you are building
19:18 — your eval suite, because it can tell you what was wrong. It can tell you what was missing. It can tell you what the agent should have done differently. It will give you hints as to how to make your prompt better, which is extremely helpful. Here’s what an explanation from a real eval looks like. In this case the user was asking for budget travel recommendations for Tokyo, and the LLM as a judge has said, okay, well, they
19:48 — gave travel recommendations, but they didn’t specify how much everything cost. So if I’m looking at budget travel, that’s a failure. That’s a really subtle distinction, right? Because if you were asking, did it give me travel recommendations for Tokyo? It absolutely did. But it didn’t follow everything I asked it to do. It didn’t follow the subtlety of the original request. And that explanation is what makes evals actionable. You know what to fix
20:18 — in the prompt because you have an explanation that says this is what went wrong. More usefully, when you run evals across thousands of traces or thousands of spans, you will find patterns. You will find that your agent fails on the same kind of problem in the same kind of way lots and lots of times, and that gives you a hint as to what is a one-off failure, just the agent being weird and non-deterministic, versus what is a systematic failure, which is likely to be a problem with your prompt.
20:49 — Of course, if you’ve run an eval against a thousand traces and you’ve got a thousand explanations, then you as a human have to read a thousand explanations, which creates a new problem: reading a thousand explanations and categorizing them all is kind of a pain in the ass. So you get a third LLM involved, and you get the LLM to read the explanations and turn them into categories, until it’s LLMs all the way down.
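Once explanations have been turned into categories, separating one-off failures from systematic ones can be as simple as counting. The sketch below assumes the category labels have already been assigned (by an LLM or a human); the function name, the category names, and the threshold of three are arbitrary choices for illustration.

```python
from collections import Counter
from typing import Dict, List


def systematic_failures(categories: List[str], min_count: int = 3) -> Dict[str, int]:
    """Return failure categories seen at least min_count times.

    Categories below the threshold are treated as likely one-offs; the ones
    returned here are candidates for a prompt or application fix.
    """
    counts = Counter(categories)
    return {category: n for category, n in counts.items() if n >= min_count}


# Hypothetical categories assigned to failed eval explanations.
failed = ["missing_costs"] * 4 + ["wrong_ticker"] + ["stale_data"] * 3

print(systematic_failures(failed))
```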
21:19 — So this is the full loop that we’re going to build. Instrumentation is where you start. That is where you start capturing the data. You get traces, you evaluate them, you annotate your evaluations, you analyze the results. Then you change your prompt, you change your application, you improve your application, and you start over. So let’s get started with the actual evals. Who wants to be set up with Phoenix but is not yet set up with Phoenix?
21:51 — Okay. What do you need? Where are you stuck? Or are you not? » Okay. Sorry. Great. Excellent. Everyone is set up with Phoenix as far as I can tell. Or they’re not brave enough to put their hand up and say they aren’t. So what is Phoenix, now that I’ve been talking about it for 20 minutes? Phoenix is Arize’s open-source AI observability platform. What it does is capture your trace data. It captures those logs that we’re generating: every LLM call, every tool call, every agent step, with the
22:22 — inputs and outputs at each point. It also captures and stores your evals and gives you a UI for examining all of that. And it also gives you a way to run experiments on that data, so that you can test how things are improving and make your application better. Like I said earlier, you can run it locally on your laptop if you want to, but we’re using Phoenix Cloud today because that involves not installing any software. So let’s look at our actual notebook.
22:52 — The first thing is the pip install. We’re installing the Claude Agent SDK, like I said, and we are installing the OpenInference instrumentation for the Claude Agent SDK. This is the magic. Every SDK, every framework that is out there has an integration package written by us that automatically instruments and starts collecting logs from that framework. This is why Arize Phoenix is so useful: you don’t have to dig
23:24 — into the bowels of your framework, or dig into the bowels of OpenAI or Claude or Gemini, and get it to send logs to you. The people who built that software have already decided on this open-source standard called OpenInference, and they’ve plumbed in all of those log lines already. So all you have to do is tell it to start sending those logs to a particular place, and that is what we’re going to do in a bit. So let’s move on to the very first cell.
23:56 — We’ve got those API keys. My API keys were pulled in via Colab Secrets, but you can just paste them in, because no one’s going to see your screen, right? There’s only you in this room. And then these two lines of code are the magic. You import phoenix and you call register. You give it a project name; that’s the name it’s going to show up under inside of Phoenix. And you tell it auto_instrument=True, and that is the magic command that tells it: dig into the
24:27 — internals, where you already know how to do stuff, and turn on all of the logging. It’s complaining at me in this error message because it is using a simple span processor, which is good for demos but not good for production. In production, you’re going to get thousands of spans at once, so you want to use a batch processor. But this is not that, so it doesn’t matter in this particular instance. Oh yes, we also brought in Anthropic
24:58 — up in the imports, because we’re using Anthropic to be the LLM that runs the LLM as a judge, which is why we have to bring it in separately from the agent SDK. The Claude Agent SDK, if you haven’t used it, is a very, very simple framework for building agents. That is why I picked it: it’s nice and simple and it doesn’t have a whole bunch of ceremony. The agents can use tools, they can search
25:28 — the web, and they can maintain conversation across turns, which is really all I need my agent to be able to do for the purposes of this example. OpenAI has their own agent SDK, and of course there are whole agent frameworks like CrewAI, LangChain, Mastra, LlamaIndex, all those sorts of things, that will let you build a lot more elaborate agents, and we have instrumentation packages for all of them. So whatever you have used to build your agent, you can do the same two lines of code and just turn it on.
25:58 — So you can go ahead and run that cell. » Question. » Sure. » I cannot find where to generate the API key. Can I do that? » Happened to me. » Go to Spaces. » Yeah, you’ve got to initialize one space beforehand. » Just launch the space and that’s what you want. » Thank you.
26:28 — » Great. » Question, actually. » Sure. » What’s the Phoenix collector endpoint? » The Phoenix collector endpoint is the thing that I called the host name. » So it’s... » It’s app.phoenix.arize.com/s/ and your username. » All right. Anybody else need help on that kind of stuff? Cool. It is warm in here.
26:58 — So let’s look at the actual agent. Today we are building a financial analysis agent, a very, very simple agent indeed. It has just two sub-agents. It has one agent that does research: if you give it a stock ticker or a set of stock tickers, it’s going to search the web for information about those companies. Then it passes that information to the second part of the agent, which writes a concise financial report about that research. Obviously a real financial research
27:28 — agent, which is a thing people build in production all the time, is going to have a lot more elaborate databases to refer to. It’s not just going to be doing web search, and it’s also going to be able to write much more complicated reports. But that is fundamentally what we’re going to ask it to do. And here is how the two turns connect. You can see the Claude Agent SDK maintains conversation
27:58 — context. So in turn one it’s doing research; in turn two it’s writing a report. And because they are both using, in this line, the Claude SDK client, the context from the first one is shared with the second one, and that’s all you really need to know about what this agent is doing. I’m not going to go into the details of the boilerplate there; that is just handling output. So next you set your API keys. Like I said, you did all of that stuff already.
28:29 — You’ve done your register step. And I want to emphasize again: if there’s anybody in the room using Arize AX, you are in the wrong place. You should be using Arize Phoenix. But I think everyone’s got that down now. » It redirects you from one to the other. » Oh, really? » Oh, yes. So we’ve shown the register function already. The register function sets up OpenTelemetry. Open
28:59 — Telemetry, also called OTel, is used by all of the major observability providers. So anybody who’s doing logging, Kubernetes logs, all that kind of stuff is done via OTel. There is a layer on top of OTel called OpenInference, which adds a lot of extensions to OTel that are LLM-specific. So stuff like prompt text, completion text, token counts, which model was called, what tools were invoked, all of that kind of stuff is handled by the OpenInference extension to OTel that everybody
29:30 — builds for. And now we’re building our agent. Like I said, those two steps. This part is just us setting up our agents. We’re using Claude Haiku. The reason we are using Claude Haiku is because it is reliably dumb. It’s not a very good agent, so it’s going to make some mistakes, which will then give me something to test against. I’ve given it one allowed tool, which is web search. And
30:01 — I’ve also given it a permission mode of accept edits, because otherwise it kept trying to prompt me to be able to change files, which turned out to be a mistake that we’re going to cover later. So it’s cheap and fast and it’s great for demos like this. And these are our two turns, research and write report. And we wrap all of this in an async function called financial report,
30:31 — which is all we need to create an agent. We also do this line here, which is wrapping all of it in an OpenTelemetry span. The reason we are doing that is because, by default, step one and step two, the research step and the write-report step, will show up in Phoenix as separate traces, because they are separate agent turns. So here we are telling the agent that both of these should be considered a single agent doing a single thing. So
31:01 — we’re just sort of pre-formatting our data to make it easy to handle the agent. And with that done, we can run our agent. If you run your agent now, it will take a minute or two. You can give it your own stock ticker and your own focus area if you want. I’ve given it Tesla. And this is what it wrote for me. It output some boilerplate about what it
31:32 — was doing. It was researching. It was writing. And then it gave me an executive summary about Tesla. We’re asking it specifically to analyze the financial performance and growth outlook of Tesla. And you can see that the LLM does multiple rounds of reasoning at the top. It does multiple steps of research before it gets to the writing stage. What’s happening under the hood is it’s doing a series of web searches. It’s doing a web search; it’s finding out
32:02 — whether or not those search queries gave it useful information. It does more web searches. It goes into further depth. It is doing what an agent does: it is being non-deterministically helpful. And that is important, because we couldn’t have written a test that said it does one web search and gets all the answers. We couldn’t have written a test that said it did five web searches and got an answer, because we simply don’t know what it’s going to do. Then when it’s done researching, we send the writing prompt and it compiles everything into a report. And all of
32:34 — this stuff is non-deterministic, and that’s cool. And that is why we need evals, like I said, because we can’t predict the output from the input alone, and every single one of those decisions is going to be captured as a trace that we can test. So this is what it produced: Q4 2025 highlights, 2026 financial outlook, key growth drivers, risk assessment. Fantastic. It looks pretty legit to me.
33:05 — but pretty legit is not enough to ship something to production. That is the whole point of evals: the vibe is good, but we want to be doing better than vibes. Uh, so let’s uh see what Phoenix captured when I ran that agent. This is what the Phoenix UI looks like. Each one of these rows is a span. So, uh, at the top level, you’re seeing the input and output of the entire agent span. That was that thing that we did that wrapper around. You’re
33:35 — seeing that it said research Tesla and focus on financial performance and growth outlook. And you’re getting the same output that we showed in the Colab. Uh, but inside of it, you can see all of the things that went into it. So, you can see in this instance, it did four different web searches. It queried financial performance, uh it queried growth outlook 2026 future, it queried quarterly results revenue profit margins, uh and it queried uh Cybertruck production Roadster demand 2026, which I feel is particularly ominous for Tesla in 2026. Uh
34:08 — and you can see the two steps, one where it did the research, and you can see the other step where it did the output. Uh so all of the steps, every single thing that your agent did, is captured in the Phoenix UI. Um and that is the key to observability. We can observe. We can see what the hell is going on. That is what we wanted to be able to do. Traces reveal every decision the agent made. Uh and you can click into any span and see exactly what the model received as
34:38 — output, like I said, um, and all of that stuff. Uh, let’s switch back to slides. Uh, what lives in a span? We’ve got a whole bunch of stuff. Uh, if you click into a span itself, you can get the exact details of a span. So, you can see uh, annotations, attributes. Like I said, there’s a lot of JSON under the hood. This is an actual span. You can see there’s cost in there, there’s what
35:08 — model it was using. There’s the token count, how many things it went through. Uh and then the output messages are giant blocks. Uh and a whole bunch of other stuff. So, uh what Phoenix is doing for you here is making this stuff a lot more readable than it would be by default. Um but just one span, just one agent run, is not enough. We want to be able to have multiple spans. We want to have a whole bunch of data uh to know what’s going on. So uh in the notebook there are 12
35:39 — test queries uh that you can run. Um it just takes a whole bunch of uh tickers and a whole bunch of focus areas uh and it runs them in a loop and runs the agent 12 times. Um I’ve already run these ahead of time because it takes a long time to do that. But you should kick them off now and while I’m talking they will run in the background for you. It should take about five minutes, but timing on this stuff is tricky. Uh that is why I picked Haiku as well, because it’s fast, because it’s not thinking very hard. Uh
36:11 — and while we’re doing that we can go back to our slides and cover some theory. Um so we’ve got single-ticker analyses with different focus areas: financial performance, revenue growth, competitive landscape, AI strategy. Uh and one comparative analysis uh where I’ve asked it about two tickers at once. I have asked it about Apple and Microsoft uh and told it to compare the two. Uh this gives us the variety that we need. This is important uh to test a variety of possible use cases for this agent and how it would behave in those cases. Uh
36:42 — in real data, you would not just have 12 of them. You would have several hundred of them based on uh all of the things you’re asking people to use this agent to do. Uh and it’s very important to cover the edge cases. The Apple Microsoft comparative analysis for instance uh is a lot harder than a single ticker query because it has to do two sets of research about two different companies and not get them confused uh and then write a report about both of them at the same time. Uh the Rivian query for instance asks about a company that isn’t public yet. So
37:12 — there are much less obvious data sources for where it can get that kind of stuff. Uh the Coca-Cola analysis, KO, uh is a very different kind of analysis from growth stock queries. So if we only tested on Apple and Nvidia, we might get uh false confidence about how good our agent is at doing financial analysis because those are huge tech companies, lots of information available online. Uh whereas people using this to do actual stock ticker analysis are going to ask about much more obscure companies and much more complicated questions.
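A minimal sketch of a varied test-query set like the one described above, in plain Python. The notebook's actual 12 queries are not reproduced here: the `AgentQuery` class, the query list, the prompt wording, and `run_agent` (a stand-in for the real agent call) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentQuery:
    tickers: tuple  # one ticker, or two for a comparative query
    focus: str


# Illustrative queries only; not the notebook's real list.
QUERIES = [
    AgentQuery(("TSLA",), "financial performance and growth outlook"),
    AgentQuery(("NVDA",), "competitive landscape"),
    AgentQuery(("KO",), "revenue growth"),
    AgentQuery(("AMZN",), "AI strategy"),
    AgentQuery(("AAPL", "MSFT"), "financial performance"),  # comparative edge case
]


def build_prompt(query):
    """Turn a query into the text sent to the agent (assumed wording)."""
    verb = "Compare" if len(query.tickers) > 1 else "Research"
    return f"{verb} {' and '.join(query.tickers)} and focus on {query.focus}."


def run_agent(prompt):
    """Hypothetical stand-in: replace with the real agent call."""
    return f"[report for: {prompt}]"


def run_all(queries, agent=run_agent):
    # One agent run per query; each run is expected to produce one trace.
    return [(build_prompt(q), agent(build_prompt(q))) for q in queries]


if __name__ == "__main__":
    for prompt, report in run_all(QUERIES):
        print(prompt, "->", report)
```

Keeping the queries as data rather than hard-coded prompts makes it easy to add edge cases later without touching the loop.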
37:43 — So as you’re running it, you will see uh your traces begin to stack up. My traces look a lot more complicated than your traces right now. Don’t worry about that. We’re going to add all of the complications as we go. Yours are going to be really short little rows right now. Uh you can see all 13 of them now because I ran 12 plus that original one. Um the initial Tesla query is there. Uh and before we write any evals, we have to
38:13 — look at the data. This is what I said at the beginning we were going to do. Uh a lot of people just handwave away looking at the data and like you look at your data. What does that actually mean? Uh you should when writing a real eval suite uh be actually reading traces on a regular basis. This is not something that we’re doing just in a lab. It’s not something you do just in theory. You have to read the traces when you are putting together your initial evals because that is how you know uh what the agent is actually producing and what you
38:43 — should actually be testing because it’s so non-deterministic. What was the input? What was the output? What specifically is broken? Uh, Anthropic for instance invested a whole bunch of time and money in tooling that makes it very very easy for them to read their agent evals and their agent outputs. That is what uh we are doing here for you. We are providing you with that tooling. Um so uh you need requirements first before
39:16 — you can categorize failures. Uh you need to know what success looks like. Uh a way that evals tend to fail is uh asking questions that are too broad or require context that your agent doesn’t have. Uh so you can’t say it doesn’t work if you haven’t defined what works means. So for our financial analyst, what does a good report look like? You can’t just say the report should be thorough. You can’t just say the report should be in-depth. What does that mean? Uh so in
39:50 — this case we want our agent to report actionable recommendations. We want it to be able to tell us whether we should buy the stock or not. That’s a very specific thing that we need to be able to do. So it should be able to distinguish between forward-looking analysis and historical summary. Uh because those are our success criteria. And like I said earlier, this isn’t rocket science. This seems like obvious stuff. Uh but you need to define that stuff upfront uh and write requirements down uh because otherwise you will have vague evals that are basically flipping
40:21 — a coin. You’re like, did it do well or not? The agent doesn’t know because the agent doesn’t have enough context. It doesn’t have enough rules to decide whether or not things are good or bad. Uh and this is not a purely technical exercise. This is where you should get your stakeholders involved. This is where you should get domain experts involved. This is where you should get your product managers involved. This is where you should get your actual users involved uh because they are the ones who know what good looks like. The engineer who is writing your agent is not necessarily going to be the best judge of uh what a good
40:53 — outcome looks like. They might say, “Oh, well that looks legit to me.” Whereas an actual user would go, “This is completely useless.” Uh so it’s very cross-functional. Um, and uh, if you can’t define what great means, you’re not going to be able to uh, write an eval that checks for what great is. So, a quick note on where to get test data. Um, we have in this workshop production traces because we’ve already run the agent. Uh, what if you’re building something brand new?
41:23 — What if you are in development and you don’t have real users to generate real trace data for you? Uh, the answer is synthetic data. Get uh yet another LLM involved, get it to generate a bunch of fake queries uh that are deliberately spread as far across the spectrum of use cases as the LLM can imagine, uh to generate real traces that test what your agent would do if you gave it that data. Uh the examples that I like are uh if you give a financial analysis agent to a bunch of real
41:55 — users, one of them is going to ask, you know, research Tesla financial performance. One is going to ask what’s going on with Tesla stock, which is the same question phrased completely differently. One is going to be like, yo, is Tesla a buy right now? Uh which is the same question uh you know phrased for Gen Z. Uh, they’re all the same intent, but they look very different. So, the outcomes might or might not be different, and you won’t know until you run your evals. Uh, we’re also going to need to include edge cases. So, non-existent tickers,
42:25 — multi-part questions, jailbreak attempts, adversarial stuff. Um, they might just be 1% of your traffic, but they are the 1% of your traffic that tends to end up on Twitter when your agent goes completely off the rails. Uh, so your test data should look like production data. Best place to get uh eval data, uh test data, is from production. Throw the agent into production and get some stuff. If that is too dangerous, then synthetic data is your next best bet. Now, let’s examine those traces. If you’ve been running it in the background, you should probably have
42:56 — finished generating those traces by now. Uh so, let’s see what Apple did. Uh I wanted to be able to, you know, go to specific things that went wrong in specific ways, which is why all of my data is pre-generated because it’s all non-deterministic. So, I couldn’t predict that it was going to go wrong. Uh so, Apple’s one did a really funny thing, which is it did a whole bunch of research. Uh and then it tried to write the output to a file because it thought it was Claude Code and it was like, “Oh, you wanted a report, so I’m going to produce a markdown file on disk for
43:26 — you.” Uh so it tried to call the write tool and then completely failed because it was running in a notebook which doesn’t have write permissions. Uh and this was a real failure of the agent that I didn’t predict when I was putting together this demo. I was not expecting uh that it would try and write to disk. I didn’t ask it to do that. Uh I was just too vague about my requirements of write a report. Write a report to the output, not to the disk. Uh
43:57 — so what actually happened with Apple? What did it say? Uh it presented a concise financial report, but it didn’t say whether I should buy it or not. It just said dominates in near-term profitability and market cap, which is nice, but not what I wanted, right? I wanted something to tell me whether or not I should buy this stock. Uh the Nvidia report uh was likewise uh very thorough, but not actionable. It didn’t tell me whether or not I should buy it. Um, the other ones were interesting. Uh,
44:28 — let’s look at the Nvidia one because it is cool. Where did I put Nvidia? There it is. Uh, so it did four web searches, uh, including the competitive landscape, specific competitors, stuff like that. Um, uh, and it actually did the output correctly. It wrote the output to, uh, the agent itself. Um, but this is why you have to read your traces, because if I hadn’t read my traces and I
44:59 — just judged it on the output, the output of the Apple one where it tried to write to disk is complete garbage. But it’s not because it didn’t write the report correctly. It had all of the data that it needed. It did all the web searches correctly. It wrote a good report. It just wrote it to disk and then failed. Uh so you need to be reading your traces because you need to know these unexpected outcomes are happening. Uh and if you click through all of these traces uh you will see that, how quickly can we, yeah, you can see that the write
45:29 — failure happens multiple times. So there is uh a pattern there just by eyeballing it. We can see that uh it tries to write to disk multiple times. This is a systemic failure in our agent that we’re going to have to address. So, uh, the last one is an example of, uh, confidently wrong. This is the last one I’m going to click into. Uh, the Rivian report. Uh, it includes all sorts of
46:00 — stuff about Rivian. Uh, like it, you know, shipped 62,000 to 67,000 vehicle deliveries in 2026. Uh, did it make that up or not? I have no way of knowing because I don’t have the information about Rivian and Rivian’s information isn’t particularly public. So that is a suspiciously, like, accurate number for a company that doesn’t share its financial results. So again, reading the traces, I’m going to look at that and go, right, that’s a situation where I should be checking carefully whether or not uh it is hallucinating that information,
46:31 — whether or not it really got that research data from somewhere that I trust. Uh and when you find a failure, you have to ask why it failed. Uh so “the response was wrong” is a symptom. Did it get bad research results? Did it try to write to disk when it shouldn’t have? Did it have the right data but produce the wrong conclusion? Did it make up a stock price? Uh each root cause points to a different fix. Uh so the thing that you have to do with this data is categorize it. Looking at individual traces is very very important. Uh and then you produce en
47:02 — masse uh a pattern of where things are going wrong. Not just going wrong once in a while, but going wrong systematically. Uh, and like I said, you can use an LLM to do that. Uh, you can use an LLM to run through all of your traces. There are APIs to do that in Phoenix. Uh, examine all of your output, annotate your traces, and say, “This was bad for this reason. This was bad for that reason.” Uh, and in a real data set, you’re going to soon find patterns emerging. Uh in the notebook I have printed a rough and
47:33 — ready graph uh of things that went wrong. I went through my trace categories. I said, you know, things that were uh good and bad. Uh I printed them out uh and then I got it to produce a table. So root cause frequency: mostly looks good, possible hallucination, reasoning gaps, unverifiable data, uh missing recommendations. Um it’s easy to look at this graph and say, okay, so possible hallucination is the most important thing
48:03 — because it’s the one that appears most often, but in reality you need to be more subtle than that. Uh if it went completely off the rails and started, you know, uh spewing the text of Moby-Dick to the user, that is a complete failure, a very severe failure, uh and that uh is more important than a possible hallucination. So you need to sort of multiply your severity times your frequency to figure out your priority when you’re deciding which failure to look at first uh
48:35 — and fix the expensive frequent failures first. So the expensive stuff is where it’s gone completely off the rails, and like minor hallucinations you can probably get away with. Uh I wanted to introduce the Swiss cheese model. Uh this diagram blatantly stolen from Anthropic’s blog post. Uh it’s a concept borrowed from safety engineering. Uh you have to imagine each layer of defense as a slice of Swiss cheese. There is no set of evals that are going to be perfect. Uh they’re all going to have various flaws and
49:05 — various holes. But if you layer them, uh the holes aren’t going to line up. Uh so eventually you will stop all of the possible ways that your agent can fail. Um stacking your eval layers works like that as well. Uh so your code eval catches a bunch of really basic stuff first. Your LLM as a judge catches reasoning gaps but misses subtle hallucinations. Your human review uh captures things that got through the first two layers, uh but it can’t scale to every trace. So no single eval method is going to capture
49:36 — all of these ways that it can fail. But if you use them all at the same time, uh you’re going to do a pretty good job of evaluating your agent. So let’s talk about the actual evals. Let’s write some real evals. Uh and starting with the simplest type of eval, which is a code eval. Uh our agent is supposed to analyze stock tickers. So the most obvious thing to do is to test whether it mentioned the stock ticker in the output at all. Was it talking about the company that I
50:06 — was expecting it to talk about? Uh that is a completely deterministic test. I don’t need an LLM to do it. I can just search for that string. Uh so that is what I have done in the notebook. Uh and I’m just going to walk very quickly through how that works. So first uh we have got our spans from the AIE Claude financial agent. Uh this chunk here is us getting the parent spans. Uh every single log line is a span. Uh the
50:38 — ones that are at the root uh in that graph that I showed you uh they are the ones that have no parent. So that is how you detect your root spans. The important ones are the ones that have no parent. So that is what it’s checking for there: the ones that have no parent ID set. Uh and we found 13 top-level spans, which is correct. Uh this is what a code eval looks like. Uh we use the create_evaluator decorator. Uh we give it a name. This is what it’s going to be sent to Phoenix as. This is the name of the eval. Uh and we give it a kind, which
51:10 — is code, which is to uh tell Phoenix that this is a simple deterministic eval. It doesn’t need to uh run it as an LLM. Uh and then it’s just a regular Python function. You can also write them in TypeScript if you prefer. Uh this is a very basic regex. I went looking for uh phrases in capital letters uh and then I excluded some phrases that are not uh stock tickers. Uh and then just went basically looking for uh the uh stock ticker in
51:40 — the text of the output. Uh and that is a very simple and very easy and extremely effective uh first line of defense. Did it write anything at all that mentioned the company that I was asking about? So, uh, we can get the results here. This is how you actually run that eval. You run evaluate_dataframe. Uh, you give it the evaluator, which is the mentions-ticker function that we just wrote. Uh, and you give it the dataframe of parent spans, which are the ones
52:10 — that we just did. Uh, one thing you should notice here is that we are doing it with suppress_tracing. Uh we are using Anthropic to do the eval, um so uh by default Phoenix is going to pick up that Anthropic is being run, uh and so if you don’t suppress tracing there you will get the traces from Anthropic running the eval itself, which is very confusing. So you just tell it, when you’re actually running the eval, don’t capture the traces. That’s why that’s there. Uh
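A minimal sketch of the two steps just described, root-span detection and the ticker-mention check, written in plain Python with no Phoenix imports so that it runs on its own. The span dictionary keys (`span_id`, `parent_id`), the exclusion list, and the `score`/`label`/`explanation` return shape are assumptions for illustration, not the notebook's actual code.

```python
import re

# Two to five capital letters as a whole word. Single-letter tickers are
# deliberately ignored here to keep false positives down (an assumption).
TICKER_PATTERN = re.compile(r"\b[A-Z]{2,5}\b")

# Capitalized words that look like tickers but are not (illustrative list).
NOT_TICKERS = {"AI", "CEO", "CFO", "USD", "EPS", "IPO", "ETF"}


def root_spans(spans):
    """Root spans are the ones with no parent."""
    return [s for s in spans if s.get("parent_id") is None]


def extract_tickers(text):
    """Find ticker-like words in the query, minus the known non-tickers."""
    return {w for w in TICKER_PATTERN.findall(text) if w not in NOT_TICKERS}


def mentions_ticker(input_text, output_text):
    """Deterministic check: does the report mention every ticker asked about?"""
    expected = extract_tickers(input_text)
    missing = sorted(
        t for t in expected
        if not re.search(rf"\b{re.escape(t)}\b", output_text)
    )
    passed = bool(expected) and not missing
    if not expected:
        explanation = "no ticker found in the input"
    elif missing:
        explanation = "missing from output: " + ", ".join(missing)
    else:
        explanation = "all expected tickers mentioned"
    return {
        "score": 1.0 if passed else 0.0,
        "label": "pass" if passed else "fail",
        "explanation": explanation,
    }


if __name__ == "__main__":
    print(mentions_ticker("Research AMZN and focus on AI strategy",
                          "AWS continues to grow its cloud business."))
```

Returning an explanation alongside the score is what makes a failing row readable later without reopening the trace.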
52:40 — we can see that it passed 11 times out of 13, so uh the Tesla report and the Amazon report, neither of them actually mentioned uh the stock ticker of the company that we were expecting. Uh so even this very very basic uh deterministic uh code eval has found a problem with my agent. Um uh so why did it do that? Uh we have to
53:12 — look at the spans to find out. If you look at the spans, you’ll find that Tesla is one of the ones where it failed because it was writing stuff to disk. Uh Amazon was more interesting. Uh it wrote the entire report about AWS and didn’t mention Amazon at all. Uh it just assumed that I meant AWS, the one part of Amazon, and not all of Amazon. Uh so uh that is a learning that we wouldn’t have got if we were just uh testing
53:42 — input and output, right? That’s what the explanations are for. Um but what this proves is that uh code evals aren’t just toy examples. So JSON parsing, length limits, forbidden phrases like “as an AI language model I cannot”, uh you can test for those deterministically. Uh and a code eval doesn’t have to be a simple string operation. Um it can query a database to verify that your product pricing was correct. Uh it can call an API to check a stock price. Anything where the grading always gives the
54:12 — same answer, you can use a deterministic uh code-based eval uh to get the answer and test things. So, one of the things that you should be careful of when you’re writing a code eval is that you should be testing what the agent produced and not the path it took. I mentioned this earlier. Uh, so don’t get it to look for all of the steps that you think that it should have taken to be able to get to that answer. Just check whether it got the answer you were expecting. Uh, you can also be uh flexible in your parsing of strings. So
54:44 — if one of the things that you were asking for was like a time estimate, uh the agent might say 2 hours or it might say 120 minutes or it might say a very large number of seconds, all of which are the same answer as two hours. You can get your code eval to accept all three of those things as valid output. Uh and the path the agent took matters less than where it ended up. Uh you want “the answer was right, but you got there the wrong way” not to be something that your eval is
55:15 — looking for. So let’s move on to step five, which is the built-in evals. Uh in this case, we are going to try using one of Phoenix’s built-in evals, which is the correctness evaluator. This is using uh an LLM as an eval uh to do things that no deterministic check could do. Uh in this case we want it to tell us whether or not the output of the agent was correct. Was it
55:45 — factual? Uh so you know the correctness uh evaluator seems like the one that we should go to. Um every LLM-as-a-judge eval has three parts. It has a judge model, which is the LLM that does the grading. It has a prompt template, uh also called a rubric, which is the criteria the judge applies. Uh and it has the data, which is the examples being evaluated. Uh Phoenix keeps those three things separate so you can mix and match them, which is very useful. You can try uh the same evals with different models. You can try the
56:16 — same model with different evals uh and compare how your evals are working. Uh so like I said, correctness checks whether a response is factually accurate, complete and logically consistent. It is a prompt that we wrote that is built in. You can inspect the prompt that we’ve written so that you know that it’s not just a black box. Uh and we also have things like tool selection. So did the agent pick the right tool? Tool invocation. Did it pass
56:46 — the right uh arguments? Uh we have uh built-in evals for document relevance, refusal detection, uh lots of things that you would want to check in a real eval uh that you don’t have to write yourself because every eval suite needs to run these tests. So we built them in. Um what we want to know is uh was it correct? So uh let’s set up the judge. Like I said, we’re going to use the built-in correctness evaluator. First we need an LLM to
57:17 — give to it. We are going to pass that LLM to the correctness evaluator which we’ve just pulled in. If you wanted to, you could uh print out the full prompt here. Uh we’re going to suppress tracing again and then we’re going to say evaluate_dataframe just like we did before. We’re going to give it the same set of parent spans and we’re going to give it a different evaluator, in this case the correctness eval. Uh and we’re going to display the first five results from that. Uh this is a blizzard of information. Um once you’ve got the information you have
57:47 — to send the results to Phoenix and that is what is happening here. So you turn uh the results of the uh eval into a data frame and then you pass that data frame back to Phoenix using log_span_annotations_dataframe. Um the judge that we’re using to do this is Sonnet. Like I said earlier, you want to use a more capable LLM uh to do the judging than you used to do the actual agent in the first place because that LLM
58:18 — is going to be smarter and it’s going to catch things that the original LLM did not. Uh if you are already running an agent in production with Opus, there is nothing better than Opus that you can use to be your evaluator. Uh but if you are running an agent in production with Opus, you probably don’t care. Um the other thing that we have to do is we uh rename our input and output uh so that uh the um agent understands. I’ll show you that in a second. Um
58:52 — so yes, my speaker notes and what I’m actually doing on the thing: I’m going so much faster in the notebook than I am on my speaker notes. So I apologize. Um, so let’s actually run that result uh and look at it in Phoenix.
59:24 — Uh you can see that I have a correctness annotation next to every single eval now. Uh and there’s a problem which is if I look at the correctness uh overall I’ll see that every single one of them was zero. Uh all of your evals are incorrect. Why are your evals incorrect? We have to figure out why that is the case. We need to click in and we need to look at the explanations. So let’s look at our annotations. Let’s look for correctness and let’s actually just read the annotation. Uh which is that uh the
59:57 — output presents highly specific financial figures for Q1 financial year 2026. Uh but the problem that it keeps having is that it thinks it’s 2025 because it is a model trained in 2025 and it doesn’t know what date it is and it doesn’t have up-to-date information. Our correctness eval is in this case complete garbage because it is trying to base it on its knowledge of Q3 2025, or whenever it was that Anthropic trained Sonnet, uh and it doesn’t know anything about the 2026 look-ahead stuff. So our correctness eval in this
1:00:27 — case is completely useless. Uh that is itself a learning, right? If you were asking it more general knowledge questions that could be judged from uh an entire crawl of the internet uh it would have done much better, but in this case we’re asking it for very very up-to-date, very future-looking stuff. Uh and our built-in eval can’t do it. Uh so it tells us that we need a different eval. What we need is a faithfulness eval. Uh the faithfulness eval is uh
1:01:00 — basically, if you’re familiar with RAG applications, uh it’s going to check whether or not the output of the agent was based on the information that we gave it. So conveniently, uh suspiciously conveniently, I split up our agent earlier uh into two steps, one of which does the research and one of which does the output. That means we can take the step where it did the research, take that output and give it to our uh faithfulness evaluator and say, based on this research, did it correctly
1:01:30 — write a report? Did it write a report that is based on this research and only this research? Did it stick to the source material? Uh so let’s see what that looks like. Uh we get the same set of child spans, exactly the same set, uh and we uh have to do the input massaging that I mentioned. Uh we have to take the inputs and turn them into the inputs that the faithfulness evaluator is expecting.
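A minimal sketch of that input massaging, assuming spans arrive as plain dictionaries. The keys `span_id`, `parent_id`, `start_time`, `input` and `output` are hypothetical names chosen for illustration; the notebook's real column names may differ. The idea is to pair each top-level run with the output of its first child turn (the research step) and call that the context.

```python
def build_faithfulness_rows(spans):
    """Return one row per root span with input, output and context.

    context is the output of the earliest child turn only, i.e. the
    research step, which the report is supposed to be grounded in.
    """
    children = {}
    for span in spans:
        if span["parent_id"] is not None:
            children.setdefault(span["parent_id"], []).append(span)

    rows = []
    for root in (s for s in spans if s["parent_id"] is None):
        turns = sorted(children.get(root["span_id"], []),
                       key=lambda s: s["start_time"])
        if not turns:
            continue  # nothing to ground against, so skip this run
        rows.append({
            "span_id": root["span_id"],
            "input": root["input"],
            "output": root["output"],
            "context": turns[0]["output"],
        })
    return rows


if __name__ == "__main__":
    example = [
        {"span_id": "run-1", "parent_id": None, "start_time": 0,
         "input": "Research KO", "output": "Final report text"},
        {"span_id": "write", "parent_id": "run-1", "start_time": 2,
         "input": "Write the report", "output": "Final report text"},
        {"span_id": "research", "parent_id": "run-1", "start_time": 1,
         "input": "Research KO", "output": "Research notes"},
    ]
    print(build_faithfulness_rows(example))
```

Sorting the child turns by start time, rather than trusting list order, guards against spans being exported out of sequence.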
Uh and we have to produce a
1:02:01 — context column, uh which is the output of the uh first turn of the agent and only the first turn of the agent. So that’s what that code there is doing. Uh so that gives us input, output and context, which are the three columns that our faithfulness eval is expecting to run on. Uh so let’s actually run it and see what happens. We added the context. We can run it. Uh we suppress tracing again.
1:02:34 — We take our parent spans. We run them through the faithfulness eval. And we give it a data frame this time, which is the spans with that context column added. Uh and we get our evaluation of 13 out of 13. Uh and we get that all 13 of the 13 were faithful. Uh so correctness gave us zero out of 13 passes but faithfulness gave us uh 100%. Uh so we’ve managed to, you know, one-shot our uh faithfulness. Um
1:03:04 — two built-in evals, two different signals. That’s a really important lesson. Um, choosing the right eval can matter more uh than tuning your eval. Some of your evals are going to be completely useless. Some of the built-in evals are going to be exactly what you need. Uh, and that is the lesson there. Um, here’s what you see in Phoenix for that stuff. Uh, you can see our uh, faithfulness eval got 100% correct. Um, you can sort by score if you want to
1:03:34 — find your best performing or your worst performing. Uh, in this case, that’s not going to do anything because we just have a one versus zero score. Um, but you can filter. So, you can uh filter to show only failures. So, in this case, I’m going to uh look for my uh actionability ones. Uh, no, actually, I’m going to use my correctness ones. correctness
1:04:07 — incorrect. I’m sure you’re all being delighted by me watching it. So, this allows us to filter down uh to only the ones where we uh got a correctness of zero. Um and you can click into any of those failing traces to see the full execution, exactly what went wrong.
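The comparison and the failure filter can also be done offline once the eval results are in hand. This is a small sketch in plain Python, assuming each result is a dictionary with hypothetical `span_id` and `score` keys; the 0-of-13 and 13-of-13 figures are the ones reported in the talk, and the span IDs are placeholders.

```python
def pass_rate(results):
    """Fraction of results with a passing (non-zero) binary score."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["score"]) / len(results)


def failing(results):
    """Only the rows worth clicking into: score of zero."""
    return [r for r in results if not r["score"]]


# Placeholder rows matching the reported outcome of the two built-in evals.
correctness = [{"span_id": f"run-{i}", "score": 0} for i in range(13)]
faithfulness = [{"span_id": f"run-{i}", "score": 1} for i in range(13)]

if __name__ == "__main__":
    for name, rows in (("correctness", correctness),
                       ("faithfulness", faithfulness)):
        print(f"{name}: {pass_rate(rows):.0%} pass, {len(failing(rows))} failing")
```

With only 13 rows the filter is a convenience; with thousands it is the only practical way to find the traces that need reading.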
So, uh when there’s only 13 of them, that’s not super useful. If you had a thousand of them, that would be really useful because you’d be able to find only the ones that are failing and focus on them. Uh, but built-in evals are only your
1:04:38 — starting point. They give you an immediate sample, uh an immediate signal, without any prompt engineering. Uh, where you get uh into the real meat of evaluation is when you do custom code evals, sorry, custom LLM-as-a-judge evals. Um, I mentioned earlier that one of the things that we want our agent to do is to provide actionable results. We want it to tell us whether or not we should buy this thing. There’s no built-in eval in Phoenix which is “is this an actionable financial report?” We are going to have to write that ourselves. Which brings us to step six, which is writing a custom eval rubric.
1:05:09 — Uh it is worth talking about how to put together a good eval rubric before we actually do it. Uh every good eval prompt I’ve seen has uh five different parts, five important parts. Uh and I’m going to go through each of them. First one is defining the judge’s role. Uh you have to give the judge domain context. Tell it what kind of output it’s evaluating and what that output is supposed to accomplish. Uh this is an example of uh doing that. The “you are an
1:05:40 — expert financial analyst evaluator” is something that everybody does. Uh tests show it doesn’t make that much difference, but it does make some difference. So you may as well throw it in there. Um, what really makes the difference is the instructions and the uh rest of uh the uh prompt. So part two is your criteria and this is where you should be as explicit as you can possibly be. Uh this is where a lot of people underinvest. So like I said, don’t say “a good response” because that is an aspiration. “A good response”
1:06:11 — doesn’t have any context about what defines good and what defines bad.
A good response is helpful and accurate. sounds like you’re doing better there, but helpful is also completely vague and so is accurate. Nobody knows what that means. Uh so instead, list exactly what makes a report actionable. Uh and list exactly what makes it not actionable. Uh what makes these criteria specific enough to actually work? Uh they are specific and observable. Contains specific recommendations. That is something that the judge can check for. Includes [1:06:42] forward-looking analysis, not just historical data. That is a clear distinction. uh and on the not actionable side only summarizes publicly available data without interpretation that describes a particular failure mode that we’ve already seen when we were looking at our traces. We saw it you know just giving a general apple is good kind of I guess uh response which is not what we wanted. Uh and each criterion maps to something we actually observed in our traces earlier. That’s what I’m trying to get to here. You shouldn’t just be writing rules uh on the basis of [1:07:12] what you think would be a good idea. You should be writing rules based on the actual traces you’ve seen and the actual failures that you’ve observed. Part three of a good rubric is you should present the data clearly. Um we use begin data and end data in our Phoenix built-in evals. Uh if you’re using cloud, claude really loves XML. So you can use XML tags to begin and end and it’s very good at uh judging the start and end of things. uh but what you’re doing with them is you are labeling each data piece of data clearly so that it can tell the difference [1:07:42] between what these pieces of data are. So uh user query financial report clear boundaries uh reduce the chance that the judge is going to confuse uh the query with the report or vice versa. Part four is adding labeled examples and this is the part that most people skip and it is by far the most useful thing that you can add. 
uh uh if you take one thing away from this whole section on custom rubrics, it is that you should be adding examples because LLMs are really good at looking at an example, figuring out what the [1:08:12] pattern is from that example and then doing and following that example and they are much better at that than they are at getting a list of instructions that just say this is what you should do. They are really really good at looking at the example and following it. Uh and this is what an actionable example looks like. uh it has everything. So it has uh specific data. It has uh identifying a concrete risk with a number to back it up. It has specific recommendations. Accumulate below a certain price. That is what [1:08:42] actionable looks like. This is what not actionable looks like. Uh it’s not wrong. Nvidia is indeed a major player in the semiconductor industry. Uh it’s just not telling us whether or not we should buy the stock. Uh so that is just a description. It’s not a demonstration. Uh and the fifth thing that you should do is you should constrain the output. Uh we want to say is this financial report actionable or not actionable? Output one word is a thing that your rubric should nearly always have. Just tell me whether it’s actionable or not [1:09:12] actionable. Do not give me a long explanation. Do not give me a JSON block. Do not give me, you know, a markdown diagram. Uh binary is really clear and if you genuinely need more more nuance, uh you can give it three categories. You can say like it was incorrect, it was partially correct, or it was completely correct. Uh, a thing that a lot of people want to do because it seems like a good idea is a rating. They’re like, give me a percentage or give me a score from 1 to 10. Uh, these do not work very well. What is the [1:09:42] difference between an a six and a seven? Can you define exactly when you are judging something from 1 to 10? What is the difference between a six and a seven? 
Unless you put that into the rubric, the judge can’t do that either. So it’s just going to introduce noise into your ratings. So yes or no, and if you really need to, a maybe, but no more detailed than that. Another practical tip when you are writing an eval is to get the judge to think out loud about what it is doing. [1:10:12] Chain of thought for judges demonstrably improves how the judges work. So you tell it to explain its thinking first, before it outputs that label, and that is going to measurably improve the quality of the output that you get, because it’s going to do a bunch of token generation before it decides whether or not the thing is good. So if you are coding along, now is the time for you to see if you can write a better eval than I can. [1:10:43] This is what a custom LLM as a judge looks like. It is just a really big prompt. This has all of the things that I just mentioned: "you are an expert financial analyst", a list of things that make it actionable, things that make it not actionable, an example, a begin data and an end data block where I put in the input and the output, and "based on the criteria above, is this financial report actionable or not actionable?" You are probably not going to be able to [1:11:14] write a better prompt than this off the top of your head, because I had a lot of goes at it, but it’s fun to try. So feel free to plug in your examples of what an actionable report should look like. And then we’re going to write it down. To actually use this once we’ve written our prompt, we are going to use a helper called the classification evaluator. This creates an LLM as a [1:11:45] judge for Phoenix, and we only have to give it four things. We have to give it a name, actionability, which is the label it’s going to show up with in Phoenix. 
Uh, we have to give it that prompt template, which is the thing that I just showed you above. We give it an LLM because it has to do the judging somehow. And then we give it choices, uh, which are labeled with scores. So, it’s either actionable, which we’re saying is a 1.0, or it’s not actionable, which we’re saying is zero. Once again, we suppress tracing. Yeah. » Sorry. So, what about the channels? I think we’re not saying yeah first write [1:12:15] your reasoning and then and then just » That’s true. I left that out. I told you you should and then I didn’t. Well spotted. Somebody’s actually paying attention. This is amazing. Uh uh because it’s really warm in here and it’s like 4:30. Uh so let’s look at the scores and see uh what stuff came in as an actionable and what did not. Uh this is a capability eval. I was talking about regression eval versus capability eval. This is a perfect capability eval because it’s not doing very well. Uh in uh six I think of [1:12:49] the cases uh it came up with actionable stuff and in the rest of the cases it came up with not actionable stuff. Uh that means it has a hill to climb. That means that we can uh tell the agent to get better. Um uh and it’s actually going to have some headroom to get better. Um, so we log the annotations back to Phoenix. Uh, and now we can look at Phoenix and you can see the actionability scores inside of Phoenix. [1:13:20] We can do the same filtering that we did before. Uh, actionability. I did not think about when I was doing this whether or not I was going to be typing with one hand. I would have chosen shorter labels. Right? So that’s actionable or we can say not actionable and we can get the ones where it failed. Uh and we can like we did with the other ones we can click through uh to our annotations look for actionability and get an explanation. So we can say the [1:13:51] financial report is described lacks a concrete buy, sell hold recommendation. Uh haiku is messing up here. 
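The five-part rubric and the label-to-score choices described above can be sketched in plain Python. This is a minimal illustration, not the workshop's actual prompt and not the Phoenix API: the rubric wording, the example reports, and the helper names `build_prompt` and `parse_label` are all assumptions made for the sketch. It also includes the reasoning-first instruction that the audience member pointed out was missing.

```python
from string import Template

# Hypothetical rubric text. The five parts: role, criteria, delimited data,
# labeled examples, and a constrained output. Wording is illustrative only.
RUBRIC = Template("""\
You are an expert financial analyst evaluator. You are judging whether a
financial report helps the reader decide what to do about a stock.

ACTIONABLE reports:
- Contain a specific buy, sell, or hold recommendation
- Include forward-looking analysis, not just historical data

NOT ACTIONABLE reports:
- Only summarize publicly available data without interpretation

Example (actionable): "Margins fell two points year over year, a concrete
risk. Recommendation: hold, and accumulate below a stated price."
Example (not actionable): "The company is a major player in the
semiconductor industry."

[BEGIN DATA]
User query: $input
Financial report: $output
[END DATA]

First explain your reasoning in one or two sentences.
Then, on the final line, answer with exactly one label:
actionable or not actionable.
""")

# Each allowed label maps to a numeric score, as in the talk.
CHOICES = {"actionable": 1.0, "not actionable": 0.0}


def build_prompt(user_query: str, report: str) -> str:
    """Fill the rubric with one query/report pair."""
    return RUBRIC.substitute(input=user_query, output=report)


def parse_label(judge_reply: str):
    """Read the label from the last non-empty line of the judge's reply.

    Returns (label, score), or (None, None) if the reply is off-format.
    """
    lines = [ln.strip().lower().rstrip(".") for ln in judge_reply.splitlines() if ln.strip()]
    if not lines or lines[-1] not in CHOICES:
        return None, None
    return lines[-1], CHOICES[lines[-1]]
```

Putting the reasoning before the label means the explanation is generated first, and reading only the final line keeps the result a clean binary even though the reply is longer than one word.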
This is what this is why we picked haiku because it was going to mess up. Uh and uh you can filter and sort by anything. So you can filter by latency, you can filter by number of tokens, you can filter by cost. You don’t just have to you don’t have to filter by just your labels. So uh one of the ways that you can use that is for instance if you have an agent uh that is getting the right answer but very expensively, right? It’s doing a hundred web searches and [1:14:21] eventually getting your answer. Uh that is not good in production and you can go and look for those uh expensive calls, those expensive operations and tweak your prompt such that it does things in a cheaper way. So you don’t have to just uh Phoenix is giving you a bunch of things uh to search for that are not just uh the stuff that you put in. It’s giving you a bunch of extra information. Uh like I said uh this one uh it said that it was not it lacked a concrete bell [1:14:52] buyell hold. Uh if we filter through to other ones we can get other actionability explanations. Uh the report presents strong financial data and forward-looking analysis uh and provides context uh but the the absence of an explicit recommendation or actionable investment directive places it in the act in the not actionable category. Great. That is what we wanted this thing to do. So uh you should be treating your evals like code. uh your evals, the wording of [1:15:24] that prompt that we just put together of how exactly to measure whether or not uh your agent was doing a good job uh is going to drastically change word by word inside of that prompt because LLMs are so non-deterministic. So you should be versioning your prompts. You should be storing them. You should make sure that you know like five versions ago, what did this prompt do if things go radically off the rails? uh and you should test them on examples where you know the right answer. 
So if the judge disagrees with your human [1:15:54] labels on 40% of examples uh that means your the prompt you’ve written is not very good. Uh you can iterate on rubrics uh without touching code. Uh and valid that you haven’t validated is just a fancy way of being wrong at scale. Um, one other thing that I want to flag uh in writing custom eval is in custom LLM as a judge is the god evaluator. It is very uh is very tempting to write a single LLM evaluation that tries to test [1:16:24] for everything. So it’s going to test for accuracy, tone, completeness, policy, compliance, formatting. Uh, and don’t do that because it is a nightmare to calibrate. Uh, if it fails, you don’t know why it has failed. you’d have to get it to output an extra word saying this is the one of the five things I was testing. Uh instead uh split your evaluator into one thing per dimension. Uh so test for accuracy, test for completeness, test for tone. Uh but test for all of those with a separate LLM eval. Um [1:16:57] and uh I want to talk about guardrails versus northstar metrics. Uh some evals are guardrails like they’re ship blockers like uh if uh the agent hallucinates a stock price that’s probably a hard fail for this one, right? We don’t want it to say buy when it should have said sell. Uh but if we say that you should always recommend complimentary investments, that is a nice to have, not a deal breaker. Uh so you need to know which of your eval is which, which ones should be ship lockers, and which ones are just informative. [1:17:27] Uh meta evaluation is the thing that we’ve been dancing around so far. Uh how do you know that this custom rubric that you’ve written is actually working? How do you know that uh these this thing where this this code that you wrote that is now saying whether stuff is actionable or not actionable is trustworthy? Uh the way to think about it is uh your judge is a classifier. it is an an ML classifier that is the mental model. 
Uh [1:17:57] so it takes an input and it makes a prediction. Um and just like any classifier you can measure its performance by uh comparing its predictions against ground truth. Uh so in this case your own human judgment you can use to uh check the LLM’s work. Uh the problem with human judgment is that it involves a lot of human effort. uh you have to actually look at the results of your evals and compare them manually saying do I agree with it in this particular instance uh if I go into a [1:18:29] span in Phoenix uh I can add an annotation uh so in here I could create a new annotation which is like does the human trust this and say yes or no uh and I can do this programmatically or I do this in the UI. Uh, and I can put in a whole list of human evaluations of my evaluations, whether or not I think these things are actionable. And then I can do a comparison of the LLM as a [1:18:59] judge to my human annotations. Uh, I’m not going to go into the detail of how that is done because it is boring. But um, what we’re doing here is we are building a golden data set. Uh, golden data sets are incredibly helpful. I mentioned them right at the beginning. Uh they are a way of measuring whether or not your evaluator is doing a good job. Uh and the way that you do this is the same thing that the LM LLM did. You should give yourself real concrete criteria for deciding whether or not uh [1:19:31] this thing is a success or not a success. So don’t just go in as a human and be yes, it worked or no, it didn’t work. Uh because that is you’re going to have the same problem the LLM is having. You’re going to be arbitrary. uh you should be essentially reading the same prompt that you’ve given the LLM and deciding whether or not you believe uh according to these rules that you’ve given yourself whether or not this thing is or is not actionable. So give yourself examples, eliminate ambiguity uh and uh eliminate the chance to get [1:20:02] lazy. 
Um as you do this you should build you should uh keep your tasks unambiguous. Um, if your agent scores 0% consistently, that’s almost always a broken task. That always means that, you know, your agent is failing to do anything. Uh, so it’s not a good eval. Uh, if you, this can happen if you’ve made your eval task something only a human could do and your LL you’ve made it something that your agent couldn’t possibly manage. Uh, so for each, the [1:20:33] way to avoid that is for each task, create a reference solution. decide in advance what it was that you wanted an LLM to have output in this situation and then you’ll be able to judge whether or not the LLM is getting anywhere close to that. Uh you can also test in both directions. You can test uh cases where the behavior should occur and cases where it shouldn’t. Uh and you should only test uh for instance if you if you had a test that was does it search the web, that is absolutely a thing that you’d want a financial agent to do. uh [1:21:03] you could accidentally create an agent that always that cheats the eval. You could create an agent that always searches the web whether or not it needs to. So you need to make sure that you have test cases in your eval set that are this is a case where it doesn’t need to search the web uh and did it not search the web when that happened. And your golden set is not just test data. It is the encoded judgment of the people who know your domain the best. Uh so it’s going to grow as you find more test cases. uh and today’s production failure are going to become tomorrow’s [1:21:33] test case. Uh if you’re doing this for real, the other thing you should do is you should split your golden data set. Uh it is possible to create an eval that is overfit to your golden data set. It passes your golden data set, but it hasn’t properly generalized. It’s just accumulated a bunch of examples of exactly your golden data set and so it passes them. 
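One way to hold out part of a golden data set is sketched below. The function name `split_golden_set` and the default 25% holdout are assumptions for illustration; the point is only that the held-out examples are never looked at while the judge prompt is being tuned.

```python
import random


def split_golden_set(examples, holdout_fraction=0.25, seed=0):
    """Shuffle deterministically, then return (tuning_set, holdout_set)."""
    shuffled = list(examples)
    # A fixed seed keeps the split stable across runs, so the holdout
    # set stays unseen from one prompt revision to the next.
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


golden = [f"example-{i}" for i in range(20)]
tuning, holdout = split_golden_set(golden)
print(len(tuning), len(holdout))
```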
So you should split your golden data set, you know, 75/25, into the ones that you are training it on and the ones that you test it against. So every time you make a change [1:22:04] to your prompt, you can then run against the 25% that it’s never seen before and you can see whether or not it actually does a good job. So let’s run our actionability judge on the exact same examples that we did before. I’ve used a span query to do the filtering, just like we can do in [1:22:35] the UI, and I’ve got only the failing ones. Sorry, no, I’ve got the human actionable ones. And I’ve done this line here where I’ve taken the complicated attributes.input.value and turned it into input, and attributes.output.value and turned it into output, because that is what our eval is expecting. So now let’s see whether our judge disagrees or agrees with the [1:23:06] annotations that I put in about human actionable stuff. What we get here is that two out of six times my human actionable label and my judge’s actionable label have disagreed. I will confess that what I didn’t do is actually come up with human actionable labels. I just assigned human actionable versus [1:23:36] not actionable kind of at random so that I would have some data to point at, because I only have six answers here, six places where it failed. And so it’s really not enough to do a real set. A real set would be 20, 50, 100, 200. So to have actual data to look at, I just put stuff in at random. But if I had done it for real, what this would [1:24:06] give me is a sense of whether or not my human judgment is matching up to the judgment of the LLM. This is my golden data set testing against the LLM as a judge and figuring out where they disagree. Which brings us to rubric iteration. 
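The comparison between human annotations and judge labels comes down to a join on span ID followed by a count of mismatches. A minimal sketch, assuming both sets of labels have already been pulled out into dictionaries keyed by span ID (the IDs and labels below are made up to mirror the two-out-of-six result, and `agreement_report` is a hypothetical helper):

```python
def agreement_report(human_labels, judge_labels):
    """Return (agreement_rate, disagreeing_ids) over spans that have both labels."""
    shared = sorted(set(human_labels) & set(judge_labels))
    disagreements = [sid for sid in shared if human_labels[sid] != judge_labels[sid]]
    rate = 1 - len(disagreements) / len(shared) if shared else None
    return rate, disagreements


# Made-up labels: six spans, two disagreements.
human = {"s1": "actionable", "s2": "not actionable", "s3": "actionable",
         "s4": "not actionable", "s5": "not actionable", "s6": "actionable"}
judge = {"s1": "actionable", "s2": "not actionable", "s3": "not actionable",
         "s4": "not actionable", "s5": "actionable", "s6": "actionable"}

rate, disagreements = agreement_report(human, judge)
print(f"agreement: {rate:.0%}, disagreements: {disagreements}")
```

The list of disagreeing span IDs is the useful part: those are the traces to open and read when deciding whether the rubric or the human label is wrong.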
In the same way that we can take our [1:24:36] agent and improve it by improving the prompt, we can take our LLM as a judge and improve the LLM as a judge through iteration. To do that we need to think about precision and recall, which are more of those opaque ML terms that the researchers snuck into our AI engineering lives. They’re not that complicated. Imagine a spam predictor. A spam predictor is going to say whether stuff is spam or [1:25:08] isn’t, and you can compare your spam predictor against whether or not things are actually spam. So there are four possible outcomes. It can say that it’s spam and it is spam, which is a true positive. It can say that it’s spam when it’s not spam, which is a false positive. It can say it’s not spam and it’s not spam, a true negative. Or it can say it’s not spam and it is spam, a false negative, which means that you missed. Machine learning engineers have two ways of measuring this data, and they conflict with each other. So you have to [1:25:39] pick. You have to decide for your use case which of these things you want to optimize for. The first way is precision. Precision is the number of true positives out of the total number of predicted positives. If you’ve got high precision, that means you’ve made the false positives number small. That means you’re minimizing false positives. If you’re in a use case where false positives are really dangerous, that is what you want to do. That’s [1:26:09] great for spam, for instance, because you don’t want to send a real email to spam, and you are okay with getting a certain amount of actual spam in exchange for not doing that. But recall is the opposite. It is, out of the true positives plus the misses, what percentage you actually caught. For this one, to make it go up, you want to minimize the number of misses. 
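The four outcomes and the two ratios can be written down directly. This is a small sketch using the spam example, with made-up labels; the function names are illustrative and not taken from the workshop notebook.

```python
def confusion_counts(truth, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for actual, guess in zip(truth, predicted):
        if guess == positive and actual == positive:
            tp += 1      # said spam, was spam
        elif guess == positive:
            fp += 1      # said spam, was not spam
        elif actual == positive:
            fn += 1      # said not spam, was spam (a miss)
        else:
            tn += 1      # said not spam, was not spam
    return tp, fp, tn, fn


def precision(tp, fp):
    """Of everything flagged positive, the share that really was positive."""
    return tp / (tp + fp) if tp + fp else None


def recall(tp, fn):
    """Of everything that really was positive, the share that got flagged."""
    return tp / (tp + fn) if tp + fn else None


# A cautious predictor: it flags only one message, and that flag is right.
truth     = ["spam", "spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham",  "ham",  "ham", "ham", "ham"]

tp, fp, tn, fn = confusion_counts(truth, predicted, positive="spam")
print(precision(tp, fp), recall(tp, fn))
```

The cautious predictor shows the trade-off: it never mislabels real mail, so precision is perfect, but it misses two of the three spam messages, so recall is poor.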
You want to minimize the number of false negatives. Uh a good example of when you do this is if you were doing like uh health stuff. If you were doing [1:26:40] cancer screening uh you would absolutely want as few misses as possible. You were okay with a ton of false positives uh when you’re doing cancer screening as long as you don’t miss somebody who actually has cancer. So uh those these two uh measures are going to go in opposite directions if you optimize. So you have to pick one for your use case and decide how to optimize it. Uh so let’s look about look at what that looks like uh in practice. Uh I wrote a bunch of code here that calculates all of these things. Uh and it came up with uh [1:27:11] precision and recall uh for my judge. So uh the precision of this one uh is really good. When the judge says fail, is it right? Uh 100% of the time it is correct. Uh and its recall is really bad. Uh which is of all the real fa fails, how many does it catch? Uh this is what you would expect. You would not expect to get 100% on one of them. So I have made something uh that is really good at precision. It’s really good at avoiding false positives. Um [1:27:41] uh with such a small sample, these numbers are basically useless, right? you want, you know,if you want a golden data set of 50, 100, 200 things. Uh, and then you’re going to get real numbers for precision and recall. This is very much a toy example. Um, and in most eval scenarios, you’re probably going to want to prioritize recall because it is better to flag a few false positives than to miss real failures. A false positive just means you have to review something as a human. [1:28:11] Uh, that’s actually fine. Whereas a missed failure uh means that uh bad output reaches your users. But like I said, there are some use cases uh for instance, medical use cases where you’d want to do the opposite. Um a few known pitfalls uh with using LLM judges are worth keeping in mind. One is position bias. 
If you present two options, the judge tends to favor, depending on which model it is, either always the first one or always the last one. There’s length [1:28:41] bias. LLMs prefer longer responses over shorter responses just in general. There’s confidence bias. Your judge can get fooled by a response that sounds confident, just like humans can. And there’s self-preference bias. If you’re using the same model to judge as you are to generate the output in the first place, it tends to like its own output, which is another one of the reasons that we use a different model as a judge than we do as the [1:29:11] one that is running the agent itself. You can also consider using a completely different provider. So you can use Claude to do your agent and OpenAI to do your evals, and you’re going to get more reliable evals than you would if you use Claude for everything. How do you know if these biases are affecting your results? You have to track judge accuracy across different categories of inputs. So if the judge always passes long responses and always [1:29:41] fails short ones, then you know that you’ve got a length bias problem. If it passes everything from one category of query and fails everything from another, you have to dig into why. And your benchmark here should be human performance, not perfection, which is the last thing I want to say about meta evaluation. If you give two humans the task of producing your golden data set and say, tell me whether or not, for instance, this report is actionable or not actionable, they’re not going to agree all the time. In fact, they’re going to disagree a surprising amount of the time. Inter-rater reliability is [1:30:13] often as low as 0.2 or 0.3. So give two experts the same output and the same rubric, and they will disagree. 
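Tracking judge accuracy across categories of inputs can be as simple as grouping the golden set before scoring it. A hedged sketch: the record fields (`output`, `judge`, `human`), the 200-character threshold, and the function names are assumptions for illustration, not part of the workshop code.

```python
from collections import defaultdict


def accuracy_by_bucket(records, bucket_fn):
    """Share of records where the judge matches the human label, per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [matches, total]
    for record in records:
        bucket = bucket_fn(record)
        totals[bucket][1] += 1
        if record["judge"] == record["human"]:
            totals[bucket][0] += 1
    return {bucket: matches / total for bucket, (matches, total) in totals.items()}


def length_bucket(record, threshold=200):
    """Bucket a record by the length of the agent output being judged."""
    return "long" if len(record["output"]) >= threshold else "short"


# Made-up golden records: the judge is right on short outputs but passes
# every long one, regardless of what the human said.
records = [
    {"output": "x" * 50,  "judge": "fail", "human": "fail"},
    {"output": "x" * 60,  "judge": "pass", "human": "pass"},
    {"output": "x" * 400, "judge": "pass", "human": "fail"},
    {"output": "x" * 500, "judge": "pass", "human": "pass"},
]

by_length = accuracy_by_bucket(records, length_bucket)
print(by_length)
```

A large gap between buckets is a prompt to go and read the traces in the weaker bucket. The same function works for any other grouping, such as query category, by swapping the bucket function.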
So if your LLM judge achieves higher consistency with you than that, if it achieves 0.4, it’s doing really, really well. So the judge disagreeing with you is not necessarily a reason to throw out your eval. The problem is if the judge disagrees with you more often than a human would disagree with you. The other thing is that your failures should seem fair. This is [1:30:43] something that Anthropic brought up when they talked about meta evaluations. When a task fails, it should be clear what the agent got wrong and why. So if you look at a failing trace and think, that answer looks fine to me, the problem is probably the eval, not the agent. This actually happened at Anthropic. Claude Opus initially scored 42% on a benchmark called CORE-Bench. That seemed low, so they went and looked into what CORE-Bench is actually doing, and they found multiple problems, not with the model but with the eval itself. [1:31:15] So, for instance, the eval was checking for an answer of 96.12 and Claude was giving it the answer of 96.124991, and it was saying no, that’s not right, because that’s not what I was expecting. After fixing the eval, Opus’s score jumped to 95%. So your evals can be judging things as wrong when they are just being too strict, or they are judging something that is not what you were trying to judge. And the lesson here is that you should [1:31:46] not take evals at face value. You should always be looking into the explanations of your evals. You should be looking into the output of your evals. You should be checking it against the golden data set to make sure that you’re actually improving the agent. So step seven, and the last thing, I’m sure you’ll all be glad to know, is data sets and experiments. This is how you go from just measuring whether things are wrong to actually improving [1:32:16] your agent. 
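The too-strict numeric check is easy to reproduce in a code eval. The sketch below contrasts an exact comparison with a tolerance-based one, using the two numbers from the story. It illustrates the failure mode only; it is an assumption about what a fairer check could look like, not a description of how the benchmark itself was fixed.

```python
import math


def strict_match(expected, actual):
    """Exact comparison: any extra digits count as a wrong answer."""
    return str(expected) == str(actual)


def tolerant_match(expected, actual, abs_tol=0.005):
    """Accept answers that agree with the reference to about two decimals."""
    return math.isclose(float(expected), float(actual), rel_tol=0.0, abs_tol=abs_tol)


print(strict_match(96.12, 96.124991))    # the overly strict eval
print(tolerant_match(96.12, 96.124991))  # a fairer eval
print(tolerant_match(96.12, 96.5))       # still rejects a genuinely different answer
```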
Uh so you’ve found some failures, you’ve read the explanations, you know what to improve. Um, so you change the prompt and then what? How do you know that your fix actually worked? How do you know that you’ve improved your agent? Uh, how do you know that you didn’t break something that was working before? If you just run the agent again on a couple of examples and eyeball it, that’s just going back to vibes. You need a systematic way of testing whether or not your changes to your evals uh to your agent have actually improved your evals in a systematic way. And that is [1:32:47] what uh experiments are for. So for this we go to a completely different part of the Phoenix UI. We go to the Whoops. There we go. You didn’t see that. Uh we go to our experiments uh evaluation. To do that uh I’m going to you can go to to produce your data set. uh you go to your uh uh to your traces and you take for instance a bunch of failing traces uh [1:33:19] and you click add to data set. Uh you can create a new data set using this little plus here. You can or you can add to an existing data set and that gives you in this case uh AI agent financial failures. So you can click through to examples and you can see I’ve taken the six times when our actionability trace failed uh sorry when our actionability eval failed uh and I’ve put them into a data set because this is what I want to do. I don’t want to run all 13 every time or you know in production all 1,000 every time. I want to run my ex my new [1:33:50] prompt against only the times uh that it failed and I want to see if it’s getting any better. Um so now we improve the agent. 
uh we can look at what the evals told us uh the actionability eval said that some reports were not actionable because they summarized data they didn’t give explicit recommendations uh so we can update both prompts the research prompt like I said now explicitly requires uh specific financial ratios uh it requires [1:34:20] recent news current price data uh and the writing prompt now explicitly demands a buy sell hold recommendation uh that’s happening here If you are still coding along, well, all the power to you. Uh, and this is where you can try and do a better job than I did of improving the agent. Uh, you can give it a better research prompt. You can get a give it a better write prompt. Um, notice how every change that I’ve made to the prompt here maps to a specific [1:34:51] thing that we found wrong in the evals. I’m not just randomly changing my e my agent prompt. I am changing it in response to specific things that we noted in the eval. So uh financial ratios news in the last six months uh a buy sell hold recommendation those were things that in the explanations from our previous LLM as a judge it said we’re missing and we’ve said okay include those things. So uh we are not just getting uh a notice that we are wrong. We are [1:35:21] getting direction from our evals on what we could do to do better and we are feeding that directly into our agent and making the agent better that way. Um this is datadriven prompt engineering. This is what uh Arise is all about. It’s about taking a bunch of stuff uh that uh LM tell us about what are what is failing and what is not failing and turning it into real improvements to our agent. Um so now let’s run an actual experiment. [1:35:51] Uh to do that you need a task for your experiment to run. Uh in this case it is our improved financial report. It’s basically exactly the same agent again. 
Uh and we’ve given it and we’ve taken that agent and we’ve put it into a task function uh which just runs that agent with the input and output that we’re expecting. Uh we’ve created a new classification evaluator uh again with the label actionability. Uh we’ve given it uh the same actionability template. Um [1:36:22] and the uh created this new evaluator and now we are running uh our we are fetching our data set the of just the failures. uh and we are going to uh run our uh async client which is faster basically uh against that set of just the failures. Uh so you can see here uh tada uh my [1:36:53] appro my improvements to my agent have uh one-shotted the agent from getting uh five out of the 13 responses as actionable or not actionable. uh and it is uh all six of my previously failing uh tests are now running correctly. So you can see that here uh you can see the results in this graph. What I’ve done here is not how it would look in production. If you were in production, you would have got a you would have made some very small change to your prompt [1:37:23] and you’d have got some very marginal improvement across a thousand sets and you would get this small you would get this graph of uh your agent slowly getting better at all of these things. I’ve shoted it here because uh you know it’s already been 90 minutes. We need to get out of this thing sometime. Um but this is the hill that you’re climbing. This is this is literally uh how you get from zero to 100% score is you measure inside of your experiments. Did my prompt change prove prove [1:37:54] anything? Did my experiment get a higher score or a lower score than next time? Sometimes you’re going to make a change that’s going to make your score get worse. Uh and you’re going to have to undo it. Go back to your previous version of your prompt. Change something else. That’s why you treat them like code. Uh the key thing here is that uh a Phoenix experiment doesn’t care what your task does uh at all. 
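Stripped of the Phoenix client, an experiment is a loop: run a task over every example in a fixed data set, score each output with the same evaluator, and compare the aggregate between prompt versions. The sketch below is a plain-Python stand-in under loud assumptions: the two task functions are stubs rather than real agent calls, and `keyword_evaluator` is a crude substitute for the LLM judge so that the example runs offline.

```python
def run_experiment(dataset, task, evaluator):
    """Run task on each example, score it, and return (mean_score, rows)."""
    rows = []
    for example in dataset:
        output = task(example["input"])
        rows.append({"input": example["input"], "output": output,
                     "score": evaluator(example["input"], output)})
    mean = sum(row["score"] for row in rows) / len(rows) if rows else 0.0
    return mean, rows


def keyword_evaluator(_query, report):
    """Stand-in for the actionability judge: 1.0 if a recommendation word appears."""
    text = report.lower()
    return 1.0 if any(word in text for word in ("buy", "sell", "hold")) else 0.0


def baseline_task(query):
    """Stub for the original agent: describes the company, recommends nothing."""
    return f"Summary for '{query}': a large company with steady revenue."


def improved_task(query):
    """Stub for the agent after the prompt change: ends with a recommendation."""
    return f"Summary for '{query}': steady revenue. Recommendation: hold."


# The data set of previously failing inputs stays fixed between runs.
failures = [{"input": "Analyze AAPL"}, {"input": "Analyze NVDA"}]

baseline_score, _ = run_experiment(failures, baseline_task, keyword_evaluator)
improved_score, _ = run_experiment(failures, improved_task, keyword_evaluator)
print(baseline_score, improved_score)
```

Because the inputs and the evaluator are identical across both runs, the difference in mean score can be attributed to the task change. The task can also be any slice of the agent, such as the tool-calling step alone, which is cheaper than rerunning everything.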
So in this case, the task that I gave my experiment was the run the full agent again against a new [1:38:24] set of data. But if uh if our eval had told us uh that our tool calling was bad, I could have just run an eval that only runs the tool calling and that would have been much cheaper and much faster than uh an eval that runs the entire agent. Uh so you can run experiments against a chunk of what your agent is doing, a small subset of what it’s doing and improve that part uh without having to expensively run your whole agent every single time. Uh the power of experiments is [1:38:55] controlled comparison. Uh so uh you get the same inputs, the same evaluators. The only thing that’s changed is the uh agents prompts. And that means that any difference in scores is attributable to your change. uh you’re not wondering whether it scored higher because of your prompt change or because the web search happens to return better results this different this time. Uh you’ve eliminated a major source of variation. Um ideally what you do is you’d run each of these multiple times uh to account for [1:39:25] the non-determinism of your output. Um that is the pass at K concept that I’m going to touch on just towards the end. uh but for now a single run per example gives us a good enough signal uh to tell us whether things were right or wrong. Um and this eval iterate cycle is where the real value lives in evaluation. Uh you get your results, you improve your results, you improve your results and you slowly improve your your app. Uh one of the things that you could do at this point is say why am I as a human doing [1:39:55] this at all? 
What if I took the output of the eval and gave it to Claude Code and said, hey Claude Code, go back to my app and improve it somehow? That is closed-loop evaluation, which we think is very exciting and is definitely going to happen as the models get better: you’ve written the initial version of an app, and then you use evals as the feedback mechanism to your coding agent, which then automatically improves your app without you needing to be involved at all. I’m not doing that here because, again, we’ve all been sitting here and we’re very warm. So I’m [1:40:25] not going to, you know, stretch my welcome any further than I’ve already stretched it. A good question that you probably have is how many samples do you need? I’ve mentioned 50, 100, 200, 400 samples. You don’t have to just eyeball this. You can use math. Say you are aiming for an agent that fails only 5% of the time. With 200 samples, a 3% defect rate will give you [1:40:55] a 95% confidence interval of anywhere between 0.6% and 5.4%. A 3% failure rate sounds good, right? 3% is less than 5%, which is what you were trying to get to. But because of the confidence interval, your actual failure rate could be anywhere from 0.6% to 5.4%. If you double the size of your sample to 400, you narrow your confidence interval to 1.3% to 4.7%, which means that you’re now confidently [1:41:26] below the 5% threshold and you can ship. But to do it, you had to double the number of samples. So that means you had to double all the effort. You had to double the size of your golden data set. You had to double everything. So at some point you need to make a cost-benefit analysis of, like, how accurate do I need this agent to be? How much effort am I willing to put in to get this thing to be 2% more accurate than it used to be? For workshop-scale experiments like [1:41:56] today’s, 12 to 20 examples gets you directional signal.
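The sample-size arithmetic above can be checked with a short sketch. It assumes the normal-approximation (Wald) interval for a proportion, which the talk does not name but which matches the quoted figures; the function name `defect_rate_interval` is made up for this example.

```python
import math
from typing import Tuple


def defect_rate_interval(defect_rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for an observed defect rate.

    z = 1.96 corresponds to a 95% confidence level. This is an assumed method;
    other interval formulas (e.g. Wilson) give slightly different bounds.
    """
    margin = z * math.sqrt(defect_rate * (1 - defect_rate) / n)
    return max(0.0, defect_rate - margin), min(1.0, defect_rate + margin)


for n in (200, 400):
    low, high = defect_rate_interval(0.03, n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
# n=200: 0.6% to 5.4%
# n=400: 1.3% to 4.7%
```

Because the margin shrinks with the square root of the sample count, doubling the samples narrows the interval by only about 30%, which is why the cost-benefit question comes up so quickly.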
For shipping decisions, 200 to 400 examples is a good target. How do you make the cycle systematic when you’re iterating? Where do you invest your effort? Not every change has the same impact. There is a hierarchy. This is the impact hierarchy that I mentioned right at the beginning. The impact hierarchy tells you where to focus first. Data quality fixes have by far the highest impact. If your agent is searching the wrong sources, if your knowledge base has stale data, no amount of prompt [1:42:28] engineering is going to get you there. So you should fix the data first. Once you’re sure that the data you’re giving to your agent is high quality, prompting improvements are the next highest thing to do. Few-shot examples in your prompt, explicit instructions, constraints on what the agent should and shouldn’t do: those are often the highest-ROI changes. And then model selection comes third in the impact hierarchy. Sometimes a more capable model solves problems that prompting can’t, but it also costs [1:42:59] more. So you have to make a trade-off about whether or not that’s worth it. And then hyperparameter tuning, things like temperature, top-p, that kind of thing, is right down at the bottom. It very seldom makes a meaningful difference to the outcomes of your evals. One thing that you can consider doing is writing your evals before you build a feature. If you want your agent to always verify customer identity before processing a refund, for instance, you can write an eval that checks for [1:43:29] that first, and that gives you a capability eval and a hill to climb. This is the same as test-driven development, which is to say that everyone says it’s a good idea and few people actually do it. Eval-driven development in practice is how things like Claude Code evolved. Anthropic built capability evals and then gave Claude Code a hill to climb.
When a new model dropped, they would run the suite and immediately see which of their bets had paid off. They’d [1:43:59] immediately see which of the changes they’d put in in advance had actually helped it do things better and which ones had not. So who can write these evals? Like I said earlier, you should be getting your non-technical stakeholders involved, because they are going to have a much better idea of what good and bad are for the purposes of writing your evals. So product managers, customer success reps, salespeople, they can all contribute to eval tasks and make [1:44:30] your evals better. » Sorry. » Absolutely. Because they can tell you, you know, a simple test would be: this should be present in every single answer, and you can turn that into a code eval. But mostly they’re going to be working in prompts. The other pattern worth knowing about is the data flywheel. The more expert judgment you add, the bigger your golden data set, the better your [1:45:01] evals, and each iteration compounds. So as your eval suite gets more comprehensive, your agents get better and your understanding of failure modes deepens. And what this does is create a differentiated data set that becomes a competitive advantage. Nobody has your evals but you. Nobody but you has this long list of production data and production evals that say these are all the ways the agent can fail. This creates a moat that other people don’t have, one that can help you differentiate your agent against [1:45:32] another agent in the marketplace that is trying to do the same thing. And one last practical benefit of evals is the model adoption advantage. Like I said, new models drop all the time.
If you’ve got a comprehensive set of regression evals, then you’re going to be able to know within a couple of minutes whether this new model makes your agent worse or better, and whether or not you can ship using the new model. So now you’ll all be relieved to know [1:46:03] that we are nearly at the end of this workshop. Thank you all for staying through to the end. I’m very impressed with you all. I’m going to give you now a quick tour of the things that we didn’t cover, so that you know what to Google to go even further than where we went today. One is production monitoring. This is something that our enterprise product puts a lot of emphasis on and that Phoenix does not. Once you’ve shipped an agent to production, you can send a certain percentage of your traffic to an [1:46:34] evaluation suite and be continuously evaluating whether or not your agent is performing well. This can show up drops in model quality, because sometimes those happen without a model change. It can show adversarial attacks, where people have discovered a way to make your agent fail, and testing against production can guard against that. And it can also catch agent drift and model drift. As your use case changes, as your product changes, things that used to [1:47:05] work in your agent will stop working, and continuous production evals can find those. You can also do cost-aware evaluation. Like I said, we used Haiku for our agent and Sonnet for our judge because those are cheap models and they go fast. But in production, you can go further. You can use different models for different types of queries. So, you know, if the query is "what are your hours", that doesn’t need the same horsepower as "analyze the comparative P/E ratios of these five [1:47:35] semiconductor companies". So you can do tiered model selection.
You can use cheap models for simple queries and expensive models in your agent for complex queries. One of the ways you can do this is cost-normalized accuracy, which is a Googleable phrase that I’m dropping in here just so you can Google it. It’s a way of making these trade-offs concrete. It is accuracy divided by cost. So an agent that’s 92% accurate at 2 cents a query might be better value than one that’s 95% accurate at 15 cents a query. And [1:48:07] the eval will tell you whether or not that trade-off is worth it. And then there’s pairwise evaluation. Like I said earlier, one of the things that it’s tempting to do with evals is ask the agent to rate something from one to 10. It’s very bad at doing that. A thing that it’s much better at doing is taking two examples and comparing which one is better. That is pairwise evaluation. You can say, out of these two outputs, which one did better? And it does a much better job of comparing the two because it has two [1:48:37] concrete examples to work with. This is especially useful for A/B testing prompt versions or model upgrades. And then there’s reliability scoring. I mentioned pass@k. Pass@k asks whether the agent can succeed at least once in k tries. And then there’s pass to the power of k, which asks whether it can succeed every time in k tries. As the value of k increases, these two measures of reliability diverge [1:49:08] dramatically. Pass@k approaches 100% and pass to the power of k approaches zero. Which one you care about is going to depend on your use case. A coding assistant that eventually gets it right, that’s great for pass@k. You can just keep trying and trying until it produces something that works.
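Two of the metrics just described reduce to one-line formulas, sketched here. The reliability formulas assume each try succeeds independently with the same probability `p`, which is a simplification not stated in the talk; the 80% per-try success rate and the function names are made up for illustration, while the 92% at 2 cents and 95% at 15 cents figures come from the talk.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent tries."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent tries succeed."""
    return p ** k


def cost_normalized_accuracy(accuracy: float, cost_per_query: float) -> float:
    """Accuracy divided by cost: here, accuracy per dollar spent on a query."""
    return accuracy / cost_per_query


# Hypothetical agent that succeeds on 80% of individual tries.
for k in (1, 5, 10):
    print(f"k={k}: pass@k={pass_at_k(0.8, k):.3f} pass^k={pass_pow_k(0.8, k):.3f}")

# The two agents from the talk: 92% at $0.02 per query vs 95% at $0.15 per query.
cheap = cost_normalized_accuracy(0.92, 0.02)
pricey = cost_normalized_accuracy(0.95, 0.15)
print(f"cheap={cheap:.1f} pricey={pricey:.1f}")
```

At k = 1 the two reliability measures are identical; by k = 10 the first is effectively 100% while the second has dropped to roughly 11%, which is the divergence described above.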
A customer support assistant that gets it wrong every fifth try, on the other hand, is a failure as far as your customers are concerned. So pass to [1:49:39] the power of k is how you would measure a customer service bot. And then there’s the frontier. There are multi-judge systems, where your LLM as a judge can use multiple judges simultaneously to get different opinions and to look up facts. They can verify claims. But that is fundamentally the end of what we’ve covered today. This is the loop: you instrument, you trace, you eval, you human-annotate, you analyze [1:50:11] those annotations, you improve your agent, and then you go back again. Some final tips. One is that you should start small. You don’t have to do all of this at once. Start by reading your traces. 15 minutes of reading real outputs is going to do a lot better than hours of fiddling with your prompt if you haven’t read the traces and haven’t read the explanations. Write one code eval as your very first eval. Check that the output is in the format that [1:50:41] you expected. Check that the string you’re expecting to be there is there. And then slowly build up your eval suite from there. Create capability evals first, and then, as you pass them, turn them into regression evals. Some teams create evals at the very start of development. Some add them once they’re at scale. But the time that you need evals is when vibe checking becomes a bottleneck to improvement. When you find that changing one thing has broken [1:51:11] three other things without you noticing, that is when you need evals. The first time a regression shows up before it reaches your users instead of after, you will have justified the cost of building your evals. So now is the time to go and try it for real. You already have a Phoenix Cloud account. This is the link to the Phoenix docs. And Phoenix itself is open source.
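A first code eval of the kind suggested in the tips above can be a few lines of ordinary Python. This sketch assumes a made-up output format in which the agent includes a line such as "Recommendation: hold"; the pattern, the function name `recommendation_eval`, and the returned dictionary shape are all illustrative choices, not anything prescribed by Phoenix.

```python
import re
from typing import Dict, Union

# Hypothetical format: the agent's answer should include a line like
# "Recommendation: buy", "Recommendation: sell", or "Recommendation: hold".
RECOMMENDATION = re.compile(
    r"^recommendation:\s*(buy|sell|hold)\b", re.IGNORECASE | re.MULTILINE
)


def recommendation_eval(output: str) -> Dict[str, Union[str, float]]:
    """Deterministic check: is the expected string present, in the expected format?"""
    match = RECOMMENDATION.search(output)
    if match is None:
        return {"label": "fail", "score": 0.0, "explanation": "no recommendation line found"}
    return {
        "label": "pass",
        "score": 1.0,
        "explanation": f"found '{match.group(1).lower()}'",
    }


print(recommendation_eval("Revenue grew 12% year over year.\nRecommendation: Hold"))
print(recommendation_eval("Revenue grew 12% year over year."))
```

A check like this costs nothing to run on every trace, which makes it a reasonable candidate to become a regression eval once the agent passes it reliably.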
So if you feel like contributing to Phoenix, that’s great. [1:51:41] And just as a final plug, I should mention that there’s Arize AX. You can do everything that you can do in Phoenix in Arize AX, but there are some things that you can do in Arize AX that you cannot do in Phoenix, and they are all enterprisey things. So if your company is very, very touchy about its data, then you’re going to want things like SOC 2 and other compliance measures, and AX can provide those for you. If you need things like multiple teams interacting, you can do SAML and SSO. If you have a production agent [1:52:11] that is running at billions of rows and billions of traces, we have a technology called AISDB that helps us do that. AX gives you session-aware agent tracing: not just individual agent turns, but a user’s entire session from the time they logged in to the time they logged out. We have an AI assistant called Alex. We have beautiful graphical representations of what your agent is doing. We have metrics, we have dashboards, we have monitoring. It is a significant upgrade in terms of the things that you [1:52:42] can do. But that is basically it. Thank you all for staying all the way to the end. I have some time now for questions, if you would like. If you think of them later, I am seldo.com on Bluesky, and you can get these slides from that URL. Thank you so much for your time and attention. [1:53:12] Do I have any questions now, or is everybody eager to get home? Okay. » Hello. So for code evals and LLM evals, can those be defined on the platform itself, so that they are run by the platform rather than by individual scripts? I might have missed that in the beginning. » No, that’s a great question. Code evals and LLM-as-a-judge evals are run on the client and relayed back [1:53:43] to the server. In AX they can run online on the platform itself.
» Is there any chance of using something like the Anthropic batch APIs, where it uses a much cheaper version, or stuff like that, to parse large volumes of data? » Ask me again afterwards. There’s a gentleman behind you with a question. » Hello. » Oh, hello. » Yes. For context, I work with a [1:54:15] construction AI company, and we look at, like, architectural checks and compliance. One of the issues we have is that we get tons of these checks from architects and we try to scope them out to understand what it is we actually need to build, » right? » And we don’t have a ton of labeled data on that, but what we’ve been starting to do is try to automatically have Claude Code scope it out and then look at consistency. So if we have it scoped out 10 times, is it the same solution every single time? And we’ve been using consistency as almost like an eval for the complexity of the problem. [1:54:45] » We’ve been kind of piecing more of these kinds of things together, calling them meta evals, for when you have a problem you’re still trying to solve. Can you think of anything else like that? Like, you’re creating evals for the process of solving the problem, not just for the solution to the problem. » Yes, absolutely. I mean, we touched on it already with meta-evaluations, right? A situation where you’re using an LLM to judge another LLM’s output immediately becomes an LLM meta-evaluation. You can absolutely use an LLM to judge which of these, you know, [1:55:18] 10 possible agent configurations would have been the better one. That is what I was talking about with multi-agent configurations and multi-LLM configurations.
You can do a closed-loop situation where an agent is coming up with five possible prompt variations and testing all of them at the same time against the eval, to see, without a human getting involved, which of these variations is going to improve things. Does that answer your question? » Yeah, it does. » Okay. [1:55:49] In the front. » Thank you for the talk. It was really great. I had a question on how much evaluation you need to write for a feature, because, especially when you run against live traces, sometimes the evaluation can cost more than the actual feature. And so how do you know when you wrote enough [1:56:19] » evaluation? » That is a really good question. How do you know when you have written an evaluation that is good enough is really what you’re asking, right? The answer is mostly in the cost equations that I was showing earlier. But it’s partly in regression evals versus capability evals. If you have a [1:56:49] suite of a hundred regression evals and one capability eval, then you’re going to be spending an enormous amount of money doing regression testing. And it’s probably the case that you don’t need all 100 of those regression evals. You can throw, you know, 80% of them out and still have a representative sample of regression evals. The one where you don’t want to skimp on cost is the capability eval. So you can downgrade, you can shrink, you can compress your regression evals. And your capability eval is [1:57:20] where you probably want to push the boat out in terms of cost. You’re like, let’s not care about what model we’re using. Let’s make this as expensive as possible, because that is where the agent is actually getting better. » Okay. Thank you. And on live traces, do you run both regression and capability evals, or just regression?
» On live traces there’s no point in running your capability eval, because it’s not changing. Yeah. » You can cut cost here as well. » Yeah. » Okay. Thank you. [1:57:51] » I have no idea. » Yeah, just a question on, like, your actionability template, your custom eval. So like » sorry, can you speak up? » Yeah, sorry. So, a question on the way you’ve defined your custom LLM as a judge. » Yeah. » In your example, you basically have eight separate things: here are four actionable things, here are four non-actionable things. And that’s all being run as one yes or no. There is some advice online from other people that would say, “Oh, no, you should split that up into eight separate single [1:58:22] checks.” So then you get a score between zero and eight, basically, » right? » And you’re running each of those single checks one time as a separate LLM as a judge. Do you think that’s the right way to go, or do you think bundling them a bit more is fine? » It’s a very good question, because it’s kind of arbitrary, right? What I was trying to measure there, this is where you have to get your non-technical stakeholders involved. [1:58:52] Like, what are we trying to measure? We’re trying to measure actionability. Are we trying to measure specifically whether or not it mentioned price-earnings ratios? If that is a specific thing that the stakeholder says is important, and I need to see that every single time, then you should have an eval about that. If the specific thing you’re looking for is a buy, sell, hold recommendation, which is what I said I was looking for, then a P/E ratio is going to help the agent get there, but it’s not the thing I’m looking for. So it absolutely is context dependent, [1:59:23] and it is based on what your stakeholders say is the actual definition of correct, as opposed to a contributor to correct, if you see what I mean.
» Sometimes it’s unavoidable, like it’s kind of an either/or, as long as it does one of these things. » Exactly. » Cheers. » There was another question back there. » Thank you for the talk. It was really good. I have two questions. The first question is, it seems like there are a [1:59:54] lot of non-deterministic factors in the whole evals pipeline. For example, if we have this result, what does it mean? Does it mean the agent prompt is bad, or does it mean the rubric we provide is bad, or, if there’s a human annotation, maybe the human themselves is not reliable? Do you think it makes sense to go through phases, like first you make sure this is reliable, then we go to the next phase, which is to compare, I don’t know, whatever? » Yeah, absolutely. That’s one of the reasons that I recommend building [2:00:26] your evals iteratively. Start with the code eval and make sure that the code eval works all the time. Then build your first LLM as a judge and make sure that one is running the way that you want it to. If you introduce multiple evals at the same time, then you’re going to have the same multi-prompt problem that you had before: your prompt change is going to start changing multiple evals simultaneously, which is not what you want. That’s why you want one capability eval that you’re trying to hill climb at a time, while you’ve got existing trusted evals as [2:00:57] regression evals that you’re expecting not to change. » Thank you.
And the second question is, you also mentioned a lot of versioning, the rubric version or the agent prompt version or something. My question is, for example, when we change the criteria, do we need to run the whole evals against the new rubrics again? Maybe we have like 500, I don’t know, data points in the data set. Do we need to run all of this [2:01:28] again, or would they be considered part of this data set too, like, for example, something we can use later? » That’s an excellent question. That is what experiments are for. Experiments give you a smaller set that you can test against, your set of failures, so you can say, am I getting better at this? So experiments allow you to use a small set and rapidly hill climb. Once you believe that you’ve climbed to the top of the hill, or you’re, you [2:01:59] know, hitting diminishing marginal returns on your hill climb, you should then go back and run against the entire data set, to make sure that you haven’t accidentally overfit or produced a regression that shows up against the entire data set. That’s what experiments are for. You don’t need to run against your entire corpus every single time you make a change. But you should do it periodically, when you think that you’ve reached a stopping point or you’re about to ship. » Thank you so much. » There’s one more in the back and then [2:02:29] I think we’re out of time. » Thank you. Super informative. As someone who’s worked in ML and deep learning research in the past, I thought that the idea of doing closed-loop optimization seems really exciting, and I think it’s one prospect I’m excited about in particular. But I’ve used things like DSPy in the past, and I don’t know if it’s exactly helpful, but I think [2:03:00] Andrej Karpathy’s auto research idea is also somewhat related.
But I haven’t found anything to be particularly amazing beyond getting me to a certain threshold where it’s working okay, and then I take it manually and tweak it from there. » I don’t know. Do you have any thoughts on where that space is headed, or if there’s » particular work that seems promising or interesting to you? » Come to AIE World’s Fair. We are hoping to present something there where we’ve actually made that work. But no, [2:03:32] the answer is it’s very much the frontier right now. Closed-loop, autonomous, self-improving software is something we can see on the horizon. You know, maybe when they release Mythos, suddenly it will automatically work. But it is very difficult to get to work right now, which is one of the reasons I didn’t present it today, because it’s kind of loose right now. » Got it. Thank you. » All right. Thank you so much for sticking around.