
Transcript: The Production AI Playbook: Deploying Agents at Enterprise Scale — Sandipan Bhaumik, Databricks

AI Engineer · 37:05

Source video ID: ObTPqBGsEbA

Transcript

  • 0:07 — [music] » All right. Thank you for joining my session. » [applause] » Thank you, man. I’m Sandy. I’m a technical lead for data and AI at Databricks. Prior to working at Databricks, I worked at Amazon Web Services for 5 years as a principal architect for data and AI. In the past few years, I worked extensively building and scaling
  • 0:39 — data and AI platforms using distributed systems and technology. And in the past couple of years, specifically, I’ve been working with customers trying to figure out what we do with this new AI technology. When I say new AI, AI has been here for a long time, but we all started experimenting with it much more intensively in the past couple of years, right? And I have learned a great many lessons from building demos and taking those demos to production, working with
  • 1:10 — different customers in B2B software industries and then regulated industries like financial services. So, in this session, I want to share a playbook, a framework that I put together from lessons that I have learned working in the trenches, that you can take and apply when you think about how to put your AI systems into production. And I think this session is nicely placed in the afternoon because what you can do now is
  • 1:40 — fit into this framework the knowledge that you’ve gathered attending the different sessions throughout the day and see where it fits in each of the elements in the framework. So, when I started 2 years ago, this is the pattern I noticed in every customer conversation, right? So, everyone wanted to do something with AI. There was immense pressure from the top to do something, to build a demo. And every conversation started with let’s choose the model, right? And it was nobody’s fault because the market
  • 2:11 — was like that. We were talking about models; the models were new technology for us, right? And every conversation started with: shall we use GPT? Shall we use Claude? There was huge debate within organizations. Then you would choose a model, you would build some features, obsess over what features to build for that application. You would build that in a controlled environment, so predictable data sets, limited scenarios, and then it looked great as a demo, and then
  • 2:41 — leadership would get happy, they would sign it off, and they would put it into a production environment. Then, after a few weeks, people would start asking: what the hell is AI doing? Right? Why is it not answering the questions the way we expected it to answer when we were doing the demos? It would result not only in no realization of return on investment, but also in loss of money and effort in building these demos that can never scale to production.
  • 3:15 — Throughout these meetings, I gathered three insights that connect to everything that we are talking about when thinking about taking AI to production. The first one is the observability gap, right? When we use AI and put it into production, if we can’t see what it is actually doing, if we can’t trace every decision that it’s making, it’s no use in production. Second is the evaluation gap. In a lot of these conversations, we were not actually thinking about what is that one thing
  • 3:47 — that we are measuring. Yes, we talk about accuracy, we talk about latency, we talk about groundedness, but we were not defining what is that exact thing that matters to the business, and how can we build a system that can continuously measure that, whether it’s improving or not improving. What is that system that we need to build? And that was the evaluation gap that I noticed. And the third is the governance gap. We were not actually thinking about what happens when AI fails in production. Who’s
  • 4:17 — accountable? Who do I go to when something happens at 3:00 a.m., right? Who needs to own the data assets that feed the AI responses? What happens if AI talks nonsense to a customer, right? What happens then? So, there is no accountability, no governance around it. And these three insights led me to build a framework on how I think AI should be taken to
  • 4:48 — production, and this has been implemented across multiple customer organizations, and I think this is something that you can pick up from here. These are the five pillars, and these are absolutely what you need to think about even before starting a project, right? Then you start building them gradually, preferably in sequence, but in real life, I know that this sequence doesn’t always work. But these are the pillars that you have to know about and think about when you start building. First one is evaluation. Before touching any code, before discussing any models, any
  • 5:18 — features, you have to think about: when we build this system, how do we measure? What does success look like, and what is that system that will help us continuously measure what success looks like for us? Second is how do we trace each and every decision that AI makes. It’s not only important for the performance of the AI system, it is also important for the regulators. In Europe, or in a lot of companies, especially in regulated industries, you cannot even onboard AI into
  • 5:49 — production without having tracing and observability in place. So, this is a must-have. The third is the data foundation, right? I think of data foundation in two ways. One is the question data, so that is basically the data needed for the AI to answer questions that users ask it. So, it could be your pre-training data, post-training data, data that you use APIs to hook onto and get to the answer that the user needs. The other one is the tracking data, related to the
  • 6:19 — tracing data in observability, but when you think from the data foundation and data strategy perspective, this needs to be handled in this pillar because you need a whole data strategy now for tracing data, especially when you run hundreds of agents in your organization. Fourth is orchestration. One agent would work pretty well. You don’t need to think about orchestration. But when you onboard five agents, the complexity increases exponentially, right? You will have multiple coordination patterns
  • 6:49 — between these agents, they will need to talk to each other in multiple different ways, they will need to wait for each other’s responses, there’s a lot of complexity that comes in. And that’s where orchestration patterns and thinking about how you will orchestrate your agents in a particular system becomes really important. Fifth is governance. This is where you think about what happens when something fails. Who’s accountable? How do we govern data? How do we secure it? How do we secure our systems? How do we make sure that no one injects into our agent and leads
  • 7:21 — to, you know, misbehavior, right? Or loss of reputation. So, in the rest of the session, I will dive a bit deeper into each of these pillars and tell you how you can think about them when you start working with them, right? The first one is evaluation. Evaluation is basically the specification for your AI system. You define success. As I mentioned, it’s not just talking about accuracy. You have to define it with numbers, like what accuracy is good for your business use case, right?
  • 7:51 — Define it in numbers. Define what kind of false positives you can handle. What should the deflection rate be? So, this is an example from a retail banking chatbot, where when you implement a chatbot with an AI agent, one of the main goals is to deflect simple queries to the agent so that a human agent doesn’t need to deal with them, right? And so, you need to track those queries and track those numbers and put that system in place. Second is building those test
  • 8:23 — cases, like the evaluation data set. You’ve heard about golden data sets in evaluation. Talk with the domain experts and find out what is actually happening in real life on the ground. Like what answer would a human support agent give to a customer on a particular question. Collect that information. What happens in gray areas, in edge cases, like what happens when a human sees a customer asking a confusing question, right? Collect those into a data set, and then automate your AI testing, right? So, you put a question
  • 8:54 — to AI, it answers, you take that answer, compare it against the test set, and automate this whole pipeline so that when you put AI in production, that pipeline can actually take live responses and evaluate them against the test data set that you’re building, and then give you the result in terms of how AI is performing against those numbers and the goals that you’ve defined. When we talk about evaluation, there are three main layers that I see appear across organizations, and this is
  • 9:24 — an architectural decision that you need to make when you build these evaluation systems, right? The first layer is deterministic. This is the easy stuff, like checking formats, checking email formats, phone formats, the regular expression things that we have already been doing with our coding systems, right? The other is, you know, you could use classic ML models for named entity recognition, for intent classification, for understanding what is first name, last name, PII detection, etc.
  • 9:54 — So, this is easy stuff, cheap stuff, you should get it out of the way. We have already been doing this for years. The second layer is the non-deterministic semantic stuff, all right? This is where groundedness comes in. This is where we implement techniques like LLM judges. We all know what LLM judges are, right? Everyone? Okay, I see a lot of nods. So, this is a pretty simple version of how a prompt would look for an LLM as a judge.
  • 10:24 — So, with LLM as a judge, you use a separate LLM from the primary LLM to judge the response of the primary model. And when you do that, you tell the secondary, the judge model, how it should judge the primary model’s output. So, it could be around safety, groundedness, relevance of the answer, etc., right? Again, that can feed from the evaluation data set that you have created, right? To look at what the expected answers are, and then it can
  • 10:55 — check against that. This is a sample prompt on how these things work, but I’m sure you’ve attended some of these sessions where you’ve seen vendors doing this automatically at scale. For example, at Databricks, in MLflow you’ll find automatic LLM as judge, where you can create custom LLM judges that run automatically on traces. That’s your second layer. The third layer is behavioral, right? This is where you think about tool calls, like are our agents calling the right tool? Are they getting into loops? So,
  • 11:25 — for example, in the first layer you can have a user ask a question, “What is my account balance?” And you could go and check that, okay, there is no deterministic problem with it. In the semantic layer, the agent answered right: “Okay, your account balance is this many dollars.” And that was right, but when you go into the behavioral checks, you will see that the agent actually made three calls to the database to find that answer. Right? And that is because it was making duplicate calls for whatever
  • 11:55 — reason. Calls failed, calls did not work, it went and retried and stuff like that. Now, three API calls in a demo environment is fine, but in production, when you get thousands of queries from users every day, and there’s duplication in API calls, that’s an expensive operation. And that’s where you need to think about behavioral evaluation. And this layer is very, very important. I see a lot of organizations, a lot of teams miss it when talking about this.
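As a rough illustration of the deterministic and behavioral layers described above, here is a minimal Python sketch. The trace shape (a list of dicts with `tool` and `args` keys), the tool name `get_account_balance`, and the email pattern are assumptions made for this example; they are not taken from the talk or from any particular tracing library.

```python
import re
from collections import Counter

# Deliberately loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_email(value: str) -> bool:
    """Deterministic layer: cheap format check with a regular expression."""
    return EMAIL_RE.match(value) is not None


def find_duplicate_tool_calls(tool_calls: list[dict]) -> dict[tuple, int]:
    """Behavioral layer: count tool calls repeated with identical arguments.

    Assumes each call is a dict with a "tool" name and an "args" dict whose
    values are hashable (a hypothetical trace shape).
    """
    counts = Counter(
        (call["tool"], tuple(sorted(call["args"].items())))
        for call in tool_calls
    )
    return {key: n for key, n in counts.items() if n > 1}


# One trace for "What is my account balance?" with the same lookup made three times.
trace = [
    {"tool": "get_account_balance", "args": {"account_id": "A-1"}},
    {"tool": "get_account_balance", "args": {"account_id": "A-1"}},
    {"tool": "get_account_balance", "args": {"account_id": "A-1"}},
]
duplicates = find_duplicate_tool_calls(trace)
```

Here `duplicates` flags the balance lookup as having been issued three times, even though the final answer to the user would pass the deterministic and semantic checks.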
  • 12:25 — The second pillar is observability, right? And in this pillar, what we’re talking about is tracing, right? So, you collect all the decisions that an agent is making. So, I want to explain this with a scenario here, right? And this is a scenario from an actual project I worked on with a retail banking chatbot. Now, obviously, if you’ve seen tracing data, it’s not as beautiful as this slide, right? So, I’ve simplified it and made it beautiful for this slide. But what this slide says is basically, a user comes in and says, “You know, I
  • 12:55 — have been charged an overdraft fee, can you waive it for me?” Because the customer thinks that the fee is not legitimate. So, the agent does an intent classification, and you know about this because you’ve enabled observability, you’re capturing traces, and you’re actually seeing what the agent is doing, right? What AI is doing. Intent classification is done, it took this many seconds, this was the confidence score. Then it goes and connects to the customer’s account, maybe in a customer database: it calls an API, connects to the customer database, gets the account details.
  • 13:25 — It retrieves policy documents. It checks from a RAG vector database what the policy around overdraft is, right? Is what the customer is claiming legitimate? So, it checks the policy documents. Then it goes and does reasoning on what should be the response to the customer, and then it does some final guardrail checks, and responds to the customer. Now, if you did not set up a system that helps you visualize all of these
  • 13:55 — traces, when the customer comes to you and raises a dispute, you have no way to check what the AI did. Right? You have nowhere to go, and you end up saying, “I don’t have any idea. Let’s give the customer a discount or something, and make them happy.” So, this is why you need this, and this is why regulators are basically mandating it, because otherwise there’s no production system if you cannot do this kind of stuff. So, this is where you detect the
  • 14:26 — example that I gave around duplicate API calls. This is where you start detecting this stuff. So, when you enable these traces, you can actually go and see duplicate calls, and then take relevant actions based on that. Not only that, you can actually do that in online monitoring. So, when it’s happening in production, you can set up online monitoring, and at that point, if it is doing duplicate calls, you can apply fallback strategies. Or even if it is doing a call that is failing, you can actually go and apply a strategy where it will say, “Okay, go and retry three times, not more than three times.
  • 14:57 — If it is more than three times, then report somewhere, or pass it to a human to take some action.” The third pillar, the most important pillar in my opinion, is the data foundation, right? In my typical projects, I spend 60% of my time here, and I see a lot of organizations spending a lot of time here, because no one expected agents to come suddenly into the market and start querying data.
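The bounded-retry fallback mentioned a moment ago (retry a failing call at most three times, then report it or hand it to a human) could be sketched roughly as follows. The function name `call_with_fallback` and the `escalate` callback are hypothetical names chosen for this example; how escalation actually happens (a ticket, an alert, a human queue) is left open.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # "retry three times, not more than three times"


def call_with_fallback(
    operation: Callable[[], str],
    escalate: Callable[[Exception], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Try the operation up to max_attempts times, then escalate and give up."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    # All attempts failed: report the last error instead of retrying forever.
    escalate(last_error)
    return None


# Usage with a call that always fails, to show the escalation path.
attempts = []
escalations = []


def always_fails() -> str:
    attempts.append(1)
    raise ConnectionError("database unavailable")


result = call_with_fallback(always_fails, escalations.append)
```

In this run the operation is attempted three times, `result` is `None`, and one error lands in `escalations` for a human or an alerting system to pick up.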
  • 15:27 — Data was always built for humans, and humans are always forgiving. You find the wrong data in a report, you just go and ask someone to correct it. Agents don’t forgive you, right? Agents will go, find it wrong, they’ll give you the wrong answer confidently. Right? And you wouldn’t know what’s happening. And this is why data quality, setting the right data strategy, has become so important for enterprises now. I divide it into two sections. One is the question data, as I was explaining, the data needed for actually serving the AI’s outcome.
  • 15:59 — And the other one is the tracking data. This is the observability data, the tracing data I was talking about earlier. You need a proper plan on how you collect this tracing data, and how you serve it to auditors, to regulators, to do online monitoring, to run LLM judges on the traces, and everything else, right? So, it needs a proper strategy on how you structure the schema and everything around the tracing data. On Databricks, we
  • 16:31 — create a robust data foundation for our customers using some of the technologies that we provide. If you don’t know Databricks, Databricks has been built on open-source technologies like Apache Spark, MLflow, and Delta Lake. We provide a bunch of capabilities on top of them. So, the blue layer at the bottom is basically your cloud storage. Databricks works on the three major clouds, Google, AWS, Azure. Okay. I thought it was for me.
  • 17:01 — So, once you store raw data on your cloud storage, we bring in a layer called the Delta Lake layer, which basically brings database-like properties on top of your raw data. So, you have got images, text files, video files, or whatever. We help you create this table-like structure on top of it using manifest files, right? And we help you to incrementally load data, do all of those data management
  • 17:32 — tasks in a structured way. On top of that, we bring in Unity Catalog, which is a data catalog. With Unity Catalog, you can centrally apply permissions on top of the data. You can share the data using Delta Sharing, but also, with Unity Catalog, you can enable discovery, ownership, and metadata tagging capabilities at the catalog level. What that means is, when you apply a table description,
  • 18:03 — column description, tag columns like PII columns with metadata, it becomes really easy for AI to then get that context when it queries these tables on top of Unity Catalog. So, everything is governed at one layer through Unity Catalog, and on top of that we bring in different applications. So, whether it’s AI through Mosaic AI, to build LLMs, tune LLMs, or even build AI applications, we bring in data warehousing capabilities, BI capabilities, and
  • 18:33 — some of the other text-to-SQL capabilities. We have got Genie, which lets you use natural language to do SQL querying, etc. And one application of that in the observability and tracking data, as I was showing, is this. So, basically, think about when I was talking about the tracking data strategy. Organizations, especially enterprises, will not be running AI in just one framework. They’ll be using different frameworks, CrewAI, LangChain, etc. They’ll be using different cloud platforms.
  • 19:04 — And once they do that, you need a centralized layer for collecting that tracing data, so that you can serve several use cases on the right-hand side. So, whether it’s for operational dashboarding, for first-line support, a lot of these first-line-of-defense teams need health monitoring dashboards, right? These teams can also write SQL using Databricks Genie to do text-to-SQL. But they can also build Databricks apps using coding agents
  • 19:34 — to create common workspaces or custom UIs that customers might need for different use cases. And then we’ve got Agent Bricks and MLflow, which give you out-of-the-box LLM judges and proactive monitoring. The idea is, no matter where your AI runs, you can create this kind of strategy, bringing data into one common place and serving different teams from one shared location.
  • 20:05 — The fourth pillar is multi-agent orchestration patterns. As I said, one agent is good, multiple agents increase complexity. That’s where you start thinking about, okay, what pattern is good for my use case. The first one I describe here is the orchestrator-worker pattern, where you have one orchestrator which controls all the work from a centralized plane, and then distributes this work to different agents based on their specialized skills. And then every request goes through the orchestrator, so you have got central
  • 20:36 — control. If something goes wrong, you can go to the orchestrator logs and look into them and see what has happened. Right? So, that’s the orchestrator pattern. Then there is the choreography pattern, where each agent is independent, they’re autonomous, they don’t depend on an orchestrator. All of them talk to a message bus and they listen to the events that they are interested in. Right? So, think about agents that are independent of each other, right? They can run in parallel. So, they are not sequential, one agent is not dependent on another. So, they run
  • 21:07 — in parallel, they listen to the message bus for the events that they are interested in. Maybe it’s a trigger for, let’s say, a mortgage application, and one of the agents looks at customer details, right? The other agent looks at approval details and everything else, right? They can work in parallel, and the advantage it brings you is that latency is reduced because they are not dependent on an orchestrator and sending messages back and forth. Right? So, this is the choreography pattern. And the third one is human in the loop, which is
  • 21:39 — where, when an agent crosses a threshold or falls below a confidence threshold, a human is called into the workflow to look into what the agent has done and then take action based on that. I have done a deep-dive video on multi-agent orchestration patterns for the online track of this conference. It’s already on YouTube, so you can look into it. I talk about the real implications of when you think
  • 22:09 — about multi-agent patterns. One is state management, the other is fault tolerance, like what happens when things fail, how do you manage them? I talk about different patterns. And then I talk about how you think about scaling them in large enterprises. Pillar five is governance, right? Now, here I’m not talking about data governance at all. That’s a given, we need that, right? From the AI perspective, what are we thinking about? Regulatory, right? Audit trails: have we got the trail of every action, every
  • 22:39 — user connection, every request, everything that happens in the system? Are we capturing everything? Are we doing pre-validation of personal information? Are we using named entity recognition? The easy stuff, the regex and all of those things, right? In our example, the work that I was doing with the customer that I mentioned, we detected 47 PII breaches during the testing phase by applying this layer. So, that’s really important. Next is prompt versioning. You have to treat prompt versioning as change
  • 23:10 — management in an enterprise-grade solution. It cannot be just a change to a prompt and a commit to Git. It has to go through proper change management processes as you do with code. So, basically treating prompts as code. Then there is model change management. So, as models change, as the model providers upgrade these models, you have to have a system to understand whether that upgraded model will be good for your use case, for your data. Right? Model providers will publish evaluation results on benchmark
  • 23:41 — boards, but those are not really useful when you put them in your context, in your enterprise. So, that’s where these evaluation data sets come in handy, where you try these different models on this evaluation data set and try to understand which one performs better. And that management needs to be done because from a risk perspective, you cannot really rely on one single model. You have to have the flexibility to switch to different models and also test them on your own data. That management needs to be done.
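To make the model-comparison idea concrete, here is a minimal Python sketch that scores candidate models against a small evaluation data set. Everything in it is illustrative: the questions and answers are placeholders, `model_a` and `model_b` are stub functions standing in for real model calls, and exact match after normalization is a deliberately crude scoring rule (a real setup would more likely use an LLM judge or a domain-specific metric).

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def accuracy(model: Callable[[str], str], golden_set: list[dict]) -> float:
    """Fraction of golden cases where the model's answer matches the expected one."""
    correct = sum(
        normalize(model(case["question"])) == normalize(case["expected"])
        for case in golden_set
    )
    return correct / len(golden_set)


# Placeholder evaluation data set; real cases would come from domain experts.
golden_set = [
    {"question": "q1", "expected": "answer one"},
    {"question": "q2", "expected": "answer two"},
    {"question": "q3", "expected": "answer three"},
    {"question": "q4", "expected": "answer four"},
]


def model_a(question: str) -> str:
    """Stub standing in for a call to one candidate model."""
    canned = {"q1": "Answer one", "q2": "answer two",
              "q3": "answer  three", "q4": "something else"}
    return canned.get(question, "")


def model_b(question: str) -> str:
    """Stub standing in for a call to another candidate model."""
    return {"q1": "answer one", "q2": "wrong"}.get(question, "")


candidates = {"model_a": model_a, "model_b": model_b}
scores = {name: accuracy(fn, golden_set) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
```

Because the harness only depends on a callable that maps a question to an answer, rerunning it when a provider upgrades a model, or when a different provider is considered, needs no change to the data set.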
  • 24:12 — In Databricks, we have taken all of these points, these pillars that I’ve been talking about, into Agent Bricks. We are building Agent Bricks to make all of these operations available out of the box for you, so that it’s easy to implement production-grade AI applications in your enterprises. So, I wanted to quickly touch upon a case study, just to give you a flavor of how these things go, right?
  • 24:42 — So, when I was working with this client, they were building a retail banking chatbot, about 18 months ago. The problem they wanted to solve is they got around 20,000-odd calls per month from customers on their chatbot. They saw that 60% of them were simple queries, what is my account balance, what do I do with my overdraft, and all of that stuff, that can be answered simply. So, they wanted to reduce the
  • 25:13 — reliance on human agents for those answers. So, they identified those queries and they wanted to automate them. Right? They spent around 85K in 6 months doing a POC which did not succeed. When we got involved, we found those insights that I was talking about: no one knew why things were failing when it was in production. No one could actually measure why it was not succeeding, and no one could actually understand who is accountable
  • 25:44 — for what when things go wrong. Right? So, the goal we set for them is that the AI agent handles 60% of user queries, right? Which were the simple user queries, and then a way to identify and track them. The key difference in this project when we did it is that we selected the model in week seven of an eight-week POC. Right? And this is how it turned out. For weeks one and two, we built the evaluation layer. We collected 200 cases of their actual human agents
  • 26:15 — answering their customers on simple queries, to understand how they were responding to them. We created that database. Then we defined the success metrics. What does success look like to you? So, out of let’s say 100 queries, you need the 60% of the queries that are simple queries to be handled by the agent, right? They needed some sort of accuracy. So, it was around an 85% accuracy target. They needed latency, all of the operational targets that you need. They were there.
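Success defined in numbers, as described above, can be written down as a small, checkable specification. This Python sketch uses the two targets quoted in the talk (60% of queries handled by the agent, around 85% accuracy); the names `Targets` and `check_targets`, the sample counts, and the choice to measure accuracy over agent-handled queries are assumptions for illustration. The latency target is omitted because the talk does not give a number for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Numeric success criteria agreed with the business before any build work."""
    min_deflection_rate: float = 0.60  # share of queries handled by the AI agent
    min_accuracy: float = 0.85         # share of agent answers judged correct


def check_targets(
    total_queries: int,
    handled_by_agent: int,
    correct_answers: int,
    targets: Targets = Targets(),
) -> dict:
    """Compare measured rates with the agreed targets."""
    deflection_rate = handled_by_agent / total_queries
    # Assumption: accuracy is measured over the queries the agent handled.
    accuracy = correct_answers / handled_by_agent if handled_by_agent else 0.0
    return {
        "deflection_rate": deflection_rate,
        "accuracy": accuracy,
        "deflection_ok": deflection_rate >= targets.min_deflection_rate,
        "accuracy_ok": accuracy >= targets.min_accuracy,
    }


# Made-up measurement: 100 queries, 62 handled by the agent, 50 of those correct.
report = check_targets(total_queries=100, handled_by_agent=62, correct_answers=50)
```

With these made-up counts the deflection target is met but the accuracy target is not, which is exactly the kind of signal the targets are meant to surface.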
  • 26:46 — Then we created this automated evaluation pipeline for them. And what I mean by that is an automated system where you can capture a user’s question and the AI agent’s response. You take that, compare that against your evaluation data set. You rate that, and if the rating is below a certain threshold, you get it checked by a human. And if something goes wrong, you make sure that you find the solution. So, it could be a change to the prompt, it could be a change to a tool calling system or something else. Once you have done that, you add
  • 27:18 — that test case to the test data set, so that when it happens next time, the test cases catch it. So, the summary of that story is that your evaluation data set is a living system. You start with 200, maybe, there is no correct number here, but once you start, as you start building in production, this is a living system. This will keep growing. And the bigger it grows, the better your system will be. In the second week, we thought about the foundational layer, right? So,
  • 27:48 — the question data: we thought that, okay, if you have to call the database, have you got the API connections right? Have you got a system in place that can trace the API connections? Are those secure, right? We were not talking about MCP at that time. Right? It was just direct API calls to the database to run queries. Have you got the distributed storage? Are you collecting traces? And this is where, when we started testing after building these systems, we could catch those duplicate API calls,
  • 28:19 — right? We could catch why customer satisfaction was dropping and stuff like that. And then in weeks seven to eight, we started talking about models. Now that we had the evaluation data set, we could run different models on that data set to see the responses, compare them against the expected responses, and calculate a number on the accuracy, right? That helped us to decide which model to use. Now, that decision didn’t take long, right? As I explained in the introduction, we used to spend weeks
  • 28:49 — debating on which model to use, but when you take the other approach, you can actually do that very quickly. So, once that was done, we stitched together everything that I was talking about around observability, evaluation, the layers of evaluation. Once we had that system that can make AI visible, measurable, and accountable, that’s when we started launching it to production. And so, this is
  • 29:19 — the result six weeks post launch. We calculated the operational metrics, of course, the accuracy, the deflection rate, the response time, the customer CSAT. But what’s important here is that within a few weeks, there was a problem. One of the things that happened was that the bank changed some interest-rate-related policies. So, when they changed the policy, they actually sent emails to customers or notifications in the
  • 29:49 — mobile banking app about the policy change. But when the customers came and queried the chatbot with further questions, they couldn’t get the right answers and they were putting thumbs down on the answers. So they were giving this feedback, right? So feedback scores decreased. The problem with this kind of situation, if you do not have this measurement system, is that you cannot actually know what’s happening. But because we had the measurement system in place, the drop in CSAT was detected, right? Because we were getting negative
  • 30:19 — feedback from customers. We could actually look into the traced decisions and see that the agent was looking at a policy document that was outdated. So the new policy document was not updated in the vector database. The embeddings did not come through. Because they did not come through, it was giving stale answers. And that’s when we went and fixed that. But it was all possible because we built those systems that led us to detect this.
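One way to catch this kind of staleness earlier is a periodic check that compares document versions in the source of truth with the versions that were actually embedded into the vector index. The talk does not describe such a check; the sketch below is only an illustration, and the version maps, document IDs, and function name are hypothetical.

```python
def find_stale_documents(
    source_versions: dict[str, int],
    indexed_versions: dict[str, int],
) -> list[str]:
    """Return IDs of documents whose indexed version lags the source version.

    Documents missing from the index are treated as stale as well.
    Assumes version numbers are non-negative integers.
    """
    return sorted(
        doc_id
        for doc_id, version in source_versions.items()
        if indexed_versions.get(doc_id, -1) < version
    )


# Made-up example: the interest rate policy was updated to version 5,
# but the vector index still holds embeddings for version 4.
source_versions = {"overdraft_policy": 3, "interest_rate_policy": 5, "fees": 1}
indexed_versions = {"overdraft_policy": 3, "interest_rate_policy": 4}
stale = find_stale_documents(source_versions, indexed_versions)
```

Here `stale` lists both the outdated interest rate policy and the `fees` document that never reached the index, so re-embedding can be triggered before customers start giving thumbs down.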
  • 30:53 — Before you go, generally in these sessions I share different artifacts that you can take away. I have a QR code at the end for you to download and you will find multiple artifacts. One of the important artifacts that I want to talk about is the production incident playbook. This is something that a lot of us tend to miss when we work on AI projects. And this playbook is basically a definition of what needs to happen when things fail in production. First, you detect using your eval dashboard.
  • 31:23 — Then you diagnose using your tracing as I explained. Then you contain. So basically, you are versioning your prompts. If there is a problem with the prompt, you take that prompt out, right? And start the changes, deflect it to a human. Or, in my multi-agent orchestration video, I’ve talked about multiple fault tolerance and failure recovery patterns, the saga pattern, compensation pattern, and circuit breaker pattern, that you can look at in the video. I’ve explained in detail how you can handle them.
  • 31:55 — And then you use the test case library to fix. So you look into the LLM-as-judge reports, you look into your evaluation data set reports. Then you fix your problem. Once you fix your problem, you put those test cases in your data set, right? And create that eval suite that is a living system that will keep growing. And you keep improving your AI system based on that, right? But this playbook needs to be in place. When it runs in production, you will need to integrate it with your ITSM system so that it alerts the right person at the
  • 32:26 — right time. You know, a lot of these organizations would have existing ITSM systems, right? Which are used for alerting and, you know, making sure that the downstream systems don’t get affected, etc. So once you have this in place, you can go and stitch it together with other systems. So what can you do tomorrow, right? If you have a project in mind, start with defining success. Success not in the technical sense, but in the business sense. What it means
  • 32:56 — for the business, right? Come up with a few examples of what good answers look like. And create a data set of that. And then build that pipeline using simple Python code. See if you can automate that so that when you run the AI and get a response, you can compare the answer against that data set, and then that can be delivered to the customer.
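A first version of that "simple Python" pipeline could be as small as the sketch below. It assumes a golden data set of questions with phrases a good answer must contain, and it scores by plain substring matching; the field names (`id`, `question`, `must_include`) and the `agent` callable are hypothetical, and a real setup might swap the scoring for an LLM-as-judge step.

```python
from typing import Callable, Dict, List


def score_answer(answer: str, must_include: List[str]) -> float:
    """Fraction of required phrases that appear in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for phrase in must_include if phrase.lower() in text)
    return hits / len(must_include)


def run_evals(agent: Callable[[str], str], golden_set: List[Dict], threshold: float = 1.0) -> List[Dict]:
    """Run every golden case through the agent and record pass/fail."""
    results = []
    for case in golden_set:
        answer = agent(case["question"])
        score = score_answer(answer, case["must_include"])
        results.append({"id": case["id"], "score": score, "passed": score >= threshold})
    return results


def pass_rate(results: List[Dict]) -> float:
    """Share of cases that passed, suitable for a dashboard or a CI gate."""
    return sum(1 for r in results if r["passed"]) / len(results)
```

Starting with a handful of cases agreed with the business side is enough; the same functions keep working as the data set grows.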
  • 33:26 — Now, these are three lessons that I have learned while doing these things, ones that are easy to miss. The test case library, as I explained, is a growing system. It will grow over time. And because it grows over time, you need some sort of governance around it. You need an owner, right? You need to figure out which test cases relate to what kind of problem, so that whenever you go back to it, you can relate your answers to those sorts of problems, right? If it is security, if it is login, you can say
  • 33:57 — that the agent did not ask for login credentials when the customer asked the question. And all those kinds of issues can be put under a security category within that data set. So categorize the rows in your data set so that you can pick up what changed and compare against them. The second is prompt versioning. Now, when you start versioning prompts using Git, you know, we all know that Git commit messages tend to be simple. But you have to put governance around what kind of commit messages you are putting in when
  • 34:28 — you’re changing these prompts, because you need to understand when a prompt was changed and for exactly what reason it was changed, right? What was the failure that caused this prompt to be changed? What kind of failure would it address, and what would it correct in the next version? That needs to be documented. Otherwise, when you go back and look into prompt versioning, look at the different versions, and you cannot trace why those changes were made, it becomes difficult to track what’s happening. The third, the layer three evals, right?
  • 34:58 — So the behavioral evals that I was talking about, around tool calls and stuff like that, can be really expensive as you grow your eval data set. So when you have a wrong tool call, for example, and you want to correct that system, when you correct it and run it against the eval data set, let’s say you have got 300, 400, 500 rows in the data set, you have to run it against all of them. And if you do all the testing again and again and again, that can cost you a lot of money. So you have to put some governance around that.
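One way to express that governance in code, along the lines of the subset-on-change, full-run-on-merge approach the speaker goes on to describe, is sketched below. The `select_eval_cases` function, the `category` field, and the per-category sample size are assumptions made for this illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def select_eval_cases(cases: List[Dict], branch: str, per_category: int = 2, seed: int = 0) -> List[Dict]:
    """Pick which eval cases to run for a given branch.

    On the main branch every case runs. On any other branch, a small
    sample is drawn from each category so that every kind of failure is
    still covered at a fraction of the cost.
    """
    if branch == "main":
        return list(cases)

    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    rng = random.Random(seed)  # fixed seed keeps the subset stable between runs
    subset = []
    for category in sorted(by_category):
        group = by_category[category]
        subset.extend(rng.sample(group, min(per_category, len(group))))
    return subset
```

Sampling per category is what ties this back to the categorized test case library: a cheap run still touches security, tool-call, and other categories rather than a random slice that might miss one.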
  • 35:29 — So for example, in your continuous integration pipeline, when you do the prompt change, you can put some checks around selecting just a small subset of the eval data set to do the testing. And you only do the full test when you merge to the main branch. So you can put these kinds of decisions in place so that you can reduce the cost of, you know, expensive evals. If you scan this QR code, it’ll take you
  • 36:00 — to a Google Drive link where I have put some examples of what these templates look like and what an evaluation checklist should look like. I’ve given you some guidance on setting up tracing with open source technologies so that you can quickly set up some tracing and start testing in the test environment before you decide on what kind of tools you want to use. Thank you very much for listening to me. This QR code will take you to my
  • 36:30 — LinkedIn profile. I have a newsletter where I share these kinds of topics every week. So if you’re interested, you can join. It’s free. I basically share what I learn in the field working with customers, right? So it might be useful for you. Thank you very much. » [applause]