
Transcript: Why building eval platforms is hard — Phil Hetzel, Braintrust

0:14
All right, it's 11:15. We're going to go ahead and get started. Before we do, everyone say evals. Evals. I was telling my colleague Rose, who's at the door, that I was an adjunct professor for a number of years, and the first year that I did it I thought I was going to have this full class of 130 people every single week, eager to learn. Then as the weeks went on, 130 became 60, became 30, became
0:44
10. So I always tell myself whenever I give a talk that only about four or five people are going to show up, but I'm going to be really excited to teach those four or five. Today is a real blessing because we have a packed house here. Everyone's excited to learn about evals, and I am excited to teach it. Here's what we're going to be talking about today. I'll give you a little bit of intro about myself and the company that I work for, an overview
1:15
of the problem statement. We'll go into the different stages people go through when building eval platforms, and after that we'll talk about where, at least in my opinion, I think eval platforms are going to go. But yeah, this is me. My name is Phil Hetzel. I lead solutions engineering at Braintrust; I'll go into what Braintrust is in a second. Solutions engineering basically means I'm the person, and
1:46
my team are the people, who make sure customers are getting the most value out of our platform, as quickly as possible. So I'm fortunate because, across all of our customers, I see what the state of the art is in both evals and agent observability. Prior to Braintrust, I spent 12 years in consulting and systems implementation. I worked for KPMG for four years, and I worked for a company called Slalom Consulting for eight years,
2:17
where I led the global Databricks business unit. And I noticed, as I was helping my clients with those implementations, that they were so good at generating these generative AI proofs of concept, and none of them were getting to production. I wanted to be helpful in making sure those PoCs could get to production. So I actually started using Braintrust because I knew it helped in this space. I started
2:48
using it as a user, and I liked the platform so much that I applied for a job, and I've been here for about a year. Outside of work, I like to play chess, but I'm very bad at it. And I like to spend time with my wife and my dachshund. The dachshund is named Pistol Pete, and he's pictured; he's the one in brown. He's not the one in black. The one in black is me. Has anyone heard of Braintrust before? Anyone? Couple
3:20
hands. How many people have heard about Braintrust for the first time this week? Really? Okay, great, wonderful. Braintrust, just as a reminder: I think of us as an agent quality platform, and there are a lot of things that can go into quality. The way we get to agent quality is through two main pillars, evals and observability, which we think of as
3:50
really similar problems to solve. Evals are what you do with your agent before it gets to production, as you're experimenting, so that you can become confident in your agent. Observability is really similar, but you're already in production: your agent is in front of real usage from real users, and you want to be confident (remain confident, I should say) that your agent is performing the way you thought it would when you were building it. So that's
4:22
Braintrust. I was specifically told not to make this a sales pitch, so that's really the last Braintrust slide you'll get today, although of course I'm very happy to answer questions about our company this week. But mainly I wanted to talk more conceptually about how people start to mature and build these platforms, spoken from a place where we have a lot of experience in the space. First, why evals are important. Evals
4:52
are important because, and this sounds obvious, LLMs have extreme variability. We love LLMs because they're highly variable; there are so many different types of problems that LLMs can reason through and solve. That's why we're so attracted to them as a technology. Agents, of course, use LLMs as the brain of the agent, and agents are becoming the norm in how customers are
5:22
interacting with companies; people expect an agentic experience now. So if you combine both of those things, you really need to be confident in how your agent is going to perform once it's in production. Without that, you can potentially incur a great deal of risk, from both a brand perspective and a compliance perspective, and even more from a cost and
5:53
maintenance and systems perspective. So we want to avoid all of those things happening and make sure that our customers are having a great experience and that our agents are acting the way we thought they would act. How many people are doing evals right now, but it's just on a Google Sheet or some spreadsheet? There's no shame in that, my friend. Raise that hand high. That's great. And I
6:23
think that's great. Just making that step is really important; it's an acknowledgement of the problem space. And a lot of folks will come to us and say, "Well, I don't really understand Braintrust, because all I need is a way to loop through my agent with a couple of different inputs and display some handwritten notes and scores about that agent." So the things I mentioned there are three things: some way to execute your agent, some UI,
6:56
sometimes it's as simple as a spreadsheet, to show those outputs and scores, and then also a way to gather input examples. What I mean by an input example is the thing that can initiate a run of an agent, the thing that can invoke an agent, whatever information is necessary for that. It would be a really short presentation if this were all there was to evals; I would thank you for your time and walk out of the room. But that's not what you're here for. There is a whole other part of the
7:26
iceberg. It's way more complicated than that. There are a lot of things you end up having to build when you're really serious about evals. We're not going to talk about every single one of these things today, but we will touch on many of them. And of course, if there's anything here that I don't cover that you're interested in, I'll leave some time for questions. I also see a lot of phones up, so I'll pause for iceberg pictures.
7:58
A couple of things while that's happening. Why is this a complicated problem? We already talked a little bit about how the underlying technology is quite complex; LLMs are not a superficial engine. But building these agents is also a multi-persona problem. It's not just something that engineers do in isolation. It's something where you need engineers, whether they're product engineers, AI engineers, or systems engineers, to get the thing
8:29
running, plus the subject matter experts who have the domain knowledge. All of these people need to be involved. And then lastly, evals themselves become a systems problem; that'll be the last thing we touch on today. So, what are the different stages of building an eval platform? My friend over there who raised his hand proudly about starting out in a spreadsheet: this is a great place to start. The most important thing is that you just get started. So, you've got a spreadsheet
8:59
and you've got a for loop. You've got a bunch of input examples that you can iterate through, and you have a way to execute your agent, so you can see, every time you tweak your agent, how the outputs differ over time. This is a great place to start because there is no barrier to entry; everyone has some way to access some type of spreadsheet technology. But the returns diminish, for a couple of reasons.
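To make that first stage concrete, a minimal sketch might look like the following, where `run_agent` is a hypothetical placeholder for however you actually invoke your agent and the exact-match scorer is deliberately naive:

```python
import csv

def run_agent(input_text: str) -> str:
    """Hypothetical placeholder: replace with whatever actually invokes your agent."""
    return f"(stub output for) {input_text}"

# Input examples: whatever is needed to initiate one run of the agent,
# plus an expected answer for a naive score.
examples = [
    {"input": "What is our refund window?", "expected": "30 days"},
    {"input": "Do you ship internationally?", "expected": "yes"},
]

with open("eval_run.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected", "output", "score"])
    writer.writeheader()
    for ex in examples:                          # the for loop
        output = run_agent(ex["input"])          # the way to execute your agent
        score = int(ex["expected"].lower() in output.lower())  # the hand-rolled scorer
        writer.writerow({**ex, "output": output, "score": score})
```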
9:30
This is more what I would call documenting; it's not really experimenting. While you have this spreadsheet with a bunch of input examples, and maybe you keep track of the different outputs emitted each time you tweak your agent, that can become cumbersome to manage over time. It's really challenging to compare experiments directly over time, and you're probably not doing a lot of analytics
10:02
across those experiments. The analytics you are performing are likely coming from some type of human scorer, which is really valuable but challenging to scale in practice. Evals are a team sport, kind of what I was talking about before: we want to bring a ton of people into the fold, not just technical folks but also non-technical folks. They can add a lot of value to your agent because of their unique domain expertise and proximity to
10:33
users. They're probably not coming into the spreadsheet, is my point. And it's slow: each time you eval, you have to go through a somewhat cumbersome process to recreate or append to the spreadsheet. So probably one of the most fun conversations I have in my job is when a very proud product engineer gets on a call with me, and they
11:03
puff their chest out and smirk at me and say, "Well, I can just vibe code Braintrust. It's no problem." And I think, if you're just starting your journey, it's a really nice step to go to. Now, instead of being in spreadsheet land, you're making something a little more bespoke for other people, to bring them into the fold. You've probably still got a for loop, but you have a nicer UI now, so it's more approachable. And hopefully you've graduated to some
11:36
database that isn't Excel or Google Sheets; you'd probably roll a new database in something like Neon. So now you have a better story around persistence of evals, and because of this you're bringing more people into the fold and making UIs that are a little more bespoke for your specific users.
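As a rough illustration of that persistence step, here's a sketch using SQLite from the standard library as a stand-in for a hosted Postgres such as Neon; the table and column names are made up for the example:

```python
import json
import sqlite3
import time

conn = sqlite3.connect("evals.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS eval_results (
        experiment  TEXT,      -- e.g. "prompt-v2-temp-0.3"
        created_at  REAL,
        input       TEXT,
        output      TEXT,
        scores      TEXT       -- JSON blob of scorer name -> value
    )
""")

def record_result(experiment, input_text, output, scores):
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
        (experiment, time.time(), input_text, output, json.dumps(scores)),
    )
    conn.commit()

# Experiments are now comparable across runs instead of living in one sheet:
record_result("prompt-v2", "Do you ship internationally?", "Yes, to 40 countries.",
              {"contains_expected": 1})
```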
12:06
The problem here is that you're still not really iterating yet; you're performing work that is more just reporting, just documentation, rather than encouraging a lot of iteration. It's more of a reporting tool. How many people have vibe coded their own UI? Yeah, makes sense. The next step: you want to encourage a lot of experimentation, not just with technical users but with non-technical users. I'm showing an image that is more
12:39
aligned to allowing experimentation for non-technical users, but of course as you're building these platforms you want to allow for a more SDK-driven experience as well; that just doesn't make for a very nice image in a presentation. Experimentation, to me, means that you can give a user access to an agent, a configuration of an agent, and a sandbox, and you allow them to tweak certain
13:09
parameters within that agent. In my example here, I'm allowing a user in the UI to change the system instructions of an agent running outside of my eval platform, letting them compare two different configurations of that system prompt, and I'm running evals across those two different agent runs so that I can bubble up scores. You can see that in the image: I can bubble up different scores to understand, both technically and functionally, how my agent is behaving.
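The playground idea reduces to something like the sketch below: two system-prompt configurations, one toy scorer targeting a single failure mode, and one aggregate score bubbled up per configuration. `call_llm` is a hypothetical stand-in for whatever model call the playground proxies to:

```python
def call_llm(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for whatever model call the playground proxies to."""
    prefix = "Happy to help! " if "warm" in system_prompt else ""
    return prefix + f"Here is an answer to: {user_input}"

def politeness_scorer(output: str) -> float:
    # Toy scorer targeting one failure mode: "the agent sounds curt."
    return 1.0 if any(w in output.lower() for w in ("please", "happy to", "thanks")) else 0.0

configs = {
    "baseline":  "You are a support agent.",
    "candidate": "You are a warm, concise support agent. Always offer next steps.",
}
dataset = ["Where is my order?", "Cancel my subscription."]

for name, system_prompt in configs.items():
    scores = [politeness_scorer(call_llm(system_prompt, x)) for x in dataset]
    print(name, sum(scores) / len(scores))  # one score bubbled up per configuration
```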
13:40
You'll hear that a lot of platforms have a playground feature; you're going to want some type of playground, for both technical and non-technical users. This is where the rubber starts meeting the road, because the best way to perform evals is to really think about the failure modes your agent can fall into and build scoring functions around those failure modes. The best way to find those failure modes in the first
14:11
place is to have access to production trace data, i.e., your agent in front of real users and real usage. So the next step here is a really important one: we want to make sure we can connect what we, at least internally, call the flywheel. Observability and evals, to us, are actually the same problem from a systems perspective. Funny story: three years ago when we started, we were only an eval platform, and then we
14:41
noticed one of our customers was running this massive eval every hour of every day. So we reached out to this person and they said, "Oh yeah, I'm just piping all of my production traffic into this database and I'm running an eval against it." So we thought, okay, we should probably just build the ability to trace and observe actual traffic and account for that use case without having to cram it into offline evals. This is really important: make sure
15:12
that we can observe things in production, understand the actual behavior of our agents, and also understand the real lift that the changes we're making to our agents are having. We analyze that data, pull those actual examples back into an offline environment, and then improve on them using offline evals. This is a loop, not just a one-time process; you're going to be running this loop, hopefully, for the lifetime of the agent you push to production.
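One half of that flywheel, pulling real production examples back into an offline dataset, could be sketched like this; the JSONL layout and field names are assumptions, not any particular platform's format:

```python
import json

def harvest_failures(log_path: str, dataset_path: str, threshold: float = 0.5) -> int:
    """Copy low-scoring production traces into an offline eval dataset."""
    added = 0
    with open(log_path) as logs, open(dataset_path, "a") as dataset:
        for line in logs:
            trace = json.loads(line)
            if trace.get("online_score", 1.0) < threshold:
                dataset.write(json.dumps({
                    "input": trace["input"],          # becomes a new input example
                    "bad_output": trace["output"],    # kept for reference and regression checks
                }) + "\n")
                added += 1
    return added

# e.g. harvest_failures("prod_traces.jsonl", "offline_dataset.jsonl")
```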
15:42
You should be iterating this loop as many times as possible; that's how you improve. As a result, you've changed your scope a little; you've widened it a lot, actually. You are now a tracing platform and a logging platform, in addition to being an offline eval platform. The benefit is that you're starting to get far higher signal from
16:12
how users are interacting with your agents, and you can use those real interactions. You can almost think of it as evaluating production in a safe environment; you're now getting to that point. At this stage you can also perform online evals: you can point scoring functions at your observability traffic and do things like alerting.
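And the online-eval half might look roughly like this: a scoring function applied to each trace as it arrives, with an alert when a rolling average drops. The scorer and the alert hook are placeholders:

```python
from collections import deque

class OnlineEval:
    """Score each production trace as it arrives and alert on a rolling window."""

    def __init__(self, scorer, window: int = 100, alert_below: float = 0.8):
        self.scorer = scorer
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def observe(self, trace: dict) -> None:
        self.scores.append(self.scorer(trace))
        avg = sum(self.scores) / len(self.scores)
        if avg < self.alert_below:
            self.alert(avg)

    def alert(self, avg: float) -> None:
        # Placeholder: wire this to PagerDuty, Slack, etc.
        print(f"ALERT: rolling score dropped to {avg:.2f}")

# monitor = OnlineEval(scorer=lambda t: 1.0 if t.get("output") else 0.0)
```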
16:43
These are all things you could build once you're at this phase of maturity for running evals. The bad part here: if you build it, you have to manage it. So just because you've vibe coded a platform, guess what? You might get a promotion for it, but that's also going to be your job now: to manage and continue to grow your eval platform at the pace the industry is moving. Which can be an exciting challenge; that's kind of the bet our company is making, and
17:14
we're excited to solve that problem. The more important challenge, though, is that agent traces specifically, if you look at the screen, are really nasty. They're not like normal application traces. They are really semi-structured, and a lot of times unstructured; there's just a ton of text inherent to the LLM problems we're solving. They're also very large in addition to being complicated. So if you're trying to cram a one-gigabyte
17:47
trace into a Postgres row, that can lead to a lot of performance problems. And they're numerous: it's high-velocity data, because there's so much usage happening in production, hopefully, with the agent you've pushed. So this is how we used to solve this problem, just as an example. If you're at this stage of maturity, you've got traces coming in, and you're going to need to account for two query patterns. One: if you're doing observability, you need a way for folks to
18:17
instantly be able to see their traces; that's very important to people. So you'll need a very low-latency way to ingest data. Then you also need a second layer of persistence for the query pattern of "I want to be able to analyze these data in aggregate." We used to use an open-source data warehouse for that.
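In miniature, those two persistence layers might be pictured like this, with in-memory stand-ins for both the low-latency store and the analytical warehouse:

```python
class TraceStore:
    """Toy illustration of serving two query patterns from two layers."""

    def __init__(self, batch_size: int = 1000):
        self.hot = {}          # stand-in for a low-latency store (instant trace lookup)
        self.batch = []        # buffered rows headed for an analytical warehouse
        self.warehouse = []    # stand-in for the warehouse (aggregate queries)
        self.batch_size = batch_size

    def ingest(self, trace: dict) -> None:
        self.hot[trace["id"]] = trace            # visible to users immediately
        self.batch.append(trace)
        if len(self.batch) >= self.batch_size:   # flushed in bulk for analytics
            self.warehouse.extend(self.batch)
            self.batch.clear()

    def get_trace(self, trace_id: str) -> dict:   # observability: "show me my trace now"
        return self.hot[trace_id]

    def avg_latency(self) -> float:               # analytics: aggregate over everything
        rows = self.warehouse + self.batch
        return sum(r["latency_ms"] for r in rows) / max(len(rows), 1)
```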
18:50
We stitched those two sources together through a domain-specific language we created called BTQL, which no one liked, including us; we hated it. Then we would perform a third level of aggregation using DuckDB in the browser. This worked for us for a bit, and then it didn't. I'll use one of our customers, Notion, as an example: just a ton of unstructured data that they were sending us. They wanted to be able to
19:20
perform things like full-text search across a trace, and none of these technologies is really equipped to perform text-style analytics, which is a challenge in the LLM domain because there's just so much text. So that leads us to this: measuring agent quality, performing evals, performing observability is actually a systems problem. It's not just a UI/UX problem. We recognize that it's quite easy to vibe code the UI of evals, but it's
19:50
way, way more challenging to create the data layer of a successful eval and observability platform. And not just from a scale perspective, although that matters, but mostly from a functional perspective: allowing people to do the things they would expect to do, like performing full-text search across millions of traces, in their platform of choice. I talked about this a little bit already. The reason this is such a novel problem to solve is across a lot
20:22
of these dimensions, and I won't drain this slide, but: the data comes in really fast, and the data are really large when they come in. A span is just one part of a trace, and a traditional span would be a couple of kilobytes; here we've seen spans that are 10 to 20 megabytes in size, just so much context within those spans. The data are highly unstructured, and then there are also a
20:52
lot of different types of read patterns: you might be performing aggregate reads, but you also want very low-latency reads. None of these problems is individually unique, but together they make for a very unique problem from a systems perspective. So what we've done, and what you would have to endeavor to do if you were building this yourselves, is really think about building
21:23
the right data platform for traces, so that you can satisfy some of the more functional requirements that eventually come down the line. The example I have here: let's say you want to let a coding agent loose on your evals platform so you can be a little more self-healing, grabbing data in aggregate from your evals platform, using a coding agent to pull that into context,
21:53
and changing your agent within a coding-agent session. That's going to be really challenging to do if you can't run plain SQL on the data backend of your evals platform. We've actually noticed a lot of these headless-style use cases come up, where people aren't interested in the UI at all. The only thing they're interested in is: how can I perform evals in a way where I can use Codex or Claude Code to help increase the quality of my agent for me?
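A sketch of that headless pattern, reusing the illustrative `eval_results` SQLite table from the earlier sketch (so the schema is equally hypothetical): run plain SQL over the eval backend and condense the result into a small summary a coding agent can take as context:

```python
import sqlite3

def failure_summary_for_agent(db_path: str = "evals.db", limit: int = 20) -> str:
    """Aggregate the worst-scoring experiments into a compact, paste-able summary."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT experiment,
               COUNT(*) AS n,
               AVG(json_extract(scores, '$.contains_expected')) AS avg_score
        FROM eval_results
        GROUP BY experiment
        ORDER BY avg_score ASC
        LIMIT ?
    """, (limit,)).fetchall()
    lines = [f"{exp}: avg_score={(avg or 0):.2f} over {n} rows" for exp, n, avg in rows]
    return "Worst experiments by score:\n" + "\n".join(lines)

# The returned string is what a coding agent would get as context before editing the agent.
```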
22:29
So the last problem I'll talk about is the "so what" problem. We'll skip this slide for the sake of time; this is how Braintrust does it, and we have a blog about it that just got released, if you're interested. But what kind of comes next, what you can expect to build into your evals platform, is the ability to tell folks the unknown unknowns of their agent. In other words: don't make me look across a whole bunch of traces; just tell me how people are using our agent. You want to uncover those unknown unknowns through topic modeling techniques so that you know where to spend your engineering time.
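A bare-bones version of that topic-modeling pass, using TF-IDF and k-means from scikit-learn rather than anything fancier; `trace_texts` is whatever user-facing text you extract from each trace:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def surface_topics(trace_texts: list[str], n_topics: int = 8, top_terms: int = 5):
    """Cluster production traces and return the dominant terms per cluster."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    X = vec.fit_transform(trace_texts)
    km = KMeans(n_clusters=n_topics, n_init="auto", random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    topics = []
    for center in km.cluster_centers_:
        top = center.argsort()[::-1][:top_terms]
        topics.append(", ".join(terms[i] for i in top))
    return topics  # e.g. ["refund order shipping ...", "password login reset ..."]
```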
22:59
You also want to make sure you're building your platform not just for humans but also for agents, because that's now one of the main media through which people create technology. And we didn't even talk about the non-functional requirements that go into building these
23:30
platforms, like role-based access control and data masking. That's also super important, and it comes up when you want to operate at scale. And then lastly, consider adding automatic tracing through some type of AI proxy or gateway, so that people don't even have a choice but to trace their LLMs; you can govern very centrally by adding tracing automatically to your eval platform.
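The gateway idea in a few lines: every model call goes through one choke point that records a trace before returning, so tracing isn't optional for callers. `forward_to_provider` is a stand-in for the real upstream call:

```python
import time
import uuid

def forward_to_provider(request: dict) -> dict:
    """Stand-in for the actual call to your model provider."""
    return {"output": "..."}

def traced_completion(request: dict, trace_sink) -> dict:
    """Single choke point: every LLM call is traced whether the caller opts in or not."""
    start = time.time()
    response = forward_to_provider(request)
    trace_sink({
        "id": str(uuid.uuid4()),
        "request": request,
        "response": response,
        "latency_ms": (time.time() - start) * 1000,
    })
    return response

# Teams call traced_completion(...) (or hit the gateway URL) instead of the provider directly.
```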
24:01
I appreciate the time. I've got about a minute and twenty seconds left for questions; I can probably take two of them if anyone has any. Yes. >> Okay. I'm not sure about these kinds of problems; often you create, like, dynamic prompts, only, like, solutions.
24:31
Then you often build, like, a custom version. How does Braintrust...? >> So, how does Braintrust specifically handle multimodal outputs and inputs in traces? >> Yeah, very technically, we put them in object storage, reference them, and then display them directly in the trace. So if you have an audio file or a video file, you can play it in the trace when someone's reviewing the trace itself. We don't
25:01
want people to have to exit the platform for that. >> Is prompt management in Braintrust? >> It could be. Yeah. The question was: is prompt management in Braintrust? It could be, or it doesn't have to be. Yeah. Okay. Perfect. Thank you so much for your attention today.