Why building eval platforms is hard — Phil Hetzel, Braintrust

Video: https://youtu.be/_fQ7Z_Wfouk?si=sUADBJFPxxjFAX9q

Video ID: `_fQ7Z_Wfouk`

Duration: 25:39

Source quality: direct transcript extracted successfully; comments taken from the top available YouTube comments.

Core thesis

An eval platform starts as “a spreadsheet plus a for-loop,” but it quickly becomes a serious agent-quality data system. The real problem is not drawing a comparison UI. The hard part is supporting a continuous loop between offline evals and production observability while storing, searching, scoring, and analyzing enormous semi-structured agent traces.

Phil’s central point: evals are not a one-time QA ritual. They are the operating system for improving agents after launch.

Big ideas / key insights

1. Evals and observability are the same loop

At 3:20–4:22, Braintrust is framed around two pillars: evals before production and observability after production. Later, at 14:11–15:42, Phil collapses those into one flywheel: observe real production behavior, pull representative cases back into offline evals, improve the agent, then keep monitoring the effect in production.

That is the strongest idea in the talk. The best eval datasets are not invented in a room; they are harvested from actual user failures, edge cases, and surprises.
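The harvesting step can be sketched in a few lines of Python. This is a minimal illustration, not Braintrust's implementation: the `Trace` fields, the thumbs-down feedback signal, and the `harvest_eval_cases` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """One logged production interaction (field names are illustrative)."""
    trace_id: str
    user_input: str
    agent_output: str
    user_feedback: Optional[int] = None  # assumed signal: -1 = thumbs down, 1 = thumbs up
    errored: bool = False


def harvest_eval_cases(traces, existing_inputs):
    """Turn failing or disliked traces into offline eval cases, skipping duplicate inputs."""
    seen = set(existing_inputs)
    cases = []
    for t in traces:
        looks_bad = t.errored or t.user_feedback == -1
        if looks_bad and t.user_input not in seen:
            seen.add(t.user_input)
            cases.append({
                "input": t.user_input,
                "bad_output": t.agent_output,  # kept as a regression reference
                "source_trace": t.trace_id,
            })
    return cases


production = [
    Trace("t1", "Cancel my order", "Order cancelled.", user_feedback=1),
    Trace("t2", "Refund for order 123?", "I cannot help with that.", user_feedback=-1),
    Trace("t3", "Refund for order 123?", "I cannot help with that.", user_feedback=-1),
    Trace("t4", "Change shipping address", "", errored=True),
]
new_cases = harvest_eval_cases(production, existing_inputs=["Cancel my order"])
print([c["source_trace"] for c in new_cases])
```

In practice the "looks bad" rule would likely combine several signals (online scores, human labels, escalations); the point is only that the offline dataset grows from logged behavior rather than invented examples.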

2. Spreadsheets are a good starting point, not a destination

At 5:53–9:30, Phil gives the spreadsheet stage its due: it proves the team understands there is a quality problem. A for-loop, examples, outputs, handwritten notes, and scores are enough to begin.

The limitation is that a spreadsheet is mostly documentation, not experimentation. It is hard to compare runs, scale scoring, involve non-technical domain experts, or keep iteration fast.
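For concreteness, the "spreadsheet plus a for-loop" stage looks roughly like the sketch below. The `toy_agent` function is a stand-in for a real agent call, and exact match is just one assumed scoring rule; neither comes from the talk.

```python
import csv
import io


def toy_agent(question: str) -> str:
    """Stand-in for a real agent call; replace with your own client."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(question, "I don't know")


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output and expected agree, ignoring case and surrounding whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


examples = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "paris"},
    {"input": "capital of Peru", "expected": "Lima"},
]

# The for-loop: run the agent on each example and record the score.
rows = []
for ex in examples:
    output = toy_agent(ex["input"])
    rows.append({**ex, "output": output, "score": exact_match(output, ex["expected"])})

# The spreadsheet: write the results as CSV (to a string here, to a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["input", "expected", "output", "score"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

mean_score = sum(r["score"] for r in rows) / len(rows)
print(f"mean score: {mean_score:.2f}")
```

Nothing here tracks which prompt, model, or tool version produced a row, which is the gap between documenting results and running experiments.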

3. Vibe-coded eval UIs still hit the same wall

At 11:03–12:06, he describes the proud engineer who says they can “just vibe code” an eval UI. That can be a useful next step: nicer UI, real persistence, more people in the loop.

But it still tends to become a reporting tool unless it supports actual experimentation: changing prompts, comparing configurations, running scoring functions, and surfacing differences in behavior.
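A rough sketch of what "comparing configurations" means in code: run two configurations over the same dataset and report per-case differences, not only the averages. The `style` parameter and the toy agent are hypothetical stand-ins for a real prompt or model change.

```python
def run_agent(config: dict, text: str) -> str:
    """Toy agent: 'style' is a hypothetical knob standing in for a prompt change."""
    if config["style"] == "terse":
        return text.split()[0]
    return text.upper()


def score(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "hello world", "expected": "HELLO WORLD"},
    {"input": "ok", "expected": "ok"},
    {"input": "eval platforms", "expected": "EVAL PLATFORMS"},
]


def run_experiment(config: dict) -> list:
    """Return one score per dataset case, in dataset order."""
    return [score(run_agent(config, ex["input"]), ex["expected"]) for ex in dataset]


baseline = run_experiment({"style": "terse"})
candidate = run_experiment({"style": "shout"})

# Per-case diff: an average can improve while individual cases regress.
improved = [i for i, (b, c) in enumerate(zip(baseline, candidate)) if c > b]
regressed = [i for i, (b, c) in enumerate(zip(baseline, candidate)) if c < b]

print("baseline mean:", sum(baseline) / len(baseline))
print("candidate mean:", sum(candidate) / len(candidate))
print("improved cases:", improved, "regressed cases:", regressed)
```

Here the candidate wins on average but breaks a case the baseline handled, which is exactly the kind of difference a reporting-only tool tends to hide.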

4. The real unlock is a playground plus scoring around failure modes

At 12:39–14:11, Phil describes a mature experimentation layer: give technical and non-technical users a sandbox where they can tweak agent parameters, compare configurations, and run evals over the results.

The important design principle: start from failure modes. Don’t build generic dashboards first. Identify how the agent can fail, then build scoring functions around those failures.
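As an illustration of scoring built around failure modes, the sketch below defines one small scorer per failure. The three failure modes (leaking an email address, refusing needlessly, running too long) are assumptions made up for a hypothetical support agent, not examples from the talk.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def no_pii_leak(output: str) -> float:
    """Fail (0.0) if the output contains something that looks like an email address."""
    return 0.0 if EMAIL.search(output) else 1.0


def no_needless_refusal(output: str) -> float:
    """Fail (0.0) if the output contains a stock refusal phrase."""
    refusals = ("i cannot help", "i can't help", "i am unable")
    return 0.0 if any(p in output.lower() for p in refusals) else 1.0


def within_length(output: str, max_words: int = 50) -> float:
    """Fail (0.0) if the output exceeds the word budget."""
    return 1.0 if len(output.split()) <= max_words else 0.0


# One named scorer per failure mode, so results say *how* an output failed.
SCORERS = {
    "no_pii_leak": no_pii_leak,
    "no_needless_refusal": no_needless_refusal,
    "within_length": within_length,
}


def score_output(output: str) -> dict:
    return {name: fn(output) for name, fn in SCORERS.items()}


report = score_output("I cannot help with that. Email bob@example.com instead.")
print(report)
```

Keyword and regex checks are deliberately crude; the same structure works when a scorer is an LLM judge or a human label, as long as each score maps to a named failure.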

5. Production traces are messy enough to become a systems problem

At 16:43–20:52, the talk shifts from product workflow to infrastructure. Agent traces are not normal app traces: they can be semi-structured or unstructured, full of long text, very large, high velocity, and queried in multiple ways.

That creates competing requirements:

low-latency ingestion so users can inspect traces immediately
aggregate analytics across large datasets
full-text search across millions of traces
support for large spans, sometimes tens of megabytes
enough structure for SQL, agents, and automated analysis

This is the part that makes “eval platform” more than a UI problem.
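To make the competing access patterns concrete, here is a toy in-memory sketch over a handful of spans: point lookup by trace, an aggregate, a size check, and a naive inverted index for full-text search. The span schema is invented for the example, and nothing here reflects how Braintrust stores data; a real system needs dedicated storage and indexing for each path.

```python
import json
from collections import defaultdict

# Semi-structured spans: every payload has a different shape.
spans = [
    {"trace_id": "a", "name": "llm_call", "latency_ms": 820,
     "payload": {"prompt": "summarize the refund policy", "completion": "Refunds take 5 days"}},
    {"trace_id": "a", "name": "tool_call", "latency_ms": 40,
     "payload": {"tool": "lookup_order", "args": {"id": 123}}},
    {"trace_id": "b", "name": "llm_call", "latency_ms": 1500,
     "payload": {"prompt": "draft an apology", "completion": "Sorry about the delay"}},
]


def span_bytes(span: dict) -> int:
    """Serialized size of a span; real spans can reach tens of megabytes."""
    return len(json.dumps(span).encode("utf-8"))


def flatten_text(value):
    """Yield every string found in a nested payload, whatever its shape."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from flatten_text(v)
    elif isinstance(value, list):
        for v in value:
            yield from flatten_text(v)


# Full-text path: a naive inverted index from lowercase token to span positions.
index = defaultdict(set)
for i, span in enumerate(spans):
    for text in flatten_text(span["payload"]):
        for token in text.lower().split():
            index[token].add(i)


def search(term: str) -> list:
    return sorted(index.get(term.lower(), set()))


# Point-lookup path: all spans for one trace.
trace_a = [s for s in spans if s["trace_id"] == "a"]

# Aggregate path: mean latency of LLM calls across all traces.
llm_latencies = [s["latency_ms"] for s in spans if s["name"] == "llm_call"]
mean_llm_latency = sum(llm_latencies) / len(llm_latencies)

print(search("refund"), search("sorry"), len(trace_a), mean_llm_latency)
```

Each of these four operations is trivial on three spans and pulls in a different direction at millions of traces, which is the systems problem the talk describes.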

6. Future eval platforms should be built for agents too

At 21:23–23:30, Phil points at headless workflows: coding agents reading eval data, identifying weaknesses, and proposing or making improvements to the target agent. That only works if the eval backend exposes data in a way agents can query and reason over, for example through SQL-like access rather than only a visual dashboard.
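A small sketch of what SQL-like access could look like from a coding agent's side, using Python's built-in `sqlite3` module. The `eval_results` table, its columns, and the `weakest_scorers` helper are hypothetical; the talk does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eval_results (run_id TEXT, case_id TEXT, scorer TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
    [
        ("run1", "c1", "no_pii_leak", 1.0),
        ("run1", "c2", "no_pii_leak", 0.0),
        ("run1", "c3", "no_pii_leak", 0.0),
        ("run1", "c1", "within_length", 1.0),
        ("run1", "c2", "within_length", 1.0),
        ("run1", "c3", "within_length", 0.0),
    ],
)


def weakest_scorers(conn, run_id: str, limit: int = 3) -> list:
    """Return (scorer, mean_score, n) rows for one run, worst scorer first."""
    cur = conn.execute(
        """
        SELECT scorer, AVG(score) AS mean_score, COUNT(*) AS n
        FROM eval_results
        WHERE run_id = ?
        GROUP BY scorer
        ORDER BY mean_score ASC
        LIMIT ?
        """,
        (run_id, limit),
    )
    return cur.fetchall()


weakest = weakest_scorers(conn, "run1")
for scorer, mean_score, n in weakest:
    print(f"{scorer}: mean={mean_score:.2f} over {n} cases")
```

A query like this gives an agent a ranked list of weaknesses to investigate, something a dashboard screenshot cannot provide.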

Best timestamped moments

4:52 — LLMs are valuable because they are variable, but that same variability creates risk. Agent quality requires confidence under uncertainty.
5:53 — The “Google Sheet eval” stage is validated. Starting crude is better than avoiding evals entirely.
7:26 — The iceberg slide: looping through inputs and recording scores is only the visible tip of eval infrastructure.
8:29 — Evals are a multi-persona problem. Product engineers, AI engineers, systems engineers, and domain experts all need to participate.
11:03 — “I can just vibe code [an eval platform]” is treated as both understandable and insufficient.
13:09 — A useful eval platform lets users compare configurations and score behavior, not just read outputs.
14:11 — The eval/observability flywheel: production traffic becomes the source of better offline evals.
16:43 — If you build it, you own it. Internal eval platforms become permanent infrastructure obligations.
17:14 — Agent traces are “nasty”: big, semi-structured, text-heavy, and unlike normal traces.
19:20 — The key systems claim: measuring agent quality is not just UI/UX; the data layer is the hard part.
21:23 — Eval platforms need to support coding-agent workflows, not just human dashboards.
22:29 — The next frontier is surfacing unknown unknowns through topic modeling and automatic analysis.
24:31 — For multimodal traces, Braintrust stores assets in object storage and references them inside the trace UI.

Practical takeaways / recommended workflow

1. Start with the simplest eval loop. Maintain a set of input examples, run the agent, capture outputs, and score them.

2. Graduate from documentation to experimentation. Track runs, compare configurations, and make it easy to test prompt/model/tool changes.

3. Invite domain experts early. They often recognize a bad answer, and why it is bad, better than engineers do.

4. Define failure modes explicitly. Turn them into scoring functions rather than relying only on generic pass/fail labels.

5. Connect production observability to offline evals. Sample real traces, label failures, and recycle them into regression suites.

6. Design for ugly trace data. Expect large text payloads, semi-structured spans, full-text search needs, and both low-latency and aggregate query paths.

7. Expose eval data to agents. If coding agents will help improve agents, the eval system needs machine-readable access, not only dashboards.

8. Plan ownership before building in-house. A homegrown eval UI can accidentally become a long-term platform team commitment.

Comment insights

The comment section is small but unusually pointed.

Agreement / appreciation

There is light agreement that evals are a vital topic. One commenter calls it a “great talk on such a vital topic,” which matches the audience interest in the room: evals are now a mainstream concern for people shipping agents.

Disagreement / pushback patterns

The strongest reaction is skepticism about vendor framing. The top-liked comment says the talk claimed it would not be a sales talk but “feels like a sales talk.” Another commenter generalizes that when someone says they will not do something, they often proceed to do it anyway.

That matters because eval-platform advice from a vendor sits in a trust gap: even when the concepts are useful, viewers may discount the message if it feels like category education that conveniently points back to the vendor’s product.

Practitioner additions

No commenters added substantial implementation details, architectures, tools, or workflows beyond what was in the talk. The extracted comments are more about reception and framing than extra practice.

Memorable phrases from comments

“Feels like a sales talk.”
“Great talk on such a vital topic.”
“If people say they won’t do something, they definitely going to do it.”

Caveats raised by commenters

One commenter pushes back on the claim that “we love LLMs because they are highly variable,” arguing that variability has also generated “tangent sectors” because the systems are not reliable. The useful caveat: variability is not automatically a feature. It is only valuable when paired with strong evaluation, observability, and control loops.

Concrete tools/workflows mentioned by commenters

None beyond the video’s own concepts. The talk itself mentions spreadsheets, for-loops, playgrounds, production tracing, scoring functions, online evals, alerting, object storage for multimodal assets, SQL-accessible backends, topic modeling, RBAC, data masking, and AI gateways/proxies.

My read / why it matters

This talk is most useful as a maturity model. The headline is not “buy an eval platform”; it is “your eval stack will become production infrastructure if your agent matters.”

The practical warning is sharp: teams often underestimate evals because the first version is easy. A spreadsheet works. A quick UI works. But once the agent is in production, the hard questions become data questions: What actually happened? Can we search it? Can we score it? Can we replay it? Can we compare regressions? Can domain experts inspect it? Can an agent use it to improve the system?

The comment pushback is fair too. The talk has vendor gravity. But the underlying argument holds: if agents are becoming customer-facing software, evals and observability need to merge into a continuous quality loop.