The maturity phases of running evals — Phil Hetzel, Braintrust

Source: AI Engineer

Speaker: Phil Hetzel, Braintrust
Duration: 18:33
Generated from local transcript/comment/keyframe extraction on 2026-05-27.

Actionable Insights

Start with documented vibe checks, then turn the justifications into criteria. Phil’s first maturity stage is deliberately pragmatic: run roughly 10 realistic inputs through the agent, have a builder or subject-matter expert mark thumbs-up/thumbs-down, and require a short justification for every rating. Do not treat this as “just vibes”; treat it as a data-capture step. Create a small eval_seed_set.jsonl with fields like input, output, rating, justification, failure_mode, and reviewer. The first success metric is not automation; it is whether reviewers can explain what “good” and “bad” mean consistently enough to reuse later.
Mine human justifications into named failure modes before writing judges. Around 9:58, Phil suggests using tools like Cursor, Claude Code, or Codex over the human review notes to derive the actual failure modes behind thumbs-down ratings. A useful workflow is: export annotations, cluster the justifications, name 5–10 recurring failure modes, then write one scorer per high-risk mode. Keep the raw reviewer language beside each failure-mode definition so the scorer remains anchored to real domain expectations instead of abstract rubric prose.
Use LLM-as-judge only after you have a calibration set. Phil’s warning is blunt: putting “a robe and a cloak” on an LLM does not make it inherently trustworthy. Before relying on an LLM judge, build a small ground-truth set from human labels and measure agreement between the judge and the human decision. Track precision/recall by failure mode, not just one aggregate score. If the judge is directional but noisy, use it for trend detection and triage, not hard release blocking.
Separate subjective quality scorers from deterministic guardrail scorers. The talk explicitly supports both. Use deterministic code for things like tool-call count, token budget, JSON validity, latency, cost ceilings, forbidden actions, or whether required fields were filled. Use LLM judges for subjective qualities like helpfulness, tone, reasoning sufficiency, or whether a response met a domain-specific intent. This avoids wasting expensive judge calls on checks that plain code can perform more reliably.
Rerun production, don’t merely run tests. Phil’s strongest operational point is the eval flywheel: capture production or UAT traces, identify failures through humans or automated scoring, feed representative examples back into offline datasets, and rerun the real agent/prompt against them before shipping changes. In practice, add a weekly or CI job that samples recent production traces into an eval dataset, dedupes near-identical cases, preserves metadata, and compares candidate agent versions against the prior baseline.
For tool-using agents, evaluate traces and external state, not just final answers. Once an agent uses context-gathering tools or CRUD tools, the final answer can look fine while the path is dangerous or inefficient. Capture the full trace, tool arguments, tool outputs, token/cost data, and the relevant external-system state at the time of execution. For offline evals, use mocks, snapshots, versioned vector indexes, or point-in-time (timestamped) queries so the agent does not mutate production systems while being replayed.
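
As a rough sketch of the eval_seed_set.jsonl artifact from the first insight above: the field names come from that insight, while the SeedRecord class, the "up"/"down" rating encoding, and the helper function names are assumptions made for illustration, not anything shown in the talk.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class SeedRecord:
    """One human-reviewed example; field names follow the insight above."""
    input: str
    output: str
    rating: str          # "up" or "down" (hypothetical encoding)
    justification: str
    failure_mode: str    # may stay empty until failure modes are named
    reviewer: str


def validate(record: SeedRecord) -> None:
    # The justification is the data being captured, so it is mandatory.
    if record.rating not in {"up", "down"}:
        raise ValueError(f"unknown rating: {record.rating!r}")
    if not record.justification.strip():
        raise ValueError("every rating requires a justification")


def write_seed_set(path: Path, records: list[SeedRecord]) -> None:
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            validate(record)
            f.write(json.dumps(asdict(record)) + "\n")


def read_seed_set(path: Path) -> list[SeedRecord]:
    with path.open(encoding="utf-8") as f:
        return [SeedRecord(**json.loads(line)) for line in f if line.strip()]
```

Rejecting a rating without a justification at write time keeps the seed set usable later as rubric material or judge calibration data.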

Core thesis

Evals mature from informal human judgment into a production feedback system. The path is not “write perfect tests for every possible failure.” Instead, teams start with documented human judgment, extract failure modes, automate the parts that can be automated, then close the loop by replaying production traces through offline experiments.

Phil’s central distinction is that evals are not unit tests. Unit tests are meant to be exhaustive over known deterministic behavior; agent evals are meant to provide directional confidence over a shifting set of real-world failure modes. That makes the quality of the dataset, rubric, trace capture, and production feedback loop more important than trying to enumerate every possible edge case.

Big ideas / key insights

Agent quality is both defensive and offensive. Evals reduce brand, legal, compliance, cost, and systems risk, but they also let teams see whether each prompt/model/tooling change actually improved the application.
The evaluation primitive is task + data + scores. Phil describes an eval as three parts: the task under test, the dataset/examples that invoke it, and scoring functions that judge quality or utility. This matches Braintrust’s documentation, which defines evaluation anatomy as Data, Task, and Scores.
Early human annotation is not a waste. Manual thumbs-up/thumbs-down plus justification extracts domain knowledge from experts. That domain knowledge becomes the basis for automated judges and failure-mode-specific scorers.
LLM judges need their own evals. The Q&A reinforces this: subjective work often requires LLM-as-judge, but teams should evaluate the judge against human labels. Phil frames it as “eval the eval.”
Tool use raises the maturity bar. Once agents call tools, especially tools that create/update/delete external records, evals need traces, mocks, snapshots, and state reconstruction. Output-only scoring becomes insufficient.
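
The task + data + scores primitive described above can be sketched in a few lines of plain Python. This is not Braintrust's SDK; run_eval, toy_task, exact_match, and the example shape are hypothetical names chosen only to show how the three parts fit together.

```python
from statistics import mean
from typing import Callable

Example = dict  # assumed shape: {"input": str, "expected": str}
Scorer = Callable[[str, str, str], float]  # (input, output, expected) -> score in [0, 1]


def run_eval(task: Callable[[str], str],
             data: list[Example],
             scorers: dict[str, Scorer]) -> dict[str, float]:
    """Run the task over every example and average each scorer's results."""
    per_scorer: dict[str, list[float]] = {name: [] for name in scorers}
    for example in data:
        output = task(example["input"])
        for name, scorer in scorers.items():
            per_scorer[name].append(
                scorer(example["input"], output, example["expected"]))
    return {name: mean(values) for name, values in per_scorer.items()}


def toy_task(text: str) -> str:
    # Stand-in task; a real one would call the agent or prompt under test.
    return text.upper()


def exact_match(_input: str, output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


data = [
    {"input": "hi", "expected": "HI"},
    {"input": "bye", "expected": "Bye"},
]
summary = run_eval(toy_task, data, {"exact_match": exact_match})
print(summary)
```

Keeping scorers as named functions makes it straightforward to add one scorer per failure mode later without changing the loop.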

Best timestamped moments with interpretation

3:19–4:19 — Phil frames evals as a response to real usage and real users. The important move is connecting evals to risk management and improvement tracking, not just correctness.
4:19–5:22 — The “evals are not unit tests” section is the talk’s conceptual anchor. The practical interpretation: prioritize known failure modes and directional trend measurement over impossible exhaustive coverage.
6:54–8:57 — The first maturity stage: vibe checks are acceptable if they are documented. The important artifact is the reviewer justification because it can later become training data, rubric material, or judge calibration data.
9:58–11:29 — The second maturity stage: derive failure modes, use LLM-as-judge where appropriate, and keep deterministic checks for objective failure modes such as too many tool calls or tokens.
11:29–12:29 — The production flywheel: capture production or UAT traces, understand what went wrong, bring those examples into an offline experiment, and use evals to decide which direction to improve the agent.
13:01–16:08 — Tool-using agents require trace-level evaluation and state-aware replay. Phil calls out the hard part: reproducing external system state and avoiding destructive offline CRUD calls.
17:11–18:13 — In Q&A, Phil says deterministic graders deserve respect, but subjective agent behavior still benefits from LLM-as-judge if the judge itself is evaluated against human ground truth.

Practical takeaways / recommended workflow

Build a starter dataset from 10–30 realistic user inputs.
Capture agent outputs and have a domain expert rate each with a justification.
Cluster justifications into recurring failure modes.
Write deterministic scorers for objective constraints: schema validity, tool count, token budget, latency, cost, safety boundaries, and required citations/fields.
Write LLM judges only for subjective criteria and validate them against the human-labeled set.
Start logging production/UAT traces with enough context to replay them.
Promote interesting production failures into the offline eval dataset.
In CI or before release, compare candidate agent versions against the baseline on this dataset.
For tool-using agents, replay with mocks/snapshots and score intermediate tool calls, not just final prose.
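
The deterministic-scorer and baseline-comparison steps above might look like the following minimal sketch. The run shape ({"output": ..., "tool_calls": [...]}), the default budget of 5 calls, and all function names are illustrative assumptions.

```python
import json


def valid_json(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def within_tool_budget(tool_calls: list[dict], max_calls: int = 5) -> float:
    return 1.0 if len(tool_calls) <= max_calls else 0.0


def score_run(run: dict) -> dict[str, float]:
    return {
        "valid_json": valid_json(run["output"]),
        "tool_budget": within_tool_budget(run["tool_calls"]),
    }


def mean_scores(runs: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for run in runs:
        for name, value in score_run(run).items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(runs) for name, total in totals.items()}


def compare(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Mean score delta (candidate minus baseline) per scorer on the same dataset."""
    base, cand = mean_scores(baseline), mean_scores(candidate)
    return {name: cand[name] - base[name] for name in base}
```

A CI job could fail the build when any delta drops below a chosen threshold, while noisier LLM-judge scores are reported as trends rather than used as gates.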

Comment insights

Only one top comment was extracted, but it usefully captures the audience mood: a practitioner jokes that software teams spent decades asking people to test and follow good practices, then temporarily abandoned that discipline because an “anthropomorphic while-loop” could do useful work. The comment’s value is not a new technique; it is confirmation that the pendulum is swinging back toward engineering discipline. The implied pushback is against “vibe-only” AI development. Phil’s talk aligns with that sentiment, but with a softer stance: vibes are allowed at the start if they are documented and converted into eval criteria.

Deep research on the main claims

Claim 1: Evals and observability are closely related for agent quality.
Braintrust’s own evaluation documentation supports this framing: it describes an evaluation cycle that starts with playground iteration, moves to experiments and CI/CD, then scores production traffic and feeds interesting production traces back into datasets. That is essentially Phil’s flywheel: offline evals and online observability reinforcing each other. Verdict: supported, with the caveat that this is also Braintrust’s product framing, not a neutral taxonomy.

Claim 2: Evals are not unit tests and should focus on failure modes rather than exhaustive coverage.
This claim is well supported by the nature of LLM systems: outputs can vary, expected answers may be subjective, and regressions can be metric-specific. Braintrust’s documentation similarly says AI systems differ from traditional software because the same input can produce different outputs and there is rarely a single correct answer. Verdict: agree with high confidence. What the claim underplays is that deterministic checks still matter a great deal for the non-subjective parts of an agent system.

Claim 3: LLM-as-judge is useful but must be evaluated.
The transcript directly includes Phil’s warning that an LLM judge is not trustworthy simply because it is acting as a judge. Braintrust documentation also lists LLM-as-a-judge as one scoring option alongside code-based/custom scorers. The broader best practice is to calibrate LLM judges against human labels and use them where subjective judgment is required. Verdict: agree with high confidence. Overclaim risk: treating agreement on one calibration set as proof of general judge quality, when calibration only holds for the domain and rubric it was measured on.
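
One way to "eval the eval" is to compute judge precision and recall per failure mode against the human labels. The sketch below assumes each row records a failure mode plus a boolean human verdict and a boolean judge verdict; the row shape and function name are hypothetical.

```python
from collections import defaultdict


def calibration_by_failure_mode(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Precision/recall of the judge per failure mode.

    Assumed row shape: {"failure_mode": str, "human_fail": bool, "judge_fail": bool}.
    The human label is treated as ground truth and "fail" as the positive class.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in rows:
        c = counts[row["failure_mode"]]
        if row["judge_fail"] and row["human_fail"]:
            c["tp"] += 1
        elif row["judge_fail"]:
            c["fp"] += 1          # judge flagged something the human accepted
        elif row["human_fail"]:
            c["fn"] += 1          # judge missed a failure the human caught
    report = {}
    for mode, c in counts.items():
        flagged = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        report[mode] = {
            "precision": c["tp"] / flagged if flagged else 0.0,
            "recall": c["tp"] / actual if actual else 0.0,
        }
    return report
```

Low recall on a high-risk failure mode is a reason to keep that judge as a triage signal rather than a release gate.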

Claim 4: Production traces should become offline eval data.
Braintrust’s docs explicitly describe feeding production traces back into datasets to improve offline test coverage, and Phil spends several minutes on the same flywheel. Verdict: agree with high confidence. The implementation caveat is privacy and state capture: production traces may contain sensitive data and may depend on external systems that are hard to replay safely.
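
A minimal sketch of promoting failed production traces into an offline dataset, with the privacy caveat in mind. The trace fields (trace_id, agent_version, failed, input) are assumed, the email regex is a deliberately simplistic stand-in for real PII filtering, and the dedupe only catches case and whitespace variants, not paraphrases.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    # Illustrative only: real PII filtering needs far more than an email regex.
    return EMAIL.sub("[EMAIL]", text)


def dedupe_key(text: str) -> str:
    # Normalize case and whitespace before hashing.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def promote_traces(traces: list[dict], dataset: list[dict]) -> list[dict]:
    """Append failed traces to the dataset, redacted and deduplicated."""
    seen = {dedupe_key(row["input"]) for row in dataset}
    for trace in traces:
        if not trace.get("failed"):
            continue
        clean_input = redact(trace["input"])
        key = dedupe_key(clean_input)
        if key in seen:
            continue
        seen.add(key)
        dataset.append({
            "input": clean_input,
            # Preserve metadata so a regression can be traced to its origin.
            "metadata": {
                "trace_id": trace["trace_id"],
                "agent_version": trace["agent_version"],
            },
        })
    return dataset
```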

Claim 5: Tool-using agents require trace-level and state-aware evaluation.
The transcript’s sections on context-gathering tools, CRUD tools, mocks, and timestamp/version queries are persuasive. Output-only evaluation misses wrong tool calls, excessive calls, unsafe mutations, and stale retrieval. Verdict: agree with medium-high confidence. Residual uncertainty: Phil says the problem is not completely solved, which is accurate; teams still need bespoke mocking/sandboxing/state-replay infrastructure.
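
To illustrate replaying against a mock rather than a live system, here is a small sketch. MockCrm, toy_agent, and score_trace are hypothetical names; the point is that the mock is seeded from a snapshot, records every tool call, and lets a scorer flag a destructive call even when the final answer reads fine.

```python
import copy
from typing import Optional


class MockCrm:
    """In-memory stand-in for an external system, seeded from a snapshot."""

    def __init__(self, snapshot: dict):
        self.records = copy.deepcopy(snapshot)  # never mutate the snapshot itself
        self.calls: list[dict] = []             # recorded tool-call trace

    def get(self, record_id: str) -> Optional[dict]:
        self.calls.append({"tool": "get", "args": {"record_id": record_id}})
        return self.records.get(record_id)

    def delete(self, record_id: str) -> bool:
        self.calls.append({"tool": "delete", "args": {"record_id": record_id}})
        return self.records.pop(record_id, None) is not None


def toy_agent(crm: MockCrm, request: str) -> str:
    # Stand-in agent: looks up a record, then (wrongly) deletes it.
    crm.get("42")
    crm.delete("42")
    return "Here are the account details."


def score_trace(calls: list[dict], forbidden: set, max_calls: int) -> dict:
    used = [call["tool"] for call in calls]
    return {
        "no_forbidden_tools": 0.0 if forbidden & set(used) else 1.0,
        "within_call_budget": 1.0 if len(used) <= max_calls else 0.0,
    }


snapshot = {"42": {"name": "Acme"}}
crm = MockCrm(snapshot)
answer = toy_agent(crm, "Show me account 42")
scores = score_trace(crm.calls, forbidden={"delete"}, max_calls=3)
print(answer, scores)
```

An output-only scorer would pass this run; the trace-level scorer catches the delete, and the original snapshot stays untouched for the next replay.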

Verdict

My verdict: agree, with high confidence, on the overall maturity model. The talk’s advice is practical because it avoids two common traps: pretending informal review is worthless, and pretending automated judges are magic. The best part is the sequence: document human judgment, derive failure modes, automate carefully, then use production traces to keep the eval set alive.

The main thing underplayed is governance around production trace reuse: privacy filtering, PII redaction, customer consent boundaries, and retention policy matter when production traffic becomes eval data. The talk also compresses a difficult platform problem into a few minutes; teams still need to decide how to snapshot external state, mock tools, and prevent eval runs from performing real destructive operations.

Screen-level insights

0:14 — The first extracted keyframe shows the sponsor slide with Braintrust, WorkOS, and OpenAI. It establishes the conference/sponsor context rather than technical content.
0:45 — The agenda slide lists Intro, Overview, Different stages of eval platform builds, and What’s next. This confirms that the talk is framed as a staged maturity model, not a product demo.
1:47 — The speaker-intro slide identifies Phil Hetzel as Head of Solution Engineering at Braintrust, with prior consulting/implementation background and Databricks business-unit experience at Slalom. This matters because the talk is grounded in enterprise implementation patterns: teams making GenAI POCs but struggling to productionize them.

The available frame extraction captured only three early frames, so later stage-specific slide visuals were not available for direct visual inspection. The transcript still provides the substance of the four-stage framework.

My read / why it matters

This is a useful 18-minute talk for teams that are stuck between “we eyeball outputs” and “we need a full eval platform.” The key reassurance is that the first step does not need to be fancy. A small, documented, human-reviewed dataset is enough to begin. The key warning is that automation without calibration can create false confidence.

For an engineering team, the immediate next move is to formalize the annotation loop and production-trace flywheel before buying or building too much platform. A lightweight but disciplined eval loop will expose which platform capabilities are actually needed: annotation UI, trace capture, scorer registry, CI integration, production sampling, mocks, or judge calibration.

Verification notes

Checked local extraction artifacts: transcript chunks, one extracted top comment, and three keyframes. Cross-checked the main evaluation anatomy and production-feedback claims against Braintrust’s public “Evaluate systematically” documentation, which describes Data/Task/Scores, offline experiments, CI/CD, online scoring, and feeding production traces back into datasets. Actionable Insights audit: the top section contains concrete first steps, artifacts to create, scorer choices, calibration guidance, and trace-replay cautions rather than generic summary bullets. Fidelity caveat: only one comment and three early keyframes were available from extraction, so comment and screen-level insights are necessarily limited. No unsupported install commands were included.