Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Actionable Insights
- Instrument traces before writing evals. Add OpenTelemetry/Phoenix tracing to every agent turn, LLM call, and tool call. Tools: Arize Phoenix (https://phoenix.arize.com / https://github.com/Arize-ai/phoenix), OpenTelemetry (https://opentelemetry.io). Evaluation: every failed run should expose inputs, outputs, tools, latency, tokens, and errors.
- Do error analysis before metrics. Sample 30-100 traces and tag failure modes: wrong tool, missing context, hallucinated citation, bad user vocabulary, timeout, unsafe action. Only then write evals. This follows the workshop’s “look at the data” step, which most tutorials skip.
- Layer cheap deterministic checks with semantic LLM judges. Start with code checks: JSON schema, required fields, forbidden actions, URL validity. Add built-in faithfulness/relevance evals, then custom LLM judges for semantic criteria. Evaluate judge quality via meta-eval against human labels.
- Promote failing traces into datasets and experiments. Create a dataset from representative failures, then run experiments for each prompt/tool/model change. Gate merges on regression tests just like normal CI. Metric set: pass rate by failure class, cost, latency, precision/recall for classifiers.
- Teach agents user-vocabulary edge cases. The talk notes agents fail when users use unexpected vocabulary. Build a paraphrase/edge-case set from support logs and comments. Evaluate recall on “dumb” or imprecise inputs, not just expert prompts.
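The deterministic layer described above can be sketched with the standard library alone. This is a minimal illustration, not the workshop's code: the field names, the forbidden-action list, and the `check_output` helper are all hypothetical and would be replaced by whatever the agent's real output contract requires.

```python
import json
from urllib.parse import urlparse

# Hypothetical policy values, chosen only for illustration.
REQUIRED_FIELDS = {"answer", "citations", "actions"}
FORBIDDEN_ACTIONS = {"delete_account", "send_payment"}


def check_output(raw: str) -> list[str]:
    """Return a list of failure tags; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["not_an_object"]

    failures = []
    # Required-field check.
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append("missing_fields:" + ",".join(sorted(missing)))
    # Forbidden-action check.
    for action in data.get("actions", []):
        if action in FORBIDDEN_ACTIONS:
            failures.append("forbidden_action:" + str(action))
    # URL validity check: syntactic only, no network request is made.
    for url in data.get("citations", []):
        parsed = urlparse(str(url))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            failures.append("bad_url:" + str(url))
    return failures
```

Because the function returns tags rather than a single boolean, the same output can feed the failure-mode tally from error analysis. Semantic criteria such as faithfulness still need a judge; this layer only filters out the cheap, unambiguous failures first.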
Core thesis
Agent teams should treat evals as tests powered by traces: inspect real failures first, then combine deterministic, built-in, LLM, meta-eval, dataset, and experiment loops before shipping.
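To make "evals as tests powered by traces" concrete, here is a small sketch that uses an in-memory list as a stand-in for a trace backend. It deliberately does not use the OpenTelemetry or Phoenix APIs; the `span` helper, the record fields, and the `lookup_order` tool are assumptions made for illustration only.

```python
import time
from contextlib import contextmanager

TRACE = []  # in-memory stand-in for a real trace backend (assumption)


@contextmanager
def span(kind, name, inputs):
    """Record one unit of work: an agent turn, an LLM call, or a tool call."""
    record = {"kind": kind, "name": name, "inputs": inputs,
              "output": None, "error": None, "latency_s": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        # Always store latency and the record, even when the call failed.
        record["latency_s"] = time.perf_counter() - start
        TRACE.append(record)


def eval_no_tool_errors(trace):
    """An eval written as a test: pass only if no tool span recorded an error."""
    return all(r["error"] is None for r in trace if r["kind"] == "tool")


# Hypothetical tool call that succeeds.
with span("tool", "lookup_order", {"order_id": "A1"}) as s:
    s["output"] = {"status": "shipped"}

# Hypothetical tool call that fails; the failure is still captured.
try:
    with span("tool", "lookup_order", {"order_id": ""}) as s:
        raise ValueError("empty order_id")
except ValueError:
    pass
```

The point of the sketch is the ordering: the trace exists before the eval does, so the eval is just a function over recorded data and can be rerun on old runs after it is written.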
Big ideas / key insights
- The valuable pattern is not “let the agent run longer”; it is to make the work inspectable, measurable, and interruptible.
- The transcript evidence points to concrete workflow design: artifacts, traces, evals, policies, or specs that survive a single chat context.
- The comment evidence is used as a sanity check: where practitioners push back, the verdicts below are deliberately more conservative.
- The strongest practical takeaway is to convert the creator’s idea into a small pilot with explicit success/failure criteria before standardizing it.
Best timestamped moments
- 0:44 — Workshop roadmap: tracing, running an agent, reading traces, writing evals.
- 1:14 — Most tutorials skip looking at trace data before writing metrics.
- 1:46 — Meta-evaluation: test whether judges judge correctly.
- 3:49 — Phoenix Cloud is used for log data/traces.
- 5:54 — Evals are tests; traces are log data for AI systems.
- 6:55 — The vibes problem: manual spot checks miss edge cases and regressions.
- 7:25 — Agents fail on unexpected user vocabulary.
- 8:28 — Faithfulness evals catch misuse of source material.
- 1:32:47 — Datasets and experiments support iterative improvement.
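The meta-evaluation and precision/recall moments above can be combined into one small check: score the judge against human labels. With a label of True meaning "this response is a failure", precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below assumes both label lists are aligned by example; the `judge_quality` name is hypothetical.

```python
def judge_quality(human, judge):
    """Compare judge labels with human labels (True = 'response is a failure')."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # judge flags a real failure
    fp = sum(1 for h, j in pairs if not h and j)    # judge flags a good response
    fn = sum(1 for h, j in pairs if h and not j)    # judge misses a real failure
    agree = sum(1 for h, j in pairs if h == j)
    return {
        "agreement": agree / len(pairs) if pairs else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Illustrative labels only, not data from the workshop.
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
scores = judge_quality(human_labels, judge_labels)
```

Low recall means the judge lets real failures through; low precision means it raises false alarms. Which one matters more depends on the cost of each error in the application.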
Practical takeaways / recommended workflow
- Create the durable artifact first. Write the spec/rubric/policy/trace schema before letting agents perform expensive work.
- Run a constrained pilot. Pick one repository, one team, or one workflow; record baseline cost, latency, failure rate, and review time.
- Instrument the loop. Capture traces, commands, tool calls, test results, and human corrections so the workflow can be evaluated later.
- Add gates. Require acceptance tests, human approval for sensitive actions, and rollback paths before allowing broader automation.
- Review after 5-10 runs. Keep the practice only if it improves measurable outcomes, not just because the demo felt compelling.
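The "record a baseline, then add gates" steps above can be sketched as a comparison of pass rates per failure class. The data shape (a list of `(failure_class, passed)` pairs) and both function names are assumptions for illustration; a real setup would read these results from an experiment run.

```python
def pass_rates(results):
    """Map each failure class to its pass rate, given (failure_class, passed) pairs."""
    totals, passes = {}, {}
    for failure_class, passed in results:
        totals[failure_class] = totals.get(failure_class, 0) + 1
        passes[failure_class] = passes.get(failure_class, 0) + (1 if passed else 0)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Return classes whose pass rate fell by more than `tolerance` versus baseline.

    A class that is missing from the candidate run counts as a pass rate of 0.
    """
    base, cand = pass_rates(baseline), pass_rates(candidate)
    return sorted(c for c in base if cand.get(c, 0.0) < base[c] - tolerance)


# Illustrative results only.
baseline_run = [("wrong_tool", True), ("wrong_tool", False), ("timeout", True)]
candidate_run = [("wrong_tool", True), ("wrong_tool", True), ("timeout", False)]
failed_classes = regressions(baseline_run, candidate_run)
```

In a CI job, a non-empty `failed_classes` list would fail the build. Comparing per class rather than on the overall pass rate keeps an improvement in one failure mode from hiding a regression in another.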
Comment insights
The only substantive comment asks for the notebook, which reinforces demand for runnable artifacts. There is little comment-derived critique, so the analysis relies on the transcript and external eval practice.
Deep research
- Arize Phoenix docs. Phoenix provides open-source tracing, datasets, experiments, and eval workflows for LLM apps. Source: https://github.com/Arize-ai/phoenix
- OpenTelemetry GenAI instrumentation. OpenTelemetry provides standard tracing concepts for spans/attributes; GenAI semantic conventions are emerging.
- RAG/eval literature. Faithfulness/groundedness/relevance evals are common but imperfect; LLM judges need calibration.
- DORA/CI testing analogy. Regression gates are valuable only when they represent real failure modes and run continuously.
Evidence quality note: research here uses named public documentation, standards, and widely known project sources where available. Some vendor claims are treated as product claims unless independently benchmarked in the user’s environment.
Verdicts
- Evals are tests powered by traces: Agree / high confidence.
- Human spot-checking does not scale: Agree / high confidence.
- LLM judges are necessary for agent evals: Mixed / medium confidence. Useful for semantic checks, but should be paired with deterministic checks and meta-eval.
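One way to pair the two layers named in the last verdict is a cascade: run the cheap deterministic checks first and call the judge only when they all pass. The sketch below uses a stub in place of a real LLM judge; `evaluate`, `non_empty`, and `stub_judge` are hypothetical names, and the keyword match inside the stub is a placeholder for a model call.

```python
from typing import Callable

judge_calls = []  # records inputs so the number of judge invocations is visible


def non_empty(text: str) -> bool:
    """Deterministic check: the output must contain non-whitespace text."""
    return bool(text.strip())


def stub_judge(text: str) -> bool:
    """Stand-in for an LLM judge (assumption): a real one would call a model."""
    judge_calls.append(text)
    return "refund" in text


def evaluate(output: str,
             checks: list[Callable[[str], bool]],
             judge: Callable[[str], bool]) -> dict:
    """Run cheap checks first; call the judge only if all of them pass."""
    for check in checks:
        if not check(output):
            return {"passed": False, "stage": "deterministic",
                    "failed_check": check.__name__}
    return {"passed": judge(output), "stage": "judge", "failed_check": None}
```

The `stage` field records which layer produced the verdict, which helps when auditing how often the judge is actually the deciding factor.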
Screen-level insights
Frames show the workshop agenda, Phoenix setup, precision/recall explanation, and Phoenix experiments UI. The visual step matters because hands-on evals require exact UI/data workflow, not abstract advice.
Representative extracted frame anchors checked against transcript context:
- 0:44 — image youtube-extract/Xfl50508LZM/frames/000_000044.jpg; transcript context: eval are uh why you need them and why agents make evaluation harder than a simple LLM call is. Uh and then we’re going to set up tracing uh which is how you capture the raw data that you need to run evals in the first place. Uh we’re also going to uh run a simple AI agent with the claw a agent SDK uh and look at the traces that it produces. Once we’ve looked
- 3:18 — image youtube-extract/Xfl50508LZM/frames/002_000198.jpg; transcript context: evaluating this agent that I’ve already written. Um we’re also going to be using Claude both to power the agent and to power the evals. Uh I picked claude because everybody seems to have switched to cloud in the last couple of months. like I’m hoping you have a cloud AP a cloud account already so you don’t have to sign up for an API key right now but if you
- 1:26:09 — image youtube-extract/Xfl50508LZM/frames/054_005169.jpg; transcript context: great for spam for instance, uh because you don’t want to send a real email to spam and you were okay with getting a certain amount of actual spam uh in exchange for not doing that. Uh but recall is the opposite. It is out of the real positives and the misses, what percentage uh were really positive. Uh this is uh for this one to make it go up you want to mi
- 1:32:47 — image youtube-extract/Xfl50508LZM/frames/057_005567.jpg; transcript context: what uh experiments are for. So for this we go to a completely different part of the Phoenix UI. We go to the Whoops. There we go. You didn’t see that. Uh we go to our experiments uh evaluation. To do that uh I’m going to you can go to to produce your data set. uh you go to your uh uh to your traces and you take for instance a bunch of failing traces uh
My read / why it matters
This video is useful if you convert it into an operating procedure rather than copying the headline. The durable lesson is about control surfaces for AI work: traces that teams can audit, evals that catch regressions, and datasets and experiments that make each change measurable. The risky version is adopting the slogan without the measurement and governance layer.
Verification notes
- Source/evidence audit: Checked the extracted transcript/comment packet and named external sources/docs relevant to the main claims. Vendor/tool links are identified as vendor/project sources, not neutral proof of effectiveness.
- Transcript/comment/frame fidelity audit: Timestamped moments and comment insights were kept close to extracted evidence in youtube-extract/Xfl50508LZM/ and the draft packet. Screen claims are limited to the extracted key-frame metadata and visible UI descriptions; for -QFHIoCo-Ko, no frame-derived claims are made because key frames were not extracted.
- Hallucination/overclaim audit: Headline claims were softened where evidence was insufficient. Verdicts explicitly mark mixed/low-confidence claims and separate practical heuristics from proven facts.
- Actionable Insights audit: The top section was checked for executable first steps, tools/commands or links where available, evaluation criteria, and cautions. Generic summary bullets were rewritten as workflow steps.
- Residual uncertainty: I did not have independent benchmark results for the specific demos, and several claims would need local measurement before adoption. Transcript extraction status was marked unknown by the extractor, so the analysis relies on the processor’s excerpted transcript evidence rather than a full raw transcript page.