Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
Actionable Insights
- Start with domain-specific eval dimensions, not generic hallucination. Create evals/rubric.md with 3-6 criteria tied to your product policy: policy adherence, tool correctness, escalation quality, factual grounding, tone. Have subject-matter experts label 50-200 traces with pass/fail and reasons before optimizing prompts. Evaluate correlation with human labels, not vibes.
- Build a labeled trace set before deploying an LLM judge. Export traces from Phoenix/Agenta/LangSmith/OpenTelemetry, sample failures and successes, and annotate reason codes. Minimum experiment: 100 traces, stratified by failure mode, with two reviewers on 20% overlap. Metrics: agreement, precision/recall by failure type, false-negative cost.
- Optimize judge prompts with GEPA/DSPy only after you have labels. Use GEPA-style reflective prompt optimization or DSPy optimizers when you can score candidates against labels. Relevant tools: DSPy GEPA docs (https://dspy.ai), Agenta (https://github.com/Agenta-AI/agenta), and the GEPA paper/project. Caution: prompt optimization can overfit small labels; keep a holdout set.
- Use Pareto-style selection across failure cases. Do not pick only the average-best judge. Track which candidate catches which failure cluster, then merge/refine prompts to cover the frontier. Evaluation criterion: improved macro-F1 and lower miss rate on critical failures, not just accuracy.
- Deploy judges as monitoring signals, not absolute truth. In production, route low-confidence or high-impact judge decisions to human review. Use drift dashboards: judge score distribution, disagreement with user reports, and sample audits. Caution: a calibrated judge can still fail under distribution shift.
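The labeling setup in the bullets above (traces stratified by failure mode, a doubly reviewed overlap, and a holdout set) can be sketched in a few lines. This is a minimal illustration, not the workshop's code: the trace fields (`id`, `failure_mode`), the four failure-mode names, and the 20% holdout fraction are assumptions made up for the example.

```python
import random
from collections import defaultdict

def stratified_split(traces, key, holdout_frac=0.2, seed=0):
    """Split traces into (dev, holdout) so every failure mode appears in both."""
    rng = random.Random(seed)
    by_mode = defaultdict(list)
    for trace in traces:
        by_mode[trace[key]].append(trace)
    dev, holdout = [], []
    for mode in sorted(by_mode):
        group = by_mode[mode][:]
        rng.shuffle(group)
        # Hold out at least one trace per mode when the group has more than one.
        n_hold = max(1, round(len(group) * holdout_frac)) if len(group) > 1 else 0
        holdout.extend(group[:n_hold])
        dev.extend(group[n_hold:])
    return dev, holdout

def overlap_sample(traces, frac=0.2, seed=0):
    """Pick the subset that both reviewers label, used later for agreement checks."""
    rng = random.Random(seed)
    k = max(1, round(len(traces) * frac))
    return rng.sample(traces, k)

# Hypothetical toy data: 100 traces spread across four failure modes.
modes = ["none", "policy_violation", "wrong_tool", "missed_escalation"]
traces = [{"id": i, "failure_mode": modes[i % 4]} for i in range(100)]

dev, holdout = stratified_split(traces, key="failure_mode")
double_labeled = overlap_sample(traces)
print(len(dev), len(holdout), len(double_labeled))
```

Prompt optimization would then only ever see `dev`, with `holdout` reserved for the final check against overfitting.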
Core thesis
Generic “hallucination judge” prompts are weak; useful LLM judges must be calibrated against human-labeled, use-case-specific data and validated like any other model component.
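Validating a judge "like any other model component" comes down to comparing its verdicts with human labels. The sketch below computes precision, recall, and F1 for the fail class, Cohen's kappa for chance-corrected agreement, and the miss rate per failure type. The labels and failure-type names are invented for illustration, and the function names are not from any library.

```python
from collections import defaultdict

def binary_metrics(human, judge, positive="fail"):
    """Precision, recall, and F1 of the judge, treating human labels as ground truth."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

def miss_rate_by_type(human, judge, failure_types, positive="fail"):
    """Share of human-labeled failures that the judge passed, per failure type."""
    total, missed = defaultdict(int), defaultdict(int)
    for h, j, t in zip(human, judge, failure_types):
        if h == positive:
            total[t] += 1
            missed[t] += j != positive
    return {t: missed[t] / total[t] for t in total}

# Hypothetical labels for six traces.
human = ["fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["fail", "pass", "pass", "fail", "fail", "pass"]
types = ["policy", "tool", None, None, "policy", None]

metrics = binary_metrics(human, judge)
kappa = cohens_kappa(human, judge)
misses = miss_rate_by_type(human, judge, types)
print(metrics, round(kappa, 3), misses)
```

The same `cohens_kappa` function can be applied to the two reviewers' labels on the overlap set, which gives a rough ceiling for how well any judge can be expected to agree with a single annotator.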
Big ideas / key insights
- The valuable pattern is not “let the agent run longer”; it is to make the work inspectable, measurable, and interruptible.
- The transcript evidence points to concrete workflow design: artifacts, traces, evals, policies, or specs that survive a single chat context.
- The comment evidence is used as a sanity check: where practitioners push back, the verdicts below are deliberately more conservative.
- The strongest practical takeaway is to convert the creator’s idea into a small pilot with explicit success/failure criteria before standardizing it.
Best timestamped moments
- 0:31 — The opening critique: a generic hallucination prompt cannot know whether an app output hallucinated without task context.
- 1:31 — Fast eval loops are only valuable if the judge correlates with human annotation.
- 3:02 — The “data flywheel” claim: traces become new evals, which accelerate product iteration.
- 5:35 — The workshop uses τ-bench / customer-support traces and policy-adherence labels.
- 8:10 — Metric design must come from the use case and subject-matter experts.
- 15:28 — GEPA workflow: sample candidates, evaluate, filter, iterate.
- 18:01 — Pareto frontier selection avoids optimizing only average score.
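The difference between average-best and Pareto-style selection (the 18:01 moment) is easy to see on a toy score table. The sketch below treats each labeled trace as its own objective and keeps every candidate prompt that no other candidate dominates. It is a simplified illustration of the idea, not GEPA's exact selection procedure, and the candidate names and scores are invented.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores):
    """Return candidates whose per-trace scores are not dominated by any other candidate."""
    return [
        name
        for name, vec in scores.items()
        if not any(dominates(other, vec) for o, other in scores.items() if o != name)
    ]

def average_best(scores):
    """The naive choice: the single candidate with the highest mean score."""
    return max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))

# Hypothetical per-trace results (1 = judge agreed with the human label).
scores = {
    "seed": [1, 1, 0, 0, 1, 0],
    "v2":   [1, 1, 1, 0, 1, 0],
    "v3":   [0, 0, 0, 1, 0, 1],
    "v4":   [1, 0, 0, 0, 0, 0],
}

frontier = pareto_frontier(scores)
winner = average_best(scores)
print(frontier, winner)
```

Here `v3` has a low average but is the only candidate that handles traces 3 and 5, so it stays on the frontier as a seed for further mutation or merging, whereas average-only selection would discard it.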
Practical takeaways / recommended workflow
- Create the durable artifact first. Write the spec/rubric/policy/trace schema before letting agents perform expensive work.
- Run a constrained pilot. Pick one repository, one team, or one workflow; record baseline cost, latency, failure rate, and review time.
- Instrument the loop. Capture traces, commands, tool calls, test results, and human corrections so the workflow can be evaluated later.
- Add gates. Require acceptance tests, human approval for sensitive actions, and rollback paths before allowing broader automation.
- Review after 5-10 runs. Keep the practice only if it improves measurable outcomes, not just because the demo felt compelling.
Comment insights
Few comments, but they are positive: viewers call it one of the best lectures on LLM-as-judge and GEPA. There is little substantive pushback in comments, so external evidence matters more here.
Deep research
- LLM-as-a-Judge literature. Research such as Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” and later eval work supports the claim that LLM judges can correlate with humans but have biases (for example position and verbosity bias) and require validation.
- DSPy / GEPA. GEPA is a reflective prompt optimization method exposed in DSPy for improving prompts/programs using feedback. Source: https://dspy.ai
- τ-bench / Sierra. τ-bench is a benchmark for tool-agent behavior in realistic domains such as airline and retail; it supports the workshop’s customer-support policy setting.
- OpenTelemetry / Phoenix / Agenta docs. Trace-based evals are established observability practice for agent systems; Agenta and Arize Phoenix both support traces/evals.
Evidence quality note: research here uses named public documentation, standards, and widely known project sources where available. Some vendor claims are treated as product claims unless independently benchmarked in the user’s environment.
Verdicts
- Generic hallucination judges are insufficient: Agree / high confidence. Without reference context, policy, or labels, they produce weak signals.
- Calibrated LLM judges can accelerate eval loops: Agree / medium-high confidence. Supported when labels and holdouts exist; overclaimed if “human-quality” is assumed without calibration evidence.
- Automated data flywheels are the holy grail: Mixed / medium confidence. Directionally right, but automation must include sampling, human audits, and drift controls.
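The controls named in the last verdict (sampling, human audits, drift checks) can be sketched as a small routing function plus a fail-rate comparison. The field names (`confidence`, `high_impact`), the thresholds, and the three route labels are assumptions for illustration. How a confidence value is obtained from a judge is left open here.

```python
import random

def route(decision, confidence_floor=0.8, audit_rate=0.05, rng=random):
    """Decide whether a judge verdict is trusted, spot-checked, or sent to a human."""
    if decision["high_impact"] or decision["confidence"] < confidence_floor:
        return "human_review"
    # Random audits keep a steady trickle of trusted verdicts in front of reviewers.
    if rng.random() < audit_rate:
        return "audit"
    return "auto"

def fail_rate_shift(baseline, recent):
    """Difference in judge fail rate between a recent window and a baseline window."""
    def rate(labels):
        return sum(label == "fail" for label in labels) / len(labels)
    return rate(recent) - rate(baseline)

# Hypothetical verdicts and windows.
low_conf = route({"confidence": 0.55, "high_impact": False})
refund = route({"confidence": 0.99, "high_impact": True})
shift = fail_rate_shift(["pass"] * 9 + ["fail"], ["pass"] * 7 + ["fail"] * 3)
print(low_conf, refund, round(shift, 2))
```

A large shift in fail rate does not say whether the product or the judge changed, which is why the audited samples matter: they are what lets a team tell the two apart.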
Screen-level insights
Frames show slides with a weak hallucination-judge prompt, eval-loop diagrams, the τ-bench/customer-support dataset, annotation queues, and GEPA candidate/Pareto-frontier diagrams. The visual step matters because the talk is algorithmic: the diagrams clarify candidate mutation, filtering, and coverage better than the transcript alone.
Representative extracted frame anchors checked against transcript context:
- 0:31 — image youtube-extract/X4dEHRzBLmc/frames/000_000031.jpg; transcript context: agent is not working and you look at the traces, it’s not working. You look now under the hood about this hallucination LLM as a judge and you’ll find a prompt not very far from this one. You’ll be given an LLM output rate whether it’s in a hallucination. Make no mistakes. Now, obviously, how the hell would the agent know whether it’s a hallucination? If it
- 1:31 — image youtube-extract/X4dEHRzBLmc/frames/001_000091.jpg; transcript context: good agent or a good prompt the way you do is you try to experiment with a prompt then run your evals see if it improves things or not. If it does good if it does not you go back and you improve it a little bit. Prove the harness the prompt and do it again and again. And the speed in which you move to production or add features is actually the speed into
- 3:34 — image youtube-extract/X4dEHRzBLmc/frames/003_000214.jpg; transcript context: have a way to kind of add new evaluations quickly obviously automatic evaluations um from the traces from kind of the annotations and data you can go through this loop faster and faster to to the moment or to the point that you can think of it as an automatic loop, right? Because you can optimize the harnesses with optimization techniques like GEPA what we’re
- 4:34 — image youtube-extract/X4dEHRzBLmc/frames/004_000274.jpg; transcript context: experience is in machine learning. I have more than 15 years experience in that. In a previous life, I was in academia. uh worked on machine learning applied to computational biology, protein structure prediction. And right now we’re working a lot on these sampling and auto-optimization workflows. So uh if you’re interested in that, please reach out. We’d lov
- 5:35 — image youtube-extract/X4dEHRzBLmc/frames/005_000335.jpg; transcript context: and obviously then validating the results. All the code and the data used in this uh will be found in GitHub and you can find them in the links in this video and the last slide. So let’s start with the data set. We’re going to use τ-bench. τ-bench is a benchmark in a large data set built by Sierra uh customer support um scaleup I think and they have like mul
- 6:38 — image youtube-extract/X4dEHRzBLmc/frames/006_000398.jpg; transcript context: have is is the agent itself. about most most importantly um 599 conversation traces that are generated with annotations. uh now the format the original format of the annotations is like in the format of assertion but uh I pre-processed or by post-processed the data so that we have for each trace an annotation like a human annotation uh where it says for exam
- 9:14 — image youtube-extract/X4dEHRzBLmc/frames/008_000554.jpg; transcript context: going to share his uh blog and uh in the YouTube video and really describe this idea of error analysis very well but but I’m going to go over it very quickly and also the annotation workflow very quickly. So the idea is that you you provide your subject matter expert with all these traces of uh the trajectories of the conversation and they would annotate the
- 11:48 — image youtube-extract/X4dEHRzBLmc/frames/010_000708.jpg; transcript context: Agenta basically um uh you would take your traces create an annotation queue and and kind of specify for your annotator like uh the name of the um the feedback or the evaluator policy adherence here and then providing like they should provide each one with whether it adheres to the policy whether it does not and provide the reasoning and the reasoning here is
- 15:28 — image youtube-extract/X4dEHRzBLmc/frames/013_000928.jpg; transcript context: we’re going to go uh and look at each step. So it’s three steps basically you sample new candidate each times evaluate them see which one good work well and then do some filtering using this kind of Pareto frontier I’m going to talk about and then do it again and again. So let’s see how it works. So the way uh it works is first you start with a seed candidate
- 16:28 — image youtube-extract/X4dEHRzBLmc/frames/014_000988.jpg; transcript context: and the other one is merging multiple candidates. For prompt mutation which is what we’re going to use in the beginning since we have only one candidate. The idea is that you would run the first LLM as a judge here um whether the trajectory and uh if it fails uh it’s like this LLM as a judge will reflect and propose a new prompt. Basically there will be some
- 18:01 — image youtube-extract/X4dEHRzBLmc/frames/016_001081.jpg; transcript context: iteration which is the other uh innovation of this algorithm which is the idea of the Pareto frontier. Basically, the way we select which prompts or which candidates we’re going to use as a seed for the new iteration is not that we select the ones that have the average best score. Like that would be the trivial um solution, right? Look at my my prompt, see wh
- 19:02 — image youtube-extract/X4dEHRzBLmc/frames/017_001142.jpg; transcript context: from these and basically what you do is you try to select a set at the end of the day that covers your whole test case. So basically for for each test case there is at least one candidate that solves it and obviously you see there that the idea is that you get like a good Pareto frontier and then you start merging things and at the end of the day you have this
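The coverage idea in the 19:02 excerpt, choosing a set of candidates so that every test case is solved by at least one of them, can be approximated with a greedy set cover. This is a sketch under assumed data, not the talk's implementation: the candidate names and pass/fail results are invented, and a real GEPA run would go on to merge the selected prompts.

```python
def greedy_cover(results, threshold=1):
    """Pick candidates greedily until every solvable test case is covered.

    results maps candidate name -> list of per-case scores.
    Returns (chosen candidates, indices of cases that no candidate solves).
    """
    solved = {
        name: {i for i, score in enumerate(scores) if score >= threshold}
        for name, scores in results.items()
    }
    n_cases = len(next(iter(results.values())))
    uncovered = set().union(*solved.values())
    unsolved = set(range(n_cases)) - uncovered
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered cases.
        best = max(solved, key=lambda name: len(solved[name] & uncovered))
        chosen.append(best)
        uncovered -= solved[best]
    return chosen, sorted(unsolved)

# Hypothetical judge prompts scored on seven labeled traces.
results = {
    "baseline":     [1, 0, 0, 1, 0, 0, 0],
    "policy_focus": [1, 1, 1, 0, 0, 0, 0],
    "tool_focus":   [0, 0, 0, 1, 1, 1, 0],
    "terse":        [1, 0, 0, 0, 0, 0, 0],
}

chosen, unsolved = greedy_cover(results)
print(chosen, unsolved)
```

The `unsolved` list is as useful as the chosen set: traces that no candidate gets right are the natural input for the next round of reflection, or a hint that the human label itself deserves a second look.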
My read / why it matters
This video is useful if you convert it into an operating procedure rather than copying the headline. The durable lesson is about control surfaces for AI work: specs humans read, traces teams audit, evals that catch regressions, identity policies that revoke access, or graphs that preserve provenance. The risky version is adopting the slogan without the measurement and governance layer.
Verification notes
- Source/evidence audit: Checked the extracted transcript/comment packet and named external sources/docs relevant to the main claims. Vendor/tool links are identified as vendor/project sources, not neutral proof of effectiveness.
- Transcript/comment/frame fidelity audit: Timestamped moments and comment insights were kept close to extracted evidence in youtube-extract/X4dEHRzBLmc/ and the draft packet. Screen claims are limited to the extracted key-frame metadata and visible UI descriptions; for -QFHIoCo-Ko, no frame-derived claims are made because key frames were not extracted.
- Hallucination/overclaim audit: Headline claims were softened where evidence was insufficient. Verdicts explicitly mark mixed/low-confidence claims and separate practical heuristics from proven facts.
- Actionable Insights audit: The top section was checked for executable first steps, tools/commands or links where available, evaluation criteria, and cautions. Generic summary bullets were rewritten as workflow steps.
- Residual uncertainty: I did not have independent benchmark results for the specific demos, and several claims would need local measurement before adoption. Transcript extraction status was marked unknown by the extractor, so the analysis relies on the processor’s excerpted transcript evidence rather than a full raw transcript page.