
Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind / Kaggle

AI Engineer | 20:02 | Transcript ✅ | Added May 26, 8:40 pm GMT+8

Source files: youtube-extract/Ubwb6NzegyA/Ubwb6NzegyA-extraction.md, youtube-extract/Ubwb6NzegyA/Ubwb6NzegyA-frames.json
Generated: 2026-05-26

Actionable Insights

  1. Create a quick “agent exam” before giving any personal or production agent real authority. The Kaggle team’s standardized-agent-exam idea is simple: paste a one-line prompt into your agent, let it take an exam, and compare the score (9:41–10:43). Recreate this locally with evals/agent_exam/ containing 20–50 tasks that cover inbox, browser, file, calendar, purchase, and refusal behavior. First step: write tasks as JSON/YAML with prompt, allowed_tools, expected_state, must_not_do, and grader. Evaluate by running the same agent version three times and checking pass rate variance. Caution: do not treat a public leaderboard as proof your private workflow is safe; include your own data boundaries and tool permissions.

  2. Publish benchmark configuration, not just scores. The talk’s strongest criticism is that benchmark results often hide orchestration details, model settings, compaction, and harness differences (3:04–3:35). For each internal eval, save a benchmark-card.md with model ID, date, tool list, system prompt hash, context-compaction strategy, retry policy, grader version, and sample traces. This makes results reproducible and prevents “optimized for my model” comparisons. Tools/sources to inspect: Kaggle competitions, OpenSpiel, Anthropic agent eval guidance, and LiveBench for anti-contamination/live-eval ideas.

  3. Use pairwise/PvP evals only when the win condition is natural. Kaggle’s Game Arena uses PvP games, Elo-like ratings, and Bradley–Terry pairing to avoid saturation (6:08–6:38, 14:16–15:18). This is powerful for games, debates, preference tasks, and simulations with clear winners. First experiment: pick one domain with an objective win/loss or preference judgment, run pairwise matches, then fit a simple Bradley–Terry/Elo ranking. Success criteria: rankings are stable under resampling and correlate with downstream task outcomes. Caution: PvP can measure harness/prompt behavior rather than the underlying model, and it can be expensive — the talk cites ~400k poker hands for statistical significance.

  4. Recruit domain experts to build long-tail benchmarks. The wastewater-treatment example (4:37–5:38) is the most important practical idea: many valuable AI failures occur in domains AI labs do not prioritize. Create a benchmark intake form for SMEs: “What safety incident matters?”, “What expert-only data exists?”, “What would a novice get wrong?”, “What action would be dangerous?” Then convert answers into eval tasks with expert grading. Evaluate success by whether the benchmark catches failures in models/agents that look strong on generic leaderboards.

  5. Separate model quality from harness quality. Michael notes a SWE-bench-style concern: frontier models may be within a few points while the harness can shift performance by much more (18:23–18:53). For coding/agent benchmarks, run an A/B matrix: same model across harnesses and same harness across models. Store traces and environment diffs. Practical criterion: do not announce “model X is best” unless the harness, tool budget, and prompt policy are fixed and disclosed.
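The task format and repeat-run check from item 1 can be sketched as follows. This is a minimal illustration, not the Kaggle exam: the field names come from the list in item 1, while `RunResult`, `GRADERS`, `grade`, `pass_rates`, and `summarize` are hypothetical names, and the `agent` callable stands in for whatever invokes your real agent.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable


@dataclass
class ExamTask:
    """One exam task; fields mirror the schema suggested in item 1."""
    prompt: str
    allowed_tools: list[str]
    expected_state: dict
    must_not_do: list[str] = field(default_factory=list)
    grader: str = "state_and_safety"  # key into GRADERS (hypothetical registry)


@dataclass
class RunResult:
    """What a hypothetical agent run reports back for grading."""
    final_state: dict
    tools_used: list[str]
    actions: list[str]


def state_and_safety(task: ExamTask, result: RunResult) -> bool:
    """Pass only if the end state matches, only allowed tools were used,
    and no forbidden action was taken."""
    state_ok = all(result.final_state.get(k) == v
                   for k, v in task.expected_state.items())
    tools_ok = set(result.tools_used) <= set(task.allowed_tools)
    safe = not (set(result.actions) & set(task.must_not_do))
    return state_ok and tools_ok and safe


GRADERS: dict[str, Callable[[ExamTask, RunResult], bool]] = {
    "state_and_safety": state_and_safety,
}


def grade(task: ExamTask, result: RunResult) -> bool:
    """Dispatch to the grader named on the task."""
    return GRADERS[task.grader](task, result)


def pass_rates(tasks: list[ExamTask],
               agent: Callable[[ExamTask], RunResult],
               runs: int = 3) -> list[float]:
    """Run the whole exam `runs` times and return the pass rate of each run."""
    if not tasks:
        raise ValueError("exam needs at least one task")
    rates = []
    for _ in range(runs):
        passed = sum(grade(task, agent(task)) for task in tasks)
        rates.append(passed / len(tasks))
    return rates


def summarize(rates: list[float]) -> tuple[float, float]:
    """Mean pass rate and population standard deviation across runs."""
    return mean(rates), pstdev(rates)
```

A large standard deviation across the three runs suggests the exam score is too noisy to compare agent versions; in that case add tasks or runs before drawing conclusions.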

Core thesis

AI evals are too scattered, stale, opaque, and lab-centric. Kaggle/DeepMind want more community-built, transparent, reusable, and scalable eval infrastructure: hackathons for benchmark creation, standardized agent exams for consumer agents, PvP game arenas for unsaturated leaderboards, and a benchmark platform for domain-specific tasks.

Big ideas / key insights

  • Static leaderboards decay quickly because models, harnesses, and benchmark papers move faster than the evaluation infrastructure (2:01–2:33).
  • Reproducibility requires benchmark setup details, not just final scores (3:04–3:35).
  • AI will affect domains that AI researchers do not understand deeply; therefore domain experts must be able to contribute benchmarks (4:07–5:38).
  • PvP/game arenas can avoid simple saturation, but they introduce high cost and statistical-significance challenges (12:13–15:18).
  • Agent benchmarks often test the harness as much as the model; this must be disclosed and controlled (18:23–18:53).
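The reproducibility point above (setup details, not just scores) can be made concrete with a small benchmark-card writer. This is a sketch under assumptions: the field list follows the one suggested in the Actionable Insights, every value shown is a placeholder, and `prompt_hash` and `render_card` are hypothetical helper names.

```python
import hashlib


def prompt_hash(system_prompt: str) -> str:
    """Short SHA-256 fingerprint so the prompt can be compared without being published."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def render_card(config: dict) -> str:
    """Render the run configuration as plain text suitable for benchmark-card.md."""
    lines = ["Benchmark card", ""]
    for key, value in config.items():
        shown = ", ".join(value) if isinstance(value, list) else str(value)
        lines.append(f"- {key}: {shown}")
    return "\n".join(lines) + "\n"


# All values below are placeholders, not settings from the talk.
card = render_card({
    "model_id": "example-model-v1",
    "date": "2026-05-26",
    "tools": ["browser", "file_read"],
    "system_prompt_hash": prompt_hash("You are a careful assistant."),
    "compaction_strategy": "summarize-oldest-turns",
    "retry_policy": "max 2 retries per tool call",
    "grader_version": "0.1.0",
    "sample_traces": "traces/run-001.jsonl",
})
```

Writing `card` to disk next to the scores keeps the configuration and the result together, which is what makes a later rerun comparable.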

Best timestamped moments with interpretation

  • 2:01–2:33 — The claim that 10+ benchmarks can appear daily and become stale quickly is a practical diagnosis: evaluation discovery and maintenance are themselves infrastructure problems.
  • 3:04–3:35 — The anecdote about compaction differences changing benchmark results shows why config transparency matters.
  • 4:37–5:38 — The wastewater benchmark story gives a strong argument for democratized domain-specific evals: experts can encode safety and tacit knowledge that generic labs miss.
  • 9:41–11:13 — Standardized agent exams target consumer agents that people may let into inboxes, shopping accounts, and other personal workflows without testing.
  • 14:16–15:18 — The Game Arena architecture — proxy, simulation platform, Bradley–Terry pairing, published traces/datasets, visualizer — is the most implementation-heavy part of the talk.
  • 18:23–18:53 — Harness-versus-model ambiguity is the key caveat for agent benchmarks.

  1. Build a small local agent exam before deployment.
  2. Write benchmark cards for every eval run.
  3. Publish traces where privacy allows; at minimum preserve them internally.
  4. Use domain experts to create long-tail tasks.
  5. Run enough trials to estimate variance; do not compare single noisy runs.
  6. A/B harnesses separately from models.
  7. Avoid leaderboards that hide prompts, compaction, tools, or retries.
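For the Bradley–Terry ranking mentioned at 14:16–15:18, the sketch below fits strengths from pairwise win counts. The talk does not describe how Game Arena fits or pairs models, so this is the textbook minorization-maximization iteration, not Kaggle's implementation; `fit_bradley_terry` and `win_probability` are hypothetical names, and draws are not modeled.

```python
from collections import defaultdict


def fit_bradley_terry(wins: dict[tuple[str, str], int],
                      iters: int = 200) -> dict[str, float]:
    """Estimate Bradley-Terry strengths from {(winner, loser): count}.

    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    then rescales so strengths sum to 1. A player with zero wins ends up
    with strength 0 (no finite maximum-likelihood estimate exists for it).
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins = defaultdict(int)
    games = defaultdict(int)  # games played per unordered pair
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[tuple(sorted((winner, loser)))] += count

    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            denom = 0.0
            for j in players:
                if i == j:
                    continue
                n_ij = games.get(tuple(sorted((i, j))), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


def win_probability(strength: dict[str, float], a: str, b: str) -> float:
    """Model-implied probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])
```

To check the "stable under resampling" criterion from the Actionable Insights, refit on bootstrap resamples of the match list and see whether the ordering changes.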

Comment insights

Only one comment was extracted: “When do you guys come to Paris?” It adds no technical critique, but it does suggest audience interest in seeing the Kaggle/DeepMind eval work presented in more locations. There were no extracted practitioner additions or substantive pushback in comments.

Deep research

  • Support: agent evals need traces, graders, and repeated trials. Anthropic’s agent eval writeup defines tasks, trials, graders/assertions, and transcripts/traces, and notes that multi-turn tool-using agents can have compounding failures. This supports the talk’s push toward richer benchmark infrastructure.
  • Support: live/anti-contamination benchmarks are a known response to stale leaderboards. LiveBench describes itself as a benchmark designed with test-set contamination and objective evaluation in mind. This aligns with the talk’s complaint about stale/static benchmark results.
  • Support: benchmark design quality is an active research topic. The arXiv paper “Establishing Best Practices for Building Rigorous Agentic Benchmarks” argues that many agent benchmarks depend on reward design and need stronger rigor; this supports the speakers’ caution about what is actually being measured.
  • Support: Kaggle/DeepMind AGI hackathon exists. Kaggle’s “Measuring Progress Toward AGI - Cognitive Abilities” page says participants should design high-quality benchmarks beyond recall and focus on faculties highlighted in Google DeepMind’s cognitive framework.
  • Contradicting/qualifying evidence: Community benchmarks can be noisy, gamed, or inconsistent. Expert adjudication is expensive and hard to align, as the speakers themselves concede (9:11). PvP avoids saturation only when the game remains relevant to the target capability; it can become a benchmark of strategy prompts or tool wrappers rather than real-world usefulness.

Verdicts on major claims

  • Claim: AI evals are scattered, stale, and opaque. Verdict: Agree, high confidence. The transcript provides concrete examples; external sources on live benchmarks and agent-eval rigor support the broader issue.
  • Claim: Democratized benchmark creation is necessary for long-tail domains. Verdict: Agree, medium-high confidence. The wastewater example is compelling and plausible, and domain-specific risk is underrepresented in general benchmarks. Caution: open contribution requires quality control, privacy handling, and expert review.
  • Claim: Standardized consumer-agent exams are useful. Verdict: Agree, medium confidence. A quick baseline is better than no testing before giving agents access to inboxes/accounts. Overclaim risk: a generic exam cannot certify safety for every personal workflow.
  • Claim: PvP/game arenas are evergreen and unsaturated. Verdict: Mixed, medium confidence. Pairwise competition reduces saturation for game-like tasks, but cost, model churn, prompt/harness confounds, and limited ecological validity remain.
  • Claim: Harness can matter more than model for coding/agent performance. Verdict: Agree, medium-high confidence. The transcript cites a March 16 Morph/LLM blog example; broader SWE-bench ecosystem experience also supports harness sensitivity. Practical takeaway: publish harness details and run model/harness matrices.
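The model/harness matrix in the last verdict can be summarized with a small helper that compares how much scores move when only the harness changes versus when only the model changes. This is an illustrative sketch: `matrix_effects` is a hypothetical name, it expects a complete matrix (every model run under every harness), and the spread measure (max minus min) is one simple choice among many.

```python
from statistics import mean


def spread(values: list[float]) -> float:
    """Range of a list of scores."""
    return max(values) - min(values)


def matrix_effects(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average score spread across harnesses (model fixed) and across models (harness fixed).

    `scores` maps (model, harness) to a pass rate; a missing cell raises KeyError.
    """
    models = sorted({m for m, _ in scores})
    harnesses = sorted({h for _, h in scores})
    harness_effect = mean(
        spread([scores[(m, h)] for h in harnesses]) for m in models
    )
    model_effect = mean(
        spread([scores[(m, h)] for m in models]) for h in harnesses
    )
    return {"model_effect": model_effect, "harness_effect": harness_effect}


# Placeholder numbers for illustration only; they are not results from the talk.
example = matrix_effects({
    ("model_a", "harness_1"): 0.60,
    ("model_a", "harness_2"): 0.72,
    ("model_b", "harness_1"): 0.62,
    ("model_b", "harness_2"): 0.75,
})
```

If `harness_effect` is clearly larger than `model_effect`, a "model X is best" claim mostly reflects the harness and should be reported that way.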

Screen-level insights

  • 0:00 — black opening frame. No content; it only marks video start.
  • 0:30 — title slide. Visible text says “Agentic evaluation at scale — for everybody,” with Kaggle/DeepMind branding. This confirms the talk’s public-good/community framing.
  • 11:13 — “Initial first impressions for SAE” slide (SAE presumably stands for the standardized agent exam). The slide contrasts lab/enterprise evals with many consumer agents lacking evals, and shows traffic/community screenshots. This visual supports the claim that there is user interest in agent exams.
  • 15:18 — leaderboard/cost slide. The slide title references “Cost per Turn / ELO” and lists challenges including expensive O(N²) matchups, 400k+ poker hands, community involvement, and old-model disappearance. This is direct visual evidence for the cost/statistical-significance caveat.
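The O(N²) matchup challenge on that slide follows from simple counting: a full round robin among N models needs N(N-1)/2 pairings. The sketch below makes the growth visible; `games_per_pair` and `cost_per_game` are hypothetical inputs, since the slide only gives the O(N²) framing and the 400k+ poker-hand figure.

```python
def round_robin_pairs(n_models: int) -> int:
    """Number of distinct pairings when every model plays every other model."""
    return n_models * (n_models - 1) // 2


def arena_cost(n_models: int, games_per_pair: int, cost_per_game: float) -> float:
    """Total cost of a full round robin under assumed per-pair volume and per-game cost."""
    return round_robin_pairs(n_models) * games_per_pair * cost_per_game


# Doubling the roster roughly quadruples the number of pairings.
pairs_10 = round_robin_pairs(10)
pairs_20 = round_robin_pairs(20)
```

Adding one model to a roster of N adds N new pairings, so each new entrant costs more to rank than the one before it.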

The frame analysis matters because several important claims are encoded in slides rather than fully detailed in the transcript: the consumer-agent gap, the leaderboard/cost framing, and the relationship between ranking and cost.

My read / why it matters

This is one of the more practically important eval talks because it connects four usually separate worlds: academic benchmarks, production evals, consumer agents, and domain-expert knowledge. The strongest idea is not “Kaggle will solve evals”; it is “make evaluation contribution and replay easier for people outside AI labs.” The main caution is that democratization without methodology can create more noisy leaderboards — benchmark cards, traces, and expert review are non-negotiable.

Verification notes

I checked transcript chunks, comments, frames JSON, and visual frame analysis. External research included Anthropic’s agent-eval framework, LiveBench, Kaggle’s AGI benchmark competition page, OpenSpiel, and an arXiv paper on rigorous agentic benchmarks. Corrections made: softened “evergreen” PvP claims, added harness-confound cautions, and made Actionable Insights directly executable with files, schemas, tools, and evaluation criteria. Residual uncertainty: Kaggle pages were partly blocked by JavaScript rendering and reCAPTCHA when fetched, so the Kaggle source is cited at page-title/snippet level rather than from fully extracted page text. Comment evidence was essentially non-technical.