Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

AI Engineer · 1h 15m · Transcript available · Added May 19, 12:40 am GMT+8

Actionable Insights

Implement the initializer + coding-agent harness before attempting multi-hour autonomous coding. Start with Anthropic’s public quickstart: https://github.com/anthropics/claude-quickstarts/tree/main/autonomous-coding and the Claude Agent SDK docs: https://code.claude.com/docs/en/agent-sdk/overview. Create an initializer prompt that writes feature_list.json, claude-progress.txt, and init.sh, then makes an initial git commit. Run each subsequent coding session under one rule: pick one failing feature, run the smoke test, implement, verify end-to-end, update only that feature’s passes status, commit, and append progress notes. Evaluate success by whether a fresh session can run pwd, read the progress file, run ./init.sh, inspect git history, and resume work without asking you what happened. Caution: do not use a vague markdown TODO list as the only state; the talk and the Anthropic blog both say agents are less likely to accidentally overwrite JSON status files than Markdown ones.
Use an adversarial evaluator instead of asking the builder to grade itself. For web apps, create a separate evaluator role with a harsh rubric and browser access through Playwright MCP, Puppeteer MCP, Claude for Chrome MCP, or your own Playwright tests. The evaluator should open the running app, click as a user, inspect console/network errors, capture screenshots, and return failures to the generator. First experiment: choose a tiny UI feature and require the generator and evaluator to write a done-contract.md before code changes begin: expected user action, visible result, edge cases, and exact tests. Success means the evaluator catches at least one issue that unit tests or code inspection would miss. Caution: an evaluator that reads the generator’s whole trace may inherit its rationalizations; the speakers recommend judging outputs and contracts more than sharing the builder’s full internal context.
Make “definition of done” a negotiated artifact. Before coding, have the generator propose tests and have the evaluator push back on weak scope, missing edge cases, and vague criteria. Store the result in files on disk, not only conversation context: done-contract.md, eval-rubric.json, and agent-run-log.jsonl. This mirrors the talk’s key innovation beyond a simple Ralph loop: the evaluator grades against a contract both sides agreed to, not the original fuzzy prompt. Use granular criteria because vague criteria produce vague critiques. Evaluate by checking whether each evaluator finding maps to a specific contract line and whether the generator can act on it without re-planning the whole app.
Treat browser-level E2E verification as mandatory for frontend/product agents. The talk’s strongest demo evidence is that apps looked done but failed when arrow keys, delete keys, physics loops, or route ordering were actually exercised. Add a smoke script that starts the app and performs one representative happy path before new work starts. Then add evaluator tasks that test keyboard controls, persistence, empty states, error states, and visual layout overlap. Tools to try: Playwright (https://playwright.dev/), Puppeteer (https://pptr.dev/), Claude Agent SDK tools, and browser MCP servers where available. A run passes only when the evaluator acts like a human user, not when the model says the code “looks right.” Caution: current browser automation/vision still misses some native modal and subtle visual issues, so keep human review for release gates.
Read traces as the main debugging loop, then update prompts/skills. The Q&A is blunt: “No, you got to read the whole thing.” Save raw agent transcripts and summarize them only after preserving enough evidence. Use grep or a separate analysis agent to find repeated failure points, but still inspect the critical spans manually. Convert learnings into CLAUDE.md, skill files, evaluator rubric changes, or prompt templates. Evaluation criteria: after a prompt/rubric edit, the next run should fail later, fail more specifically, or avoid the same mistake entirely. Caution: running more experiments without reading traces just compounds cost; the speakers treat trace reading like stack-trace debugging for agents.
Retune harness complexity as models change. Do not assume a harness pattern is permanently optimal. The talk says parts that were essential for Opus 4.5 became dead weight with later models; context resets, sprint decomposition, and evaluator cadence changed as model behavior improved. Keep a harness-decisions.md file with: model used, observed failure mode, scaffold added, metric improved, and removal condition. Re-test simplified variants whenever you upgrade models. This prevents cargo-culting stale scaffolding. The practical rule: the harness should fill current model gaps, not preserve yesterday’s workaround because it once helped.
Use this pattern first for greenfield or bounded brownfield tasks. The speakers are clear that the showcased generator/evaluator loop is especially strong for greenfield apps and expensive multi-hour demos; brownfield production systems need more project-specific tests, repo conventions, and human merge review. For existing products, start with low-risk workflows such as autonomous issue reproduction, test generation, PR review, UI regression triage, or docs updates. Use git worktrees so parallel agents do not trample one another. Pass/fail should be tied to existing CI, E2E tests, and reviewer acceptance, not only evaluator scores.
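
The "update only the passes status" rule from the first insight can be sketched in Python. This is a minimal illustration, not code from the talk or the quickstart: the id and description fields and all three helper names are assumptions made for the example; only the feature_list.json filename and the passes flag come from the notes above.

```python
import json
from pathlib import Path
from typing import Optional


def load_features(path: Path) -> list:
    """Read the feature list from disk (assumed to be a JSON array of objects)."""
    return json.loads(path.read_text(encoding="utf-8"))


def next_failing_feature(features: list) -> Optional[dict]:
    """Return the first feature whose 'passes' flag is false, or None if all pass."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def mark_passed(path: Path, feature_id: str) -> None:
    """Flip only the 'passes' flag of one feature; every other field is left untouched.

    The 'id' field is a hypothetical schema choice for this sketch.
    """
    features = load_features(path)
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
            break
    else:
        # Refuse to invent a feature the initializer never wrote down.
        raise KeyError(f"unknown feature id: {feature_id}")
    path.write_text(json.dumps(features, indent=2) + "\n", encoding="utf-8")
```

A coding session would call next_failing_feature at the start, do the work, and call mark_passed only after end-to-end verification succeeds. Raising on an unknown id is a deliberate choice here: it keeps the agent from silently growing or rewriting the list.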

Core thesis

Prabaker and Wilson argue that long-running agents do not succeed because of model intelligence alone. They need harnesses: persistent artifacts, clean state, scoped roles, explicit verification, and feedback loops that let agents work across hours or days without losing context, prematurely declaring victory, or mistaking surface-level output for working software.

The talk evolves from Anthropic’s earlier initializer/coding-agent pattern to a more adversarial planner/generator/evaluator pattern. The practical message is simple but deep: give models the same scaffolding effective engineers use—plans, checklists, git history, tests, contracts, and reviewers—then tune that scaffolding to the current model’s failure modes.

Big ideas / key insights

Agents lose the plot for three reasons: context limits/rot, poor planning, and weak self-verification.
Compaction is not coherence. Summaries can drift; durable files and git history are more reliable handoff surfaces.
Self-evaluation is a trap. A builder tends to rubber-stamp its own work; a separate harsh evaluator can be tuned more easily.
Definition-of-done contracts matter. The generator and evaluator should negotiate testable criteria before implementation.
Subjective quality can be graded if you write down taste. For frontend work, Anthropic used rubrics around design, originality, craft, and functionality.
Harnesses co-evolve with models. Better models let teams remove some scaffolding, but they do not eliminate the need for verification and state.
Trace reading is core agent engineering. The speakers treat raw traces as the best way to find where model judgment diverged from human intent.
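
To make "write down taste" concrete, here is a hedged sketch of a rubric grader. The four criterion names come from the notes above; the 1 to 5 scale, the equal weights, the pass threshold, and the per-criterion floor are all assumptions for illustration, not values from the talk.

```python
# Criterion names are from the talk notes; the weights are assumed.
RUBRIC_WEIGHTS = {
    "design": 0.25,
    "originality": 0.25,
    "craft": 0.25,
    "functionality": 0.25,
}


def grade(scores: dict, weights: dict = RUBRIC_WEIGHTS,
          threshold: float = 4.0, floor: int = 3) -> dict:
    """Combine per-criterion scores (assumed 1-5) into a verdict.

    A run passes only if the weighted mean reaches `threshold` AND no single
    criterion falls below `floor`, so polish cannot hide a broken feature.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"evaluator omitted criteria: {sorted(missing)}")
    weighted = sum(scores[name] * w for name, w in weights.items()) / sum(weights.values())
    weakest = min(weights, key=lambda name: scores[name])
    passed = weighted >= threshold and scores[weakest] >= floor
    return {"weighted": round(weighted, 2), "weakest": weakest, "passed": passed}
```

Returning the weakest criterion alongside the verdict gives the generator a specific place to start, which fits the point that granular criteria produce actionable critiques.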

Best timestamped moments with interpretation

2:45–3:47 — The problem statement: finite context, context rot, planning weakness, and models judging half-baked work as complete. This is the foundation for the whole talk.
4:51–5:21 — The Claude Agent SDK is framed as the harness around the model: tools, MCP, subagents, permissions, project context, sessions, skills, and slash commands.
12:29–14:01 — The earlier long-running agent harness is explained: initializer creates feature_list.json, progress file, git repo, init script; coding agents work one feature at a time and verify with Puppeteer.
18:36–20:40 — The generator/evaluator pattern is introduced as a GAN-like adversarial loop. The key insight: it is easier to tune a critic to be harsh than a builder to be self-critical.
21:10–22:45 — Design taste becomes a rubric. The talk makes a strong point that “subjective” work can still be evaluated if criteria are explicit.
25:20–26:20 — The generator and evaluator negotiate what done means before building. This is the most reusable architecture idea in the talk.
30:24–31:56 — The evaluator catches real app failures by using the app, not just reading code. This is the difference between demo polish and functional product.
33:26–34:27 — The primary debugging loop is reading traces and updating prompts, not blindly running more experiments.
39:03–39:34 — The five takeaways are explicit: avoid self-evaluation, beware lossy compaction, use structured handoffs, grade subjective quality, and read traces.
53:58–55:30 — For long-lived products, leave breadcrumbs: JSON logs of tries, findings, fixes, and current state, plus lightweight docs/file structure.
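
The breadcrumb idea in the last moment above could look roughly like the following append-only log. This is a sketch under assumptions: the talk names the categories (tries, findings, fixes, current state) but the exact field names, the timestamp field, and the one-JSON-object-per-line layout are choices made for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_breadcrumb(log_path: Path, *, tried: str, finding: str,
                      fix: str, state: str) -> dict:
    """Append one JSON object per line so earlier entries are never rewritten."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed field, not from the talk
        "tried": tried,
        "finding": finding,
        "fix": fix,
        "state": state,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def latest_state(log_path: Path) -> Optional[str]:
    """Return the 'state' of the most recent entry, or None for a missing or empty log."""
    if not log_path.exists():
        return None
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1])["state"] if lines else None
```

A fresh session can call latest_state to get its bearings before reading the full log. Appending rather than rewriting keeps the history intact even if a later session misjudges what matters.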

Practical takeaways / recommended workflow

Initialize the repo with durable state: feature_list.json, claude-progress.txt, init.sh, git baseline, and testing instructions.
Run coding agents in bounded increments: one feature, one verification loop, one clean commit.
Add a separate evaluator with browser automation and a harsh rubric.
Require generator/evaluator agreement on done-contract.md before implementation.
Save raw transcripts/traces and read failures manually.
Convert repeated failures into rubric edits, skills, or project instructions.
Keep harness complexity under review when models or tasks change.
For brownfield work, start with issue reproduction, PR review, or regression tests before trusting multi-hour autonomous changes.
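
One way to enforce the contract step in the workflow above is to check that every evaluator finding points at a line in done-contract.md. The sketch below assumes a contract format of "- [C1] criterion text" and a criterion key on each finding; both are hypothetical conventions for this example, not a format described in the talk.

```python
import re

# Assumed contract line format: "- [C1] Pressing Delete removes the selected item"
CRITERION_RE = re.compile(r"^- \[(C\d+)\]\s+(.+)$")


def parse_contract(text: str) -> dict:
    """Map criterion ids to their text; lines that do not match are ignored."""
    criteria = {}
    for line in text.splitlines():
        match = CRITERION_RE.match(line.strip())
        if match:
            criteria[match.group(1)] = match.group(2)
    return criteria


def split_findings(findings: list, criteria: dict) -> tuple:
    """Separate findings that cite a known criterion from those that do not."""
    mapped, unmapped = [], []
    for finding in findings:
        if finding.get("criterion") in criteria:
            mapped.append(finding)
        else:
            unmapped.append(finding)
    return mapped, unmapped
```

Unmapped findings are a signal in their own right: either the evaluator is critiquing outside the agreed scope, or the contract is missing a criterion and should be renegotiated before the generator acts.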

Comment insights

The comment section is small but revealing. One commenter reduces the lesson to “good engineering principles with strict rules and guidelines written in the format that the LLM is built on,” comparing agents/subagents/skills to data pipelines, modules, and APIs. That matches the talk’s best insight: agent reliability comes from engineering structure, not mystical autonomy. Another commenter asks “How can we use this,” which the video partially answers through tools but not with a full starter kit; the Actionable Insights section above fills that gap. The most important caveat is: “AI capability is not the same as AI deployment.” That is exactly right. A six-hour greenfield demo does not automatically become production deployment without security, cost controls, observability, CI, rollback, and human ownership.

Deep research

Sources checked:

Anthropic Engineering, “Effective harnesses for long-running agents” (Nov 26, 2025): describes the long-running agent problem, why compaction alone is insufficient, and the initializer/coding-agent solution with init.sh, claude-progress.txt, git history, feature_list.json, one-feature-at-a-time work, and end-to-end testing.
Anthropic Claude Agent SDK docs: confirms the SDK exposes Claude Code-style agent loops, tools for files/commands/edits, context management, and Python/TypeScript APIs; install commands include pip install claude-agent-sdk and npm install @anthropic-ai/claude-agent-sdk.
Anthropic claude-quickstarts autonomous-coding repo: public quickstart for the initializer + coding-agent pattern.
Playwright and Puppeteer documentation were used as named tool references for browser-level verification; the transcript mentions Playwright and Puppeteer directly.
AI Engineer Europe 2026 event pages/search results confirm the talk context, speakers, and conference framing.

Supporting evidence: The Anthropic blog directly supports the talk’s claims that agents fail by one-shotting too much, leaving half-implemented features, declaring done too early, and needing durable artifacts plus E2E testing. The SDK docs support that these primitives can be implemented programmatically. The transcript provides additional pattern detail: adversarial evaluator, contracts, rubrics, trace reading, and model/harness co-evolution.

Contradicting or limiting evidence: The talk is Anthropic’s own account, not an independent benchmark. Several claims about Opus 4.6, internal costs, and internal model behavior cannot be reproduced from the public sources checked above. The showcased pattern is expensive and mostly greenfield; the Q&A explicitly says brownfield production use needs more project-specific testing and control. Browser automation and model vision still miss some classes of bugs.

Verdict

Claim: Long-running agents need harnesses, not just bigger context. Agree, high confidence. Anthropic’s blog and the transcript both show why compaction/context windows alone do not solve planning and verification.
Claim: Initializer + persistent artifacts improve multi-session coding. Agree, high confidence. This is directly supported by Anthropic’s public engineering post and quickstart.
Claim: Separate adversarial evaluators outperform self-evaluation for complex app work. Agree, medium-high confidence. The transcript gives strong internal evidence and the reasoning is sound, but public independent benchmarks are limited.
Claim: Subjective quality can be graded with rubrics. Agree with caveats, medium confidence. Rubrics improve consistency, but they encode taste and can still miss user preference or accessibility issues.
Claim: The newest models make much of the older scaffolding dead weight. Mixed, medium confidence. The idea is plausible and supported by Anthropic’s internal observations, but exact model-specific claims are hard to independently verify and will vary by task.
Claim: This pattern is ready for production brownfield automation. Mixed to disagree if interpreted broadly. It is useful as a pattern, but the Q&A itself admits greenfield fit is stronger and brownfield requires CI, repo-specific rubrics, security, cost limits, and human review.

Screen-level insights

0:44 — Title slide: “How to Build Agents That Run for Hours (Without Losing the Plot)” with Ash Prabaker and Andrew Wilson, Applied AI, Anthropic. This establishes that the session is a technical harness-design talk, not a product launch.
1:44 — Agenda slide with “A year of…” and “Harness…” under AI Engineer Europe branding. The visual structure matches the transcript’s history-tour-then-state-of-art sequence.
2:45 — Slide “Three reasons agents lose the plot”: context, planning, verification. This is the clearest visual anchor for the failure taxonomy.
3:47 — Slide on “Can’t carry…” and finite windows/coherence degradation. It reinforces that context failure is concrete, not just a metaphor.
5:21 — Claude Agent SDK architecture diagram: model loop, tools, MCP, subagents, permissions, project context, sessions. This visual matters because the talk’s “harness” is not abstract; it is a container of specific primitives.
13:01 — “Effective harnesses for long-running agents” diagram shows persistent artifacts (feature_list.json, claude-progress.txt, git history) and the loop: get bearings, smoke test, pick failing feature, implement, E2E self-verify, leave clean state. This is the most directly reusable screen in the talk.
18:05–19:06 — Conference-stage frames while introducing the generator/evaluator idea. The screen evidence is less detailed here, but the transcript ties this segment to adversarial verification and Playwright-driven evaluation.
22:45 — Slide “The frontend loop in action” with a Dutch art museum example, Playwright navigation/scoring, and 5–15 iterations over about four hours. This visually demonstrates the evaluator loop producing higher-polish frontend output.
34:27 — Slide saying “Then Opus 4.6 shipped — and half of what we just showed you became dead weight.” It supports the harness co-evolution point: remove scaffolding when the model no longer needs it.
39:03–40:36 — Closing-stage frames/Q&A. The visible content is mostly conference branding, so the transcript carries the substance: use adversarial evaluators, structured handoffs, rubrics, and trace reading.
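
The "get bearings" step shown in the 13:01 diagram can be approximated with a small preflight check before a session starts. This is a sketch, not part of the quickstart: the function name is hypothetical, and treating a .git directory as evidence of git history is a simplifying assumption.

```python
from pathlib import Path

# Artifact names are from the talk notes; ".git" stands in for "git history exists".
REQUIRED_ARTIFACTS = ("feature_list.json", "claude-progress.txt", "init.sh", ".git")


def missing_artifacts(repo: Path) -> list:
    """Return the handoff artifacts a fresh session would expect but cannot find."""
    return [name for name in REQUIRED_ARTIFACTS if not (repo / name).exists()]
```

An empty result means a new session has the files it needs to resume without asking what happened. A non-empty result suggests rerunning the initializer rather than letting the coding agent improvise.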

My read / why it matters

This is one of the more useful agent-engineering talks because it avoids the fake magic of “just prompt harder.” The speakers describe agents as software systems that need state, contracts, tests, reviewers, logs, and iteration. That is exactly the right mental model for technical teams trying to move from impressive demos to reliable autonomous work.

The most important operational lesson is that you should design the harness around observed failures. If the model forgets, add durable state. If it rubber-stamps, add a harsh evaluator. If critiques are vague, add granular contracts. If a new model no longer needs a workaround, delete it. That loop is much more valuable than copying any single Anthropic prompt.
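
The "delete it when the model no longer needs it" loop is easier to follow if each scaffold is recorded with its reason and its exit condition. The sketch below models one harness-decisions.md entry using the five fields listed in the Actionable Insights; the class name, the helper, and the rule "re-test anything added under a different model" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessDecision:
    """One entry in harness-decisions.md (fields follow the notes above)."""
    model: str              # model the scaffold was added for
    failure_mode: str       # observed failure that motivated it
    scaffold: str           # what was added to the harness
    metric_improved: str    # what got better as a result
    removal_condition: str  # when the scaffold should be deleted


def due_for_retest(decisions: list, current_model: str) -> list:
    """Return scaffolds added under a different model than the one now in use.

    Assumed policy: a model upgrade makes every older scaffold a candidate
    for a simplified-variant experiment, not an automatic deletion.
    """
    return [d for d in decisions if d.model != current_model]
```

The removal_condition field matters most in practice: writing it down when the scaffold is added forces the team to state what evidence would justify deleting it later.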

Verification notes

Verification passes performed: (1) source/evidence audit against Anthropic’s public engineering post, Claude Agent SDK docs, the autonomous-coding quickstart, and named browser-testing tools; (2) transcript/comment/frame fidelity audit against 147 transcript chunks, six comments, and fourteen extracted keyframes with visual analysis; (3) hallucination/overclaim audit that softened unreproducible internal model/cost claims and separated public sources from Anthropic internal observations; and (4) Actionable Insights audit to ensure the top section contains concrete files, commands/tools, first experiments, evaluation criteria, links, and cautions. Corrections made: framed Opus/model-specific claims as internal observations rather than independent facts; added greenfield/brownfield deployment limits from the Q&A and comments. Residual uncertainty: exact performance/cost claims for unreleased or future-named model variants cannot be independently verified from public sources here, and the talk’s demos are not public benchmarks.