How we Claude Code
Speaker: Arnaud Doko, Anthropic Applied AI / architect, presenting at Code w/ Claude.
Actionable Insights
Use an interview prompt before you let a coding agent run long.
The video’s first practical pattern is to stop opening with “make it better” and instead ask Claude Code to interview you about audience, constraints, open questions, and desired outcomes before implementation. In Claude Code, the speaker explicitly calls out using the `ask_user_question` flow/tool in the prompt so the agent turns latent requirements into a concrete spec. First step to try: for your next non-trivial feature, ask: “Interview me with concise multiple-choice/open questions until the product spec has enough detail to build safely; focus on audience, scope, data model, failure cases, and acceptance criteria.” Evaluate it by checking whether the resulting spec removes decisions you would otherwise discover mid-build. Caution: do not let this become endless discovery; cap it to 5–10 questions and require a final assumptions list.
Switch large plans/specs from Markdown to reviewable HTML artifacts when human review quality matters.
The workshop and the supporting Anthropic `cwc-workshops` repo describe phase 2 as “four divergent visual design directions” rendered as static HTML mockups for side-by-side comparison: https://github.com/anthropics/cwc-workshops/tree/main/how-we-claude-code. The claim is not that Markdown is obsolete; it is that long Markdown specs often stop being read. First step: ask the agent to create `docs/plan.html` or `artifacts/design-directions/index.html` with tabs, screenshots/placeholders, flow diagrams, open questions, acceptance criteria, and visual alternatives. Evaluate by asking two humans, or a separate reviewer agent with screenshots, to identify unclear requirements. Caution: HTML can hide complexity and cost more tokens; only use it for plans that genuinely need visual hierarchy, comparison, or stakeholder review.
Build a machine-readable runtime verification contract into UI components.
The strongest technical pattern is phase 3 of the repo: a Vite + React todo app where components emit `data-verify-*` attributes, declare fixtures/invariants, and expose a structured `window.__verify` API. Repo: https://github.com/anthropics/cwc-workshops/tree/main/how-we-claude-code/phase-3-verify. Try this shape in your own React app: add `data-verify-unit`, `data-verify-total`, `data-verify-state`, or domain-specific attributes to key components; define fixtures in `src/verify/specs/*.verify.ts`; implement verifiers for schema, invariants, DOM contract, and a11y. Run commands modeled on the workshop repo: `bun install`, `bun run dev`, `bun run verify`, `bun run typecheck`. Evaluate success by breaking a contract intentionally, e.g. removing `data-verify-total`, and confirming both the dashboard and CI fail with a useful diagnosis. Caution: treat these attributes as a public test contract; version them if downstream agents or tests depend on them.
Separate “tests” from “runtime verification” instead of pretending one replaces the other.
The speaker repeatedly distinguishes normal tests/typechecks from verification “at the surface”: run the app, drive it, and read what it actually shows. The repo README says tests and typechecks are CI’s job, while a verifier confirms the real artifact behaves. First step: add a `/verify` route or Storybook-like harness that mounts units in known states, then let Playwright or an agent inspect them. Evaluation criteria: each unit should have happy-path fixtures, at least one probe/adversarial fixture, and clear verdict states such as `PASS`, `FAIL`, `BLOCKED`, `SKIP`. Caution: runtime verification can itself become flaky; keep probes deterministic and distinguish “could not observe” from “observed and wrong.”
Use Playwright/MCP or browser automation to give the agent the same surface a human sees.
The screen demo shows Chrome DevTools, `/verify`, `/verify/replay`, console output, and local routes where an agent can inspect DOM contracts. Anthropic’s Claude Code docs also describe MCP as the way to connect Claude Code to external tools and data sources, with explicit security warnings: https://code.claude.com/docs/en/mcp. First step: connect a trusted browser/Playwright server, then ask the agent to navigate to `/verify`, run `window.__verify.manifest()`, pick fixtures, and confirm `window.__verify.current().verdict === 'PASS'`. Evaluate by comparing dashboard results, agent-observed results, and `bun run verify`. Caution: MCP servers and browser pages can expose you to prompt-injection risk; only connect trusted servers and do not blindly follow page instructions.
Record verification evidence for code review, not just green checks.
Around 21–29 minutes, the demo shows replaying verification steps and discussing downloadable clips / evidence bundles. This is useful when UI correctness is hard to communicate in a PR. First step: for frontend changes, capture a short Playwright trace/video or dashboard replay for every key fixture and attach it to the PR or internal artifact store. Evaluate by asking whether a reviewer can understand what was verified without rerunning the app. Caution: store only non-sensitive UI states; avoid leaking credentials, customer data, or proprietary workflows in recorded evidence.
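The verdict states named above (`PASS`, `FAIL`, `BLOCKED`, `SKIP`) only help if a probe that could not read the surface is reported differently from one that read a wrong value. A minimal TypeScript sketch of that split follows. The `Observation` shape and the `classify` helper are assumptions made for illustration, not the workshop repo's actual types.

```typescript
// Verdict names come from the notes; Observation and classify() are hypothetical.
type Verdict = "PASS" | "FAIL" | "BLOCKED" | "SKIP";

// What a probe managed to read from the running app.
type Observation<T> =
  | { kind: "observed"; value: T }
  | { kind: "unobservable"; reason: string } // e.g. element missing, route did not load
  | { kind: "skipped"; reason: string }; // e.g. fixture disabled on purpose

interface ProbeResult {
  verdict: Verdict;
  detail: string;
}

// Map an observation to a verdict, keeping "could not observe" (BLOCKED)
// separate from "observed and wrong" (FAIL).
function classify<T>(obs: Observation<T>, expected: T): ProbeResult {
  switch (obs.kind) {
    case "skipped":
      return { verdict: "SKIP", detail: obs.reason };
    case "unobservable":
      return { verdict: "BLOCKED", detail: obs.reason };
    case "observed":
      return obs.value === expected
        ? { verdict: "PASS", detail: "matched expected value" }
        : { verdict: "FAIL", detail: `expected ${String(expected)}, saw ${String(obs.value)}` };
  }
}
```

With this shape, a flaky selector produces `BLOCKED` and points at the harness, while a wrong count produces `FAIL` and points at the component.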
Core thesis
As coding agents become capable of longer-running, more complex work, teams need to change the artifacts around them: let the agent extract requirements interactively, use richer human-readable planning artifacts when Markdown becomes unreadable, and make verification native to the built artifact so both humans and agents can inspect the same runtime truth.
Big ideas / key insights
- Frontload ambiguity reduction. The speaker argues that long agent runs waste tokens and time when the task is underspecified. The remedy is an interview/spec step before implementation.
- Review artifacts need ergonomics. HTML plans can compress structure, flows, alternatives, and screenshots into something humans are more likely to review than a 200-line Markdown file.
- Verification should be observable at runtime. Components expose state through DOM contracts so agents do not need to infer behavior from React internals.
- Use one truth surface for human, agent, and CI. The demo’s dashboard, `window.__verify` API, and `bun run verify` all read the same fixture/verifier system.
- Probe the unhappy path. The failed `TodoStats/inconsistent-counts` example is deliberate: a good verification framework catches lies and edge cases, not just happy-path renders.
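The DOM-contract and probe ideas above can be sketched as a verifier that reads only emitted attributes. This is an illustration under assumptions: `data-verify-unit` and `data-verify-total` appear in the notes, while `data-verify-done`, `data-verify-remaining`, the `Attrs` map, and `verifyTodoStats` are hypothetical names, and the attributes are passed as a plain object so the example runs without a browser.

```typescript
// Attributes as an agent or test would read them off one element.
type Attrs = Record<string, string | undefined>;

interface Finding {
  check: "dom-contract" | "invariant";
  message: string;
}

// data-verify-done and data-verify-remaining are hypothetical additions.
const REQUIRED = [
  "data-verify-unit",
  "data-verify-total",
  "data-verify-done",
  "data-verify-remaining",
];

const isWholeNumber = (n: number): boolean => isFinite(n) && Math.floor(n) === n;

// Check that the contract attributes exist and that the counts agree.
function verifyTodoStats(attrs: Attrs): Finding[] {
  const findings: Finding[] = [];
  for (const name of REQUIRED) {
    if (attrs[name] === undefined) {
      findings.push({ check: "dom-contract", message: `missing ${name}` });
    }
  }
  if (findings.length > 0) return findings; // counts cannot be checked without the attributes

  const total = Number(attrs["data-verify-total"]);
  const done = Number(attrs["data-verify-done"]);
  const remaining = Number(attrs["data-verify-remaining"]);
  if (![total, done, remaining].every(isWholeNumber)) {
    findings.push({ check: "dom-contract", message: "counts must be whole numbers" });
  } else if (total !== done + remaining) {
    findings.push({
      check: "invariant",
      message: `total ${total} != done ${done} + remaining ${remaining}`,
    });
  }
  return findings;
}

// A happy-path fixture and an adversarial one modelled on the inconsistent-counts probe.
const consistent: Attrs = {
  "data-verify-unit": "TodoStats",
  "data-verify-total": "3",
  "data-verify-done": "1",
  "data-verify-remaining": "2",
};
const inconsistent: Attrs = { ...consistent, "data-verify-total": "5" };
```

Removing an attribute yields a `dom-contract` finding even though the visible UI may look unchanged, which mirrors the point the presenter makes when breaking the contract without breaking the app.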
Best timestamped moments with interpretation
- 1:22–2:24 — The workshop repo and “unreasonable effectiveness of HTML files” context are introduced. This grounds the talk in concrete materials rather than pure opinion.
- 2:24–3:25 — The main pressure: agents now run longer and can waste many tokens if they pursue the wrong goal. This justifies better specs and earlier verification.
- 3:56–5:26 — The speaker invokes Sutton’s “Bitter Lesson” to argue against over-constraining capable models. I read this as directionally useful but easy to overextend; product constraints and safety constraints still matter.
- 6:56–9:34 — “Make it better” is contrasted with prompting Claude to interview the user and ask targeted questions. This is the most immediately reusable prompting pattern.
- 10:35–12:09 — HTML design directions show why visual artifacts can be better feedback surfaces than plain text plans, especially for frontend work.
- 13:09–18:20 — The talk shifts from planning to verification, showing a todo app with Storybook-like fixtures, testing-library style interactions, DOM attributes, and browser inspection.
- 20:01–22:35 — The verification dashboard/replay exposes a deliberate failing fixture (`inconsistent-counts`), proving the framework can detect contradictions.
- 23:06–25:26 — The presenter breaks the DOM contract without breaking the app, illustrating why machine-readable contracts matter independently of visible UI behavior.
- 29:37–31:14 — The closing argues that embedded verification and richer artifacts reduce iteration even if they cost more tokens upfront.
Practical takeaways / recommended workflow
- Interview → spec. Ask Claude Code to interview you and produce a spec with assumptions, non-goals, acceptance criteria, and risk areas.
- Spec → HTML plan/design directions. For non-trivial UI/product work, generate multiple static HTML directions and review screenshots or live pages.
- Plan → verifiable architecture. Add stable DOM contracts (`data-verify-*`) to important components and define fixtures/invariants next to the feature.
- Verify three ways. Provide a human dashboard (`/verify`), an agent API (`window.__verify.*`), and a CI command (`bun run verify`).
- Add probes. Every component/feature should include at least one adversarial fixture that can fail for a meaningful reason.
- Record evidence. For UI-heavy PRs, attach replay/trace/video evidence so review is not just “tests passed.”
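The first step of the workflow above (interview, then spec) can be made repeatable with a small helper that caps the interview and refuses to start a build on an incomplete spec. This is only a sketch: `buildInterviewPrompt`, `Spec`, and `specGaps` are hypothetical names, and the prompt wording paraphrases the example prompt in these notes.

```typescript
// All names in this block are hypothetical; nothing here is a Claude Code API.
interface InterviewOptions {
  maxQuestions: number; // clamped to the 5-10 range suggested in the notes
  focus: string[];
}

// Build a bounded interview prompt to paste into the agent.
function buildInterviewPrompt(opts: InterviewOptions): string {
  const cap = Math.min(10, Math.max(5, Math.round(opts.maxQuestions)));
  return [
    `Interview me with at most ${cap} concise multiple-choice or open questions`,
    `before writing any code. Focus on: ${opts.focus.join(", ")}.`,
    "Finish with a spec that lists assumptions, non-goals, acceptance criteria, and risk areas.",
  ].join(" ");
}

interface Spec {
  assumptions: string[];
  nonGoals: string[];
  acceptanceCriteria: string[];
  riskAreas: string[];
  approvedByHuman: boolean;
}

// Return the reasons a spec is not ready; an empty list means the build can start.
function specGaps(spec: Spec): string[] {
  const gaps: string[] = [];
  if (spec.assumptions.length === 0) gaps.push("no assumptions listed");
  if (spec.nonGoals.length === 0) gaps.push("no non-goals listed");
  if (spec.acceptanceCriteria.length === 0) gaps.push("no acceptance criteria");
  if (!spec.approvedByHuman) gaps.push("assumptions not approved by a human");
  return gaps;
}
```

Treating human approval as a hard gate keeps the agent in the role of requirement extractor and leaves goals and constraints with the person who owns them.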
Comment insights
The comments are sparse and mostly lightweight praise. The most useful signal is the top comment asking Anthropic to share the repo “so we can all experiment,” which aligns with the video’s hands-on nature and validates that the reusable artifact is the workshop code, not just the talk. Another comment jokes that “Claude interviewing you is literally AI programming you,” which is a real caution: requirement extraction is helpful, but teams should keep the human in charge of goals and constraints rather than letting the agent steer product direction uncritically. There is no substantive technical pushback in the extracted comments.
Deep research
Claim 1: HTML can be a better planning/review artifact than long Markdown for agent-generated specs.
- Supporting evidence: The workshop repo’s `how-we-claude-code` README says phase 2 uses “four divergent visual design directions” rendered as static HTML mockups for side-by-side comparison, and phase 3 uses a runtime-verifiable React app (https://github.com/anthropics/cwc-workshops/tree/main/how-we-claude-code). A public summary of Thariq Shehzad’s “unreasonable effectiveness of HTML” argument quotes the rationale: long Markdown specs often are not read, while HTML gives tabs, illustrations, links, and visual structure (https://rogerwong.me/2026/05/what-humans-actually-read).
- Contradicting/cautionary evidence: The same summary acknowledges Markdown uses fewer tokens and is simpler/portable. HTML adds surface area: styling, broken links, hidden content, and possible mismatch between mockup and implementation.
- Assessment: Verified as a plausible workflow improvement for large visual/product specs, not a universal replacement for Markdown.
Claim 2: More capable models/agents should be constrained less and allowed to extract requirements.
- Supporting evidence: Richard Sutton’s “The Bitter Lesson” argues that general methods leveraging computation have historically beaten hand-coded human-knowledge approaches at scale (http://www.incompleteideas.net/IncIdeas/BitterLesson.html). Claude Code documentation also recommends broad-to-specific workflows and planning before editing for ordinary coding tasks (https://code.claude.com/docs/en/common-workflows).
- Contradicting/cautionary evidence: Sutton’s essay is about AI research methods over decades, not product requirements gathering. In software delivery, constraints around safety, compliance, budget, UX, and maintainability are not optional. Over-applying “do not constrain the model” can lead to scope creep, insecure implementations, or solutions that are impressive but misaligned.
- Assessment: The “let Claude interview you” part is strong; the broader Bitter Lesson analogy is useful but not a license to remove guardrails.
Claim 3: Runtime DOM contracts make UI verification more agent-native.
- Supporting evidence: The workshop README explicitly describes components emitting `data-verify-*` attributes, declaring fixtures/invariants, using verifiers for schema/invariants/DOM contract/a11y, and exposing `__verify.manifest()`, `__verify.current()`, and `__verify.runAll()` for machine consumers (https://github.com/anthropics/cwc-workshops/blob/main/how-we-claude-code/phase-3-verify/README.md). This supports the video’s claim that agents can read a stable surface rather than scraping arbitrary UI or React internals.
- Contradicting/cautionary evidence: This pattern can create a second contract to maintain. If `data-verify-*` attributes drift from real behavior, agents may verify the contract rather than the user experience. It also does not replace unit tests, integration tests, accessibility audits, or real user telemetry.
- Assessment: Strong for UI components and agent-driven checks, especially when paired with probes and CI. It is not a complete quality system by itself.
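To make the machine-facing surface in Claim 3 concrete, here is a rough sketch of an object exposing `manifest()`, `current()`, and `runAll()`. Only those three method names come from the README as quoted in these notes. The return shapes, the `Fixture` type, the `select` helper, and `createVerifyApi` are assumptions, and the real implementation may differ (for example, it may pick the current fixture from the route).

```typescript
// Method names follow the notes; every type and return shape is an assumption.
type Verdict = "PASS" | "FAIL" | "BLOCKED" | "SKIP";

interface Fixture {
  id: string; // e.g. "TodoStats/inconsistent-counts"
  run: () => Verdict; // mounts the unit in a known state and checks it
}

interface VerifyApi {
  manifest(): string[];
  current(): { id: string; verdict: Verdict } | null;
  runAll(): Record<string, Verdict>;
  select(id: string): void; // hypothetical helper for choosing the current fixture
}

// Build one API object over a single fixture list so every consumer reads the same data.
function createVerifyApi(fixtures: Fixture[]): VerifyApi {
  let selected: Fixture | null = null;
  return {
    manifest: () => fixtures.map((f) => f.id),
    select: (id) => {
      const match = fixtures.filter((f) => f.id === id)[0];
      if (!match) throw new Error(`unknown fixture: ${id}`);
      selected = match;
    },
    current: () => (selected ? { id: selected.id, verdict: selected.run() } : null),
    runAll: () => {
      const results: Record<string, Verdict> = {};
      for (const f of fixtures) results[f.id] = f.run();
      return results;
    },
  };
}
```

In a browser build the result would be assigned to `window.__verify`; a dashboard and a CI script could call the same `runAll()` so the three surfaces cannot drift apart.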
Claim 4: Browser/Playwright/MCP-driven verification is a practical way to align human and agent review.
- Supporting evidence: The video’s frames show `/verify`, `/verify/replay`, DevTools, and a failing fixture; the repo documents asking an AI agent or Playwright to open `/verify`, call `window.__verify.manifest()`, navigate to routes, and inspect `window.__verify.current()`. Anthropic’s MCP docs state Claude Code can connect to external tools and data sources via MCP, while warning users to connect only trusted servers and to account for prompt injection (https://code.claude.com/docs/en/mcp).
- Contradicting/cautionary evidence: Browser automation can be flaky; visual checks depend on environment, fonts, viewport, and data setup. MCP expands the trust boundary.
- Assessment: Practical and well-supported for controlled local verification; needs security hygiene and deterministic fixtures before production use.
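The evaluation suggested for this claim, comparing dashboard results, agent-observed results, and the `bun run verify` output, amounts to a three-way diff keyed by fixture id. A minimal sketch follows, assuming each source can be reduced to a map from fixture id to verdict string. `VerdictMap` and `findDisagreements` are hypothetical names, and collecting the three maps is left out.

```typescript
// Hypothetical shape: each source reports one verdict string per fixture id.
type VerdictMap = Record<string, string | undefined>;

interface Disagreement {
  fixture: string;
  dashboard: string;
  agent: string;
  ci: string;
}

// List fixtures where the dashboard, the agent's reading, and the CI run disagree.
// A fixture absent from a source is reported as "MISSING" so gaps stay visible.
function findDisagreements(dashboard: VerdictMap, agent: VerdictMap, ci: VerdictMap): Disagreement[] {
  const ids: string[] = [];
  for (const source of [dashboard, agent, ci]) {
    for (const id of Object.keys(source)) {
      if (ids.indexOf(id) === -1) ids.push(id);
    }
  }
  const out: Disagreement[] = [];
  for (const id of ids.sort()) {
    const d = dashboard[id] ?? "MISSING";
    const a = agent[id] ?? "MISSING";
    const c = ci[id] ?? "MISSING";
    if (d !== a || a !== c) out.push({ fixture: id, dashboard: d, agent: a, ci: c });
  }
  return out;
}
```

An empty result means all three surfaces agree, including on a deliberately failing probe; any entry is a sign that one surface is reading something the others are not.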
Verdicts on major claims
| Claim | Verdict | Confidence | What is overclaimed / underclaimed | Practical takeaway |
|---|---|---|---|---|
| HTML plans improve agent/human collaboration for large specs. | Agree, with scope. | Medium-high | Overclaimed if treated as “HTML always beats Markdown.” Underclaimed: HTML can also carry runnable demos and links, not just prettier plans. | Use HTML for long, visual, stakeholder-facing specs; keep Markdown for concise technical notes. |
| Claude should interview you because requirements are latent and hard to specify upfront. | Agree. | High | Overclaimed if the model becomes the product owner. | Use interview prompts, but require human approval of assumptions/non-goals before build. |
| The Bitter Lesson implies we should resist constraining stronger coding agents. | Mixed. | Medium | The analogy is stretched: compute-scaling lessons do not remove the need for domain constraints, safety, and acceptance criteria. | Prefer outcome constraints and verifiable criteria over micromanaging implementation steps. |
| Runtime DOM contracts make agent verification more reliable. | Agree. | High | Overclaimed if it is presented as replacing tests. | Add stable machine-readable contracts and adversarial probes alongside normal tests/typechecks. |
| Verification evidence/replays are worth recording. | Agree for UI-heavy or regulated work. | Medium | Under-discussed: privacy and artifact retention. | Record traces/videos for review, but scrub sensitive data and define retention. |
Screen-level insights
- 0:20 — Title slide / speaker intro. The frame shows “Code w/ Claude,” “How we Claude Code,” Arnaud Doko, and Anthropic branding. This matters because the claims are presented as Anthropic workshop practice rather than a third-party tutorial.
- 2:24 — “Agents now run for hours, not minutes.” The slide visually anchors the talk’s motivation: longer-running agents require different workflows. It connects directly to the transcript’s point about changing habits as model capability grows.
- 8:32 — Claude Code CLI setup. The screen shows terminal windows, model/session setup, and a realistic auth warning. The transcript discusses `/effort`, `/fast`, and auto mode; visually, this confirms the demo is happening inside a real Claude Code-like CLI environment rather than only slides.
- 9:34 — Interview/spec flow in terminal. The CLI shows audience questions and a bill-split app task list while Claude Code v2.1.141 / Opus 4.7 / auto mode is visible. This is the concrete evidence for the “let Claude interview you” workflow.
- 10:35 — HTML vs Markdown plan slide. The slide contrasts a linear Markdown plan with a structured HTML plan containing labels and flow. It matters because the visual comparison makes the ergonomic argument clearer than the transcript alone.
- 15:44 — GitHub workshop repo. The frame shows `anthropics/cwc-workshops/how-we-claude-code` with phases and `bun` commands. This is the strongest actionable screen: it identifies the repo and how to run phase 3.
- 17:47 — Chrome DevTools / DOM contract. The browser and Elements panel show the presenter inspecting emitted DOM attributes. This is the core technical mechanism: agents can read stable attributes instead of guessing from the rendered UI.
- 20:01 — Verification dashboard. The `/verify` page shows 20 passes and 1 failure across 21 fixtures. This demonstrates the dashboard as a human-readable rendering of the same verification matrix an agent can consume.
- 21:34 — Replay summary with failing `TodoStats/inconsistent-counts`. The red failure is not incidental; it proves the framework catches a deliberately inconsistent state. This is why probes matter.
- 25:26 — Code editor / `TodoApp.tsx`. The visible React code and project tree connect the dashboard to actual component implementation, showing that verification is embedded in the app architecture rather than an external screenshot-only process.
- 29:37 — Dashboard/console during wrap-up. The final visual reinforces live runtime observation: dashboard plus DevTools/console remain central to the workflow, not merely tests hidden in CI.
My read / why it matters
This is one of the more practical Claude Code workflow talks because it moves beyond “prompt better” into artifact design. The durable idea is: if agents are going to work longer, the surrounding files need to become more legible, reviewable, and verifiable. HTML planning artifacts help humans stay in the loop; DOM contracts and verifier dashboards help agents stay grounded in the real runtime surface.
The main caution is that this workflow can become theater if teams only generate beautiful plans and verification dashboards without enforcing them in CI or using adversarial probes. The useful version is boring and concrete: versioned contracts, deterministic fixtures, failing probes, bun run verify, typecheck, recorded evidence, and explicit human signoff on assumptions.
Verification notes
- Source/evidence audit: Checked the extracted transcript/comments, frame analysis, the public `anthropics/cwc-workshops` repo, the `how-we-claude-code` phase docs, Anthropic Claude Code docs for common workflows/MCP, Richard Sutton’s original “The Bitter Lesson,” and a public summary of Thariq Shehzad’s HTML-files argument. Strongest support: Anthropic repo and phase-3 README for the verification architecture; Sutton for the general compute-scaling analogy. Strongest cautions: Markdown’s simplicity/token efficiency, MCP trust/prompt-injection risk, and the danger of overextending the Bitter Lesson to product constraints.
- Transcript/comment/frame fidelity audit: Timestamped claims were tied to extracted transcript chunks and visual frame analysis. The comments section was kept short because extracted comments contained little technical substance beyond the repo-sharing request and a cautionary joke about AI interviewing users.
- Hallucination/overclaim audit: Removed/avoided unsupported claims about internal Anthropic retention practices, exact unreleased model capabilities beyond what appears in transcript/frame text, and any claim that HTML or DOM contracts replace tests. Verdicts explicitly separate agreement from scope limits.
- Actionable Insights audit: The top section was checked for concrete first steps, links, commands/tools, evaluation criteria, and cautions. Each item is tied to video evidence or external sources and is workflow-ready rather than a summary.
- Residual uncertainty: The transcript is machine-extracted and contains some “cloud/code” recognition noise; exact CLI names and model labels are based on visible frames plus transcript and may reflect workshop environment specifics. External source coverage is sufficient for the main claims but not a full empirical study of HTML-vs-Markdown review effectiveness.