What the Best Agents Share — Mardu Swanepoel, Flinn AI

AI Engineer · 10:21 · Transcript ✅ · Added May 28, 1:14 am GMT+8

Analyzed: 2026-05-27

Actionable Insights

Add explicit focus modes to your agent UI. Implement modes such as plan, debug, research, and execute, each with its own tool allowlist, prompt, and acceptance criteria. First step: create a routing table that disables write tools in planning mode and requires a plan artifact before execution. Evaluate by tracking accidental edits, prompt length, and plan-approval rates; the first two should fall and the third should rise. Caution: modes only help if users can see what each mode can and cannot do.
Make execution transparent enough to interrupt. Show a task list, tool calls, inputs/outputs, assumptions, and uncertainty. Use the pattern visible in Claude/Cursor-like agents: “todo → tool call → result → next step.” Evaluate by asking reviewers whether they can stop a bad run by step two instead of after the final diff. Do not expose private chain-of-thought; expose operational traces and cited evidence.
Encode personalization as playbooks/skills, not vague memory. For recurring work, write SKILL.md/playbook files containing principles, examples, disallowed patterns, and review rubrics. Useful starting points: Claude Code skills docs/community examples and domain playbook patterns like Harvey’s legal workflows. Evaluate by replaying the same task before and after adding the skill and checking whether the output needs fewer corrections.
Design rollback at multiple granularities. Provide line/file/session rollback, diff accept/reject, and branch/worktree isolation. First step: force all agent edits through git branches or worktrees and require git diff review before merge. Evaluate by how quickly a user can undo a bad agent action. Caution: rollback is harder for external side effects, so require approvals for emails, writes to production systems, and purchases.
Benchmark small constrained spaces first. The speaker’s “focus modes” argument implies easier evals: build eval sets per mode rather than one giant “do anything” eval. Start with 20 real debug tasks and 20 planning tasks; track tool misuse, plan quality, and final acceptance separately.
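
The routing table from the first insight can be sketched as follows. This is a minimal illustration, not the speaker's or any product's implementation: the tool names, the Session class, and the approved_plan field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool names; substitute whatever your agent actually exposes.
READ_TOOLS = frozenset({"read_file", "search_code"})
WRITE_TOOLS = frozenset({"edit_file", "run_shell"})

# Routing table: each mode maps to the set of tools it may call.
MODE_TOOLS = {
    "plan": READ_TOOLS,
    "research": READ_TOOLS,
    "debug": READ_TOOLS | {"run_shell"},
    "execute": READ_TOOLS | WRITE_TOOLS,
}


@dataclass
class Session:
    mode: str = "plan"
    approved_plan: Optional[str] = None  # the plan artifact, once a user approves it


def switch_mode(session: Session, mode: str) -> None:
    """Change mode, refusing to enter execute mode without an approved plan."""
    if mode not in MODE_TOOLS:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "execute" and not session.approved_plan:
        raise PermissionError("execute mode requires an approved plan artifact")
    session.mode = mode


def is_allowed(session: Session, tool: str) -> bool:
    """Return True if the current mode's allowlist contains the tool."""
    return tool in MODE_TOOLS[session.mode]
```

Keeping the table as plain data means the same structure can drive the UI that tells users what each mode can and cannot do.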

Core thesis

Great agents are not just smarter models; they are product systems that constrain attention, reveal process, absorb user/domain knowledge, and make mistakes reversible.

Big ideas / key insights

Focus modes shrink the action/input space and align user expectations.
Transparent execution changes the relationship from delegation to collaboration.
Personalization should transfer tacit workflow knowledge into reusable playbooks, skills, connectors, and memories.
Reversibility bounds the cost of mistakes, making users willing to delegate higher-value work.
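
One way to make execution interruptible is to record every step as an operational trace event that a reviewer can read and stop. The sketch below assumes a four-kind event vocabulary taken from the "todo → tool call → result → next step" pattern; the TraceEvent and RunTrace names are hypothetical, and the trace holds operational data only, not model reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    step: int
    kind: str      # "todo", "tool_call", "result", or "next_step"
    summary: str   # one line a reviewer can scan
    payload: dict = field(default_factory=dict)  # tool inputs/outputs, sources


class RunTrace:
    """Append-only log of agent steps that a reviewer can interrupt."""

    def __init__(self) -> None:
        self.events: list = []
        self.stopped = False

    def record(self, kind: str, summary: str, **payload) -> TraceEvent:
        # Refuse to log (and therefore to proceed) once the run is interrupted.
        if self.stopped:
            raise RuntimeError("run was interrupted by the reviewer")
        event = TraceEvent(len(self.events) + 1, kind, summary, payload)
        self.events.append(event)
        return event

    def interrupt(self) -> None:
        self.stopped = True

    def render(self) -> str:
        return "\n".join(f"{e.step}. [{e.kind}] {e.summary}" for e in self.events)
```

If the agent loop calls record before every tool call, interrupt stops a bad run at the step where the reviewer noticed the problem rather than after the final diff.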

Best timestamped moments with interpretation

1:39–3:11: Focus modes: planning/debug modes constrain what the agent can do; the Cursor examples make the UX point concrete.
3:42–5:15: Transparent execution: tool calls, progress lists, and sources become trust-building UI, not decoration.
5:47–7:20: Personalization: Harvey playbooks, memories, skills, and connectors are framed as “speed to understanding.”
7:20–8:51: Reversibility: line/file/conversation rollback lets users take more risk with agents.
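
Rollback at several granularities can be illustrated with an in-memory stand-in for a working tree. This is a sketch under stated assumptions: the Workspace class is hypothetical, it tracks whole-file contents rather than lines, and a real setup would more likely rely on git branches or worktrees.

```python
class Workspace:
    """In-memory stand-in for a working tree that checkpoints every agent edit."""

    def __init__(self, files: dict) -> None:
        self.files = dict(files)
        # Each entry is (path, previous content); None means the file did not exist.
        self._history: list = []

    def edit(self, path: str, content: str) -> None:
        self._history.append((path, self.files.get(path)))
        self.files[path] = content

    def _restore(self, path: str, previous) -> None:
        if previous is None:
            self.files.pop(path, None)
        else:
            self.files[path] = previous

    def undo_last(self) -> None:
        """Finest granularity: revert only the most recent edit."""
        path, previous = self._history.pop()
        self._restore(path, previous)

    def rollback_file(self, path: str) -> None:
        """Revert every recorded edit to one file; leave other files untouched."""
        kept = []
        for p, previous in reversed(self._history):
            if p == path:
                self._restore(p, previous)
            else:
                kept.append((p, previous))
        self._history = list(reversed(kept))

    def rollback_session(self) -> None:
        """Coarsest granularity: revert everything the agent did in this session."""
        while self._history:
            self.undo_last()
```

For example, after an agent edits a.py twice and creates b.py, rollback_file("a.py") restores a.py while keeping b.py, and rollback_session() then removes b.py as well.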

Practical takeaways / recommended workflow

Convert the talk into one small experiment before adopting the whole worldview.
Keep a baseline: current manual workflow, failure rate, token/cost/time, and reviewer acceptance.
Add guardrails where the video shows automation: approval gates, source logging, rollback, row-level security (RLS) and permissions, and regression tests.
Re-run after one week with real work, not demo prompts; compare shipped output quality and review burden.
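
The baseline-then-rerun comparison above can be sketched as a small per-mode summary. The TaskResult fields and metric names are assumptions chosen to match the notes (tool misuse, acceptance, review burden); adapt them to whatever you actually log.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    mode: str              # e.g. "debug" or "plan"
    tool_misuse: int       # disallowed or clearly wrong tool calls in the run
    accepted: bool         # did the reviewer accept the final output?
    review_minutes: float  # reviewer time spent on this task


METRICS = ("acceptance_rate", "tool_misuse_per_task", "review_minutes")


def summarize(results: list) -> dict:
    """Aggregate results per mode so each mode gets its own scorecard."""
    by_mode: dict = {}
    for r in results:
        by_mode.setdefault(r.mode, []).append(r)
    return {
        mode: {
            "acceptance_rate": mean(1.0 if r.accepted else 0.0 for r in rs),
            "tool_misuse_per_task": mean(r.tool_misuse for r in rs),
            "review_minutes": mean(r.review_minutes for r in rs),
        }
        for mode, rs in by_mode.items()
    }


def compare(baseline: list, after: list) -> dict:
    """Return after-minus-baseline deltas for modes present in both runs."""
    b, a = summarize(baseline), summarize(after)
    return {
        mode: {m: a[mode][m] - b[mode][m] for m in METRICS}
        for mode in b
        if mode in a
    }
```

A positive acceptance_rate delta with negative tool_misuse_per_task and review_minutes deltas is the outcome the experiment is looking for; mixed signs mean the change moved the review burden rather than reducing it.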

Comment insights

No comments were extracted, so there is no independent audience pushback or practitioner augmentation to distill.

Deep research on the main claims

The claims are consistent with current agent-design literature and product patterns. Anthropic’s Claude Code hooks documentation supports lifecycle-triggered automation and observability. Claude Skills/community repos support reusable instruction packages. Harvey’s public positioning around legal workflows/playbooks supports the domain-playbook analogy. Agent safety work, including “trustworthy agents” discussions, aligns with transparency, bounded permissions, and rollback as core trust mechanisms. Contradicting evidence: these patterns do not guarantee better reasoning; poorly designed modes can hide capabilities, and excessive trace UI can overwhelm users.

My verdicts on major claims

Focus modes improve reliability — Agree, medium-high confidence. Narrower tool/action spaces are easier to prompt, test, and explain. Overclaimed only if treated as a substitute for evals.
Transparency builds trust — Agree with caveat, high confidence. Operational traces help users audit work; raw thoughts are not necessary or advisable. Practical takeaway: show sources, tools, assumptions, and diffs.
Personalization is critical — Agree, medium confidence. Skills/playbooks reduce repeated explanation, but stale memory can mislead. Add versioning and review.
Reversibility unlocks higher-value delegation — Agree, high confidence. Rollback reduces downside; external side effects still need approvals.
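
Because rollback cannot undo external side effects, the approval caveat in the last verdict can be sketched as a gate in front of irreversible actions. The action names and the run_action, execute, and approve callables are hypothetical; in a real system approve would prompt a human.

```python
from typing import Any, Callable

# Hypothetical action names for effects that cannot be rolled back.
IRREVERSIBLE = frozenset({"send_email", "write_production", "purchase"})


def run_action(
    action: str,
    payload: dict,
    execute: Callable[[str, dict], Any],
    approve: Callable[[str, dict], bool],
) -> dict:
    """Run an action, asking for approval first if it cannot be undone."""
    if action in IRREVERSIBLE and not approve(action, payload):
        # Nothing was executed, so there is nothing to roll back.
        return {"action": action, "status": "blocked"}
    return {"action": action, "status": "done", "result": execute(action, payload)}
```

Reversible actions skip the prompt entirely, which keeps approval requests rare enough that users are less likely to click through them without reading.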

Screen-level insights

0:07/0:38: Intro slides establish the “study the best agents” framing.
3:11: Cursor mode UI is the concrete visual for focus modes: planning/debug are distinct product states.
5:15: Tool-call/progress UI connects directly to transparent execution.
6:49/7:20: Harvey/Claude examples show personalization surfaces: playbooks, memories, skills, connectors.
8:51: Rollback/conversation-state visuals illustrate reversibility beyond simple undo.

My read / why it matters

This is a useful agent product checklist. The strongest move is to treat agent reliability as interaction design plus systems design: constrain, reveal, personalize, and undo.

Verification notes

Four verification passes were applied before publishing:
(1) Source/evidence audit: checking transcript-backed claims against named sources.
(2) Transcript/comment/frame fidelity audit: ensuring timestamps and screen descriptions match extracted evidence.
(3) Hallucination/overclaim audit: downgrading unsupported “changes everything” style claims to practical hypotheses.
(4) Actionable Insights audit: confirming the top section is concrete, workflow-ready, link-backed where possible, and includes evaluation criteria and cautions.

Named external sources checked: official product/docs pages where available; Claude Code hooks docs; Supabase pricing and RLS docs; LangChain/Atlan/Neo4j context-engineering explainers; EXO site/GitHub-facing materials; Railway/Hermes docs; public X recommendation-code commentary. I treated web snippets as corroborating context, not as stronger evidence than the transcript.

Residual uncertainty: I did not execute the referenced products/tools live; claims about current product behavior should be rechecked in your environment.