Everything We Got Wrong About Research-Plan-Implement — Dexter Horthy

MLOps.community, 26:46

Actionable Insights

Split coding-agent work into smaller deterministic stages instead of one mega-prompt. Replace a single “research-plan-implement” instruction with one artifact per stage: questions.md, research.md, design.md, outline.md, plan.md, and finally a PR. Horthy’s revised flow is questions → research → design → structure outline → plan → worktree → implement → PR; the research he cites says instruction following degrades as instruction count and context length grow. First experiment: take one brownfield feature and force the agent to produce only objective research before any design. Success looks like fewer skipped steps and fewer surprise decisions in code review.
Hide the ticket from the research context to reduce premature opinions. Horthy’s concrete fix is to use one context window to generate research questions, then a fresh context with no “what we are building” ticket to gather facts about the codebase. Prompt shape: “Trace how endpoints work; trace all spline-related workers; list current patterns and invariants; do not propose implementation.” Evaluate by checking that the research doc contains facts, file paths, and current behavior rather than recommendations. Caution: humans still choose the questions, and bad questions produce blind research.
Use a short design artifact for human-agent alignment before code. Create design.md with current state, desired end state, patterns to follow/avoid, resolved decisions, open questions, and explicit owner choices. Horthy compares this to doing “brain surgery on the agent” before it writes 2,000 lines. The review target should be ~200 lines, not a 1,000-line plan; success is catching wrong patterns before implementation, especially in unfamiliar repos.
Prefer vertical implementation outlines over horizontal layer-by-layer plans. Ask the agent to break work into testable vertical slices: mock endpoint → UI path → service integration → data migration → final cleanup, rather than all database then all services then all frontend. Put checkpoints after each 200–400-line block for sensitive work. Evaluate by whether each phase can be run or reviewed independently; if the outline cannot say how to test a phase, it is probably too horizontal.
Read the code for production systems; use AI for leverage elsewhere. Horthy explicitly reverses prior “don’t read the code” advice. For paid, regulated, or paged systems, review the actual diff, not just the plan. Save time by reviewing design/outline earlier, using focused tests and static checks, and sending short design artifacts to code owners before implementation. Practical metric: target 2–3x throughput with near-human quality, not 10x output followed by six months of cleanup.
Control context and instruction budget aggressively. Reduce always-on MCPs/tools, keep project instructions short, and externalize important state into repo artifacts instead of relying on auto-compaction. The talk’s “dumb zone” heuristic is not a law, but the operating rule is useful: smaller context windows, fewer instructions, and simpler tasks improve adherence. Evaluate by tracking failures where the agent skipped workflow requirements; remove or stage instructions that are not relevant to the current phase.
Use the public references, but treat them as evolving practice, not doctrine. Relevant starting points include HumanLayer’s GitHub repo, its .claude/commands/create_plan.md, Horthy’s 12-factor agents material where it is available from the HumanLayer ecosystem, and research on long-context instruction following such as “Large Language Models Cannot Follow Multiple Instructions at Once” and “Improving Long Context Instruction Following.” Fork the workflow, run it on a real feature, and measure rework and code-review defects rather than adopting it because the slides sound plausible.
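The staged flow and the hidden-ticket research step from the first two insights can be sketched as a small driver. This is a minimal sketch, not HumanLayer's implementation: run_agent is a hypothetical callable standing in for whatever starts a fresh agent context, the instruction strings are illustrative, and the choice of which artifacts feed which stage is an assumption. Only the rule that the research stage does not see the ticket comes from the talk. The worktree, implement, and PR steps are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Stage:
    output: str               # artifact this stage writes
    inputs: Tuple[str, ...]   # the only artifacts this stage is allowed to read
    instruction: str          # short, stage-specific prompt (illustrative wording)


# Stage order follows the revised flow; the input wiring is this sketch's assumption.
STAGES = (
    Stage("questions.md", ("ticket.md",),
          "List the codebase questions that must be answered for this ticket."),
    # No ticket.md here: research runs in a context that does not know the goal.
    Stage("research.md", ("questions.md",),
          "Answer each question with facts and file paths. Do not propose implementation."),
    Stage("design.md", ("ticket.md", "research.md"),
          "Describe current state, desired end state, patterns, and open questions."),
    Stage("outline.md", ("design.md",),
          "Break the work into vertical phases, each with a test checkpoint."),
    Stage("plan.md", ("design.md", "outline.md"),
          "Expand the outline into a step-by-step plan."),
)


def run_pipeline(ticket: str,
                 run_agent: Callable[[str, Dict[str, str]], str]) -> Dict[str, str]:
    """Run each stage in a fresh context that contains only its declared inputs."""
    artifacts: Dict[str, str] = {"ticket.md": ticket}
    for stage in STAGES:
        context = {name: artifacts[name] for name in stage.inputs}
        artifacts[stage.output] = run_agent(stage.instruction, context)
    return artifacts
```

Because control flow lives in ordinary code, a skipped stage is a bug in the driver rather than something the model forgot, and a human can review or edit any artifact between stages.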

Core thesis

Research-Plan-Implement was useful but too monolithic and too dependent on expert prompting. The revised thesis is: do not outsource engineering thinking, do not trust huge plans as the review target, keep research factual, align on design and structure in short artifacts, then read and own the code.

Big ideas / key insights

“No magic prompt” remains true; workflows should not require hidden magic words.
Good research is factual. If the research context knows the implementation goal too early, it starts giving opinions.
Instruction budget matters: long prompts plus tools plus MCP plus repo instructions make skipped steps more likely.
Plans are often as long as the code and can diverge from implementation, so deep-reviewing plans is low leverage.
Design discussions and structure outlines are better review surfaces because they are shorter and catch wrong assumptions earlier.
Vertical plans create testable checkpoints; horizontal plans accumulate untested cross-layer changes.
Reading production code remains non-negotiable when users, money, regulation, or paging are involved.
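The vertical-plan idea above can be checked mechanically before a human reads the outline. The sketch below is an assumption-laden illustration: the Phase fields and the review_outline helper are hypothetical names, and the default limit of 400 lines is borrowed from the 200–400-line checkpoint guidance in the actionable insights rather than from any tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    name: str
    verify: str      # how this phase is tested; empty if the outline does not say
    est_lines: int   # rough size of the change before the next checkpoint


def review_outline(phases: List[Phase], max_lines: int = 400) -> List[str]:
    """Return findings for phases that cannot be tested or are too large to check."""
    findings: List[str] = []
    for phase in phases:
        if not phase.verify.strip():
            findings.append(f"{phase.name}: no way to test this phase is stated")
        if phase.est_lines > max_lines:
            findings.append(
                f"{phase.name}: about {phase.est_lines} lines before a checkpoint "
                f"(limit {max_lines})"
            )
    return findings
```

An empty result does not prove the outline is good; it only means every phase names a test and stays within the chosen size limit.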

Best timestamped moments

2:06–2:36 — The reversal. Horthy lists “stuff we got wrong”: not reading code, reading long plans, and allowing “a little slop.” This frames the talk as a correction, not a launch of a perfect method.
3:06–7:10 — Magic words were a product failure. The prompt “work back and forth with me starting with your open questions and outline before writing the plan” made the agent ask questions. Horthy correctly says users should not need workshop lore for the tool to work.
7:41–8:12 — Instruction budget. The talk cites research that frontier models struggle as instruction counts rise; this supports splitting workflows into smaller prompts.
8:12–10:13 — Read the code. Reviewing 1,000-line plans is not leverage if the code differs anyway. For production systems, own the diff.
11:14–13:48 — Better research and context engineering. The ticket is hidden from research execution; tasks are split across smaller contexts.
14:49–17:21 — Design and outline as review surfaces. The design doc captures current/desired state and patterns; the outline resembles a C header, exposing shape without all implementation details.
17:51–18:51 — Vertical plans. This is the most operational implementation advice: build and verify slices instead of layers.
23:25–26:26 — Q&A nuance. Horthy admits code review is not perfectly scalable but argues 2–3x with quality beats 10x slop; he also treats the “dumb zone” as a heuristic that varies with task and model.
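The C-header comparison from the 14:49–17:21 segment can be made concrete with types and signatures that have no bodies. This is only an illustration of the idea: the spline export feature and every name below are hypothetical, and a real outline.md would describe such shapes in prose or in the repo's own language.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SplineExportRequest:
    spline_id: str
    fmt: str


@dataclass(frozen=True)
class SplineExportResult:
    url: str


@runtime_checkable
class SplineExporter(Protocol):
    """Shape only: reviewers agree on this signature before any body is written."""

    def export(self, request: SplineExportRequest) -> SplineExportResult:
        ...


class MockSplineExporter:
    """First vertical slice: a mocked exporter that satisfies the agreed shape."""

    def export(self, request: SplineExportRequest) -> SplineExportResult:
        return SplineExportResult(url=f"mock://{request.spline_id}.{request.fmt}")
```

Reviewing a handful of signatures like these is much cheaper than reviewing a full plan, and the mock gives the first slice something runnable to test against.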

Practical takeaways / recommended workflow

Start with a human-owned ticket and generate research questions.
Run objective codebase research in a fresh context that does not know the intended implementation.
Write design.md: current state, desired state, constraints, patterns, and open questions.
Review design.md with the human and, where useful, the code owner.
Write outline.md: vertical phases, test checkpoints, key types/interfaces, and risk points.
Spot-check plan.md only after the design and outline are aligned.
Implement one vertical slice at a time in a worktree.
Run focused tests/static checks after each slice.
Read the final diff before merge, especially in production systems.
Track rework, review defects, incident risk, and time-to-merge; optimize for quality-adjusted speed.
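The last step, tracking rework, review defects, and time-to-merge, can start as a small comparison between a baseline set of merged changes and an agent-assisted set. The notes do not define "quality-adjusted speed", so the two ratios below are an assumption of this sketch, and the record fields are hypothetical names for whatever a team already collects.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MergedChange:
    hours_to_merge: float   # assumed to be greater than zero
    review_defects: int     # issues raised in code review
    rework_commits: int     # follow-up commits needed after review


def summarize(changes: List[MergedChange]) -> Dict[str, float]:
    """Average the tracked metrics; requires a non-empty list."""
    return {
        "mean_hours_to_merge": mean(c.hours_to_merge for c in changes),
        "defects_per_change": mean(c.review_defects for c in changes),
        "rework_per_change": mean(c.rework_commits for c in changes),
    }


def compare(baseline: List[MergedChange],
            assisted: List[MergedChange]) -> Dict[str, Optional[float]]:
    """Speedup above 1 means faster merges; defect_ratio above 1 means more defects."""
    b, a = summarize(baseline), summarize(assisted)
    defect_ratio = (
        a["defects_per_change"] / b["defects_per_change"]
        if b["defects_per_change"] else None  # undefined when the baseline has no defects
    )
    return {
        "speedup": b["mean_hours_to_merge"] / a["mean_hours_to_merge"],
        "defect_ratio": defect_ratio,
    }
```

Reading the two numbers together is the point: a large speedup paired with a defect ratio well above 1 is the "10x slop" outcome the talk warns against.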

Comment insights

The comments are unusually substantive and skeptical. The most-liked comment reframes the whole space: agentic coding should make humans decide the important things instead of automating them out of the loop. Many commenters criticized the whiplash from the earlier “don’t read the code” messaging to “please read the code,” with some calling it honest scientific iteration and others calling it grift. A long practitioner comment argues the deeper fix may be new LLM-native software paradigms: denser semantics, stronger types, verifiable abstractions, functional/declarative languages, and deterministic fitness functions. Other practical comments recommend directive coding, keeping plans and specs in the repo, reading both plans and code when stakes are high, optimizing boot-up context, and using hooks and structure to take pressure off the model.

Deep research on the main claims

Claim: LLMs struggle with too many instructions and long context. Support: search results surfaced “Large Language Models Cannot Follow Multiple Instructions at Once” on OpenReview, which reports models struggle to follow all instructions as instruction count increases, and “Improving Long Context Instruction Following” in ACL Anthology, which frames degradation in long-context instruction following as a critical challenge. Scale AI has also written about long-context instruction-following degradation. Contradiction/caution: the exact 150–200 instruction number from the talk was not independently verified from the full cited paper here, and newer models may move the threshold.
Claim: Splitting prompts/workflows improves reliability. Support: this is consistent with general workflow engineering: deterministic control flow beats asking a model to internally manage many phases. Horthy’s own 12-factor-agent framing emphasizes smaller tasks, clearer context, and explicit control flow. Contradiction/caution: more stages can add process overhead and new artifacts that teams may ignore or let drift.
Claim: Reviewing long plans is lower leverage than reading code. Support: the transcript gives a direct operational reason: plans can be about as long as the code and implementation can differ from the plan. This aligns with standard engineering practice: the code is what runs. Contradiction/caution: several commenters argue plans/specs still deserve review because they are contracts and future artifacts; the better verdict is not “never read plans,” but “deep-review shorter design/outline artifacts and always read production diffs.”
Claim: Vertical plans are better than horizontal plans. Support: incremental vertical slicing is a long-standing delivery practice because it creates testable checkpoints and reduces late integration risk. The talk gives a concrete contrast: all-db/all-service/all-frontend plans create a large unverified block; vertical slices can be tested every few hundred lines. Contradiction/caution: migrations, platform changes, or pure infrastructure work sometimes need horizontal sequencing, but they still need checkpoints.
Claim: AI coding can deliver 2–3x productivity with quality if humans remain in the loop. Support: the transcript presents this as a pragmatic target rather than a measured result in this packet. Broader software productivity research remains hard; Horthy himself says measuring developer productivity is unresolved. Contradiction/caution: without controlled metrics, “2–3x” should be treated as an aspiration or team-specific observation, not a general fact.

Verdicts on major claims

“Do not outsource the thinking” — agree, high confidence. Transcript, comments, and basic risk analysis all support this. Practical takeaway: agents can compress exploration and drafting, but humans own requirements, tradeoffs, and acceptance.
Objective research should be separated from implementation intent — agree, medium-high confidence. The fresh-context design is plausible and operationally testable. What is underclaimed: humans still need strong question formulation and may need multiple research passes.
Instruction budget is a real constraint — agree, medium confidence. External research supports instruction/context degradation generally; the exact numeric threshold is uncertain. Practical takeaway: stage instructions and remove irrelevant tools rather than arguing about exact token percentages.
Do not read long plans; read code — mixed, high confidence on reading code. The code-reading part is correct for production systems. The overclaim would be treating plans as useless; shorter design/outline docs and repo-stored specs are still valuable.
Vertical implementation plans reduce rework — agree, high confidence. This is established delivery wisdom and maps well to agent failure modes. Practical takeaway: require testable increments unless there is a clear reason not to.
2–3x with near-human quality is the right target — mixed, low-medium confidence. It is a healthier target than 10x slop, but not proven by the packet. Treat it as a measurement hypothesis.

Screen-level insights

2:06 frame — “Some stuff we got wrong.” The slide visibly lists prior bad advice: it is okay not to read code, you should read long plan files, and Claude can have some slop. This matters because the talk is self-corrective and frames the rest as lessons from failures.
2:36 frame — “no slop” section. The presenter emphasizes that production code used by real users must be owned by engineers. The visual supports the transcript’s pivot from automation enthusiasm to accountability.
3:06 frame — RPI prompt usage. The speaker polls the audience on who has run the “research codebase” and “create plan” prompts, then contrasts naive prompts with the “work back and forth” magic words. This matters because it shows the workflow failed ordinary users unless they knew hidden phrasing.
6:09 frame — planning interaction diagram. The slide shows a faint workflow/cards diagram while the transcript describes the agent asking options, the user choosing, and only then writing the plan. The visual reinforces that alignment should happen before plan generation.
8:12 frame — plan review failure. The presenter discusses prior advice to read long plans and why it was low leverage. The on-screen moment connects to the later recommendation to review shorter artifacts and read code.
12:17 frame — context/dumb-zone discussion. The frame corresponds to the “better instructions, simpler tasks, smaller context windows” section. It matters because the proposed fix is architectural workflow decomposition, not another larger prompt.

My read / why it matters

This is valuable because it publicly retreats from a seductive but dangerous idea: that a good enough plan lets engineers stop reading code. The stronger contribution is not the acronym shift from RPI to “crispy”; it is the review-surface shift from giant generated plans to compact design/outline artifacts plus owned diffs. The talk also usefully normalizes changing advice as evidence accumulates, but the skeptical comments are right to demand measurements and to distrust confident frameworks that may change again.

Verification notes

Source/evidence audit: checked full transcript packet, comments, frame descriptions, HumanLayer GitHub search result, long-context/instruction-following research search results, and related context-engineering references.
Transcript/comment/frame fidelity audit: all claims about RPI, magic words, fresh research context, design/outline/plan stages, vertical plans, reading code, and Q&A caveats are tied to transcript timestamps; comment claims are distilled rather than dumped.
Hallucination/overclaim audit: marked the 150–200 instruction number and 2–3x productivity target as uncertain; avoided claiming the revised workflow is empirically proven across all teams.
Actionable Insights audit: bullets include concrete files, prompt shapes, first experiments, evaluation criteria, and cautions; sources are named where available.
Residual uncertainty: no direct slide deck fetch or full paper reading was performed; external source support is based on search results and named sources, so exact paper findings should be verified before citing in formal work.