
LLM codegen fails and how to stop 'em — Danilo Campos, PostHog

AI Engineer · 19m 18s · Transcript ✅ · Added May 2, 5:52 am GMT+8

Source quality: direct transcript extracted successfully; comments extracted from the top available YouTube comments.

Actionable Insights

  1. Capture failures as eval cases and regression tests instead of treating each bad generation as a one-off. Autonomous codegen works when you stop treating the model as a magic programmer and start treating it as a capable but context-hungry agent that needs fresh documentation, good examples, sequenced instructions, constrained tools, and feedback loops. Maintain reference implementations: small “model airplane” apps for each major framework, language, and integration shape.

  2. Ask agents for one behavior change at a time, run tests, inspect diffs, then continue. The talk is useful here because it skips the vague “agents are the future” layer and covers the boring parts that actually make codegen reliable: context freshness, examples, sequencing, feedback, and permissions.

  3. Upgrade models when useful, but invest first in reproducible context and feedback loops. Better models help, but verification loops compound.

  4. Treat stale model knowledge as a default failure mode. Assume the model does not know your current API.

  5. Serve fresh docs as markdown. Let the agent choose relevant docs and insert them into context.

For each of these, start with a small, reversible pilot: write down the exact input, expected output, owner, and success metric before changing the wider workflow. Treat the first run as an evaluation, not a migration: capture before/after examples, note where the method saves time or improves quality, and keep the old path available until the new one passes repeated checks. The main failure mode to watch for is overgeneralizing the creator’s demo beyond the evidence. If the video or comments only showed a narrow case, keep the rollout narrow and require fresh proof before broad adoption.

Creator’s main claims

  1. LLM code generation fails in predictable ways, not random magic ways.
  2. Reliable codegen requires constraints: context, tests, small scopes, style rules, and review gates.
  3. Production systems should use evals and telemetry to detect regressions in generated code quality.
  4. “Agent writes code” is less important than the surrounding product workflow that verifies and repairs the code.
  5. PostHog’s scale gives practical evidence that autonomous codegen needs guardrails.

Deep research verdicts

1. Codegen failure modes are predictable enough to engineer around

Verdict: Strong agree, high confidence. The specific PostHog docs were not reachable from this environment, but the claim is consistent with observed agentic coding failures and eval-platform practice.

Supporting evidence: common LLM codegen failures include stale APIs, missing edge cases, untested assumptions, style drift, overbroad edits, and hallucinated dependencies. Braintrust’s eval docs support the need for datasets, tasks, scorers, CI/CD evals, and production monitoring because LLM systems can regress silently. Source: https://www.braintrust.dev/docs/guides/evals

Contradicting / limiting evidence: some codegen failures are highly project-specific and only become obvious in production-like environments.

Practical takeaway: capture failures as eval cases and regression tests instead of treating each bad generation as a one-off.
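
As a rough illustration of that takeaway, a captured failure can be stored as data and replayed against every new prompt or model version. This is a minimal sketch, not anything shown in the talk: the `FailureCase` fields, the substring-based scoring, and the `generate` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailureCase:
    """One captured bad generation, kept as a permanent regression check."""
    case_id: str
    prompt: str                  # the exact input that produced the bad output
    must_contain: List[str]      # substrings a correct generation has to include
    must_not_contain: List[str]  # substrings that marked the original failure


def score(case: FailureCase, output: str) -> bool:
    """Return True when the output passes every check for this case."""
    has_required = all(s in output for s in case.must_contain)
    has_forbidden = any(s in output for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_suite(cases: List[FailureCase], generate: Callable[[str], str]) -> List[str]:
    """Run each case through `generate` and return the ids that failed."""
    return [c.case_id for c in cases if not score(c, generate(c.prompt))]
```

Here `generate` stands in for whatever calls your model. A case might require a current call such as a hypothetical `sdk.capture(` and forbid a stale one such as a hypothetical `sdk.track_legacy(`. Substring checks are crude, but they are cheap to write at the moment a failure is noticed.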

2. Small scopes and tests beat broad autonomous delegation

Verdict: Strong agree, high confidence. This matches the best current practice for coding agents.

Supporting evidence: Claude Code docs emphasize specific, concise, concrete instructions and project memory because agent behavior depends heavily on context quality. Source: https://docs.anthropic.com/en/docs/claude-code/memory

Contradicting / limiting evidence: large migrations can be delegated when mechanical and backed by exhaustive tests.

Practical takeaway: ask agents for one behavior change at a time, run tests, inspect diffs, then continue.
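
One way to picture that loop is a driver that applies a single change, runs the tests, and stops at the first failure so a human can inspect the diff. This is a sketch under assumptions: `apply_change` and `tests_pass` are hypothetical callables standing in for your agent call and your test runner.

```python
from typing import Callable, List, Optional, Tuple


def apply_in_small_steps(
    changes: List[str],
    apply_change: Callable[[str], None],
    tests_pass: Callable[[], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply changes one at a time and stop at the first one that breaks the tests.

    Returns the changes that were kept and the change that failed (or None).
    """
    kept: List[str] = []
    for change in changes:
        apply_change(change)
        if not tests_pass():
            # Stop here so the failing diff is small enough to review by hand.
            return kept, change
        kept.append(change)
    return kept, None
```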

3. Product workflow matters more than model choice

Verdict: Mostly agree, medium-high confidence. Better models help, but verification loops compound.

Supporting evidence: eval systems, CI, code review, telemetry, and prompt/memory discipline all reduce failure regardless of model. The PostHog framing also aligns with other talks in this library: AI makes code cheaper, but software quality depends on process.

Contradicting / limiting evidence: model capability still matters for hard reasoning, large refactors, and unfamiliar frameworks.

Practical takeaway: upgrade models when useful, but invest first in reproducible context and feedback loops.

Core thesis

Autonomous codegen works when you stop treating the model as a magic programmer and start treating it as a capable but context-hungry agent that needs fresh documentation, good examples, sequenced instructions, constrained tools, and feedback loops.

Danilo’s strongest claim is that the PostHog Wizard succeeds not because it is mostly clever code, but because it is mostly high-quality prose and context engineering: “90% markdown files, 8% tools for delivering and processing markdown files, and the rest agent harness stuff.”

Big ideas / key insights

1. Model rot is unavoidable for fast-moving software

At 2:18–4:25, Danilo explains that LLMs are snapshots of the web from months ago. For fast-moving libraries and APIs, that means the model is often confidently wrong: inventing keys, making up APIs, and applying stale integration patterns.

His practical answer is not fancy retrieval first; it is fresh markdown context. With today’s large context windows, PostHog lets the agent select up-to-date documentation and slide it directly into context.
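
A minimal sketch of that idea, assuming the docs are already available as markdown strings keyed by a slug: the agent is shown a short menu, picks slugs, and the chosen documents are concatenated into its context. The function names, the slug keys, and the character budget are illustrative assumptions, not PostHog's implementation.

```python
from typing import Dict, List


def build_doc_menu(docs: Dict[str, str]) -> str:
    """List each doc's slug and first line so the agent can pick what it needs."""
    lines = []
    for slug, body in sorted(docs.items()):
        stripped = body.strip()
        title = stripped.splitlines()[0] if stripped else "(empty)"
        lines.append(f"- {slug}: {title}")
    return "\n".join(lines)


def assemble_context(docs: Dict[str, str], chosen: List[str], max_chars: int = 20000) -> str:
    """Concatenate the chosen docs in order, stopping before the budget is exceeded."""
    parts = []
    used = 0
    for slug in chosen:
        body = docs.get(slug)
        if body is None:
            continue  # the agent asked for a doc that does not exist
        section = f"<!-- doc: {slug} -->\n{body.strip()}\n"
        if used + len(section) > max_chars:
            break
        parts.append(section)
        used += len(section)
    return "\n".join(parts)
```

The `chosen` list would come from the agent's reply to the menu. Keeping selection and assembly separate means the docs can be regenerated from the live site without touching the harness.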

2. Give agents “model airplanes,” not full production apps

At 4:55–6:28, he introduces “model airplanes”: thin example projects that have the right shape of a real app without the complexity. They include PostHog integrations across frameworks and languages, with simplified features such as auth that is “auth-shaped” but not production-auth-complete.

This gives the model a concrete pattern for where integration code belongs while keeping the example token-efficient.
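
The Q&A mentions that model airplanes are flattened into markdown references. A rough sketch of what such flattening could look like follows. The directory layout, the suffix filter, and the heading format are assumptions for illustration only.

```python
from pathlib import Path
from typing import Tuple

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block


def flatten_project(
    root: Path,
    suffixes: Tuple[str, ...] = (".py", ".ts", ".tsx", ".js", ".md"),
) -> str:
    """Render every matching file under `root` as one markdown document."""
    sections = [f"# Reference project: {root.name}"]
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue  # skip directories, images, lockfiles, and other noise
        rel = path.relative_to(root).as_posix()
        lang = path.suffix.lstrip(".")
        body = path.read_text(encoding="utf-8").rstrip()
        sections.append(f"## {rel}\n\n{FENCE}{lang}\n{body}\n{FENCE}")
    return "\n\n".join(sections) + "\n"
```

Because the example project is deliberately thin, the flattened output stays small enough to drop into context alongside the docs.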

3. Breadcrumb the agent to limit improvisation

At 6:58–9:31, Danilo warns that if 15,000 monthly integrations produce 15,000 different implementation styles, support becomes a nightmare. The solution is to sequence the task.

Instead of telling the agent “integrate PostHog” up front, the Wizard first asks it to find files with business value: login, Stripe, churn signals, and other meaningful product events. Then it asks what events are worth tracking. Only after the agent has built that intermediate understanding does it implement the integration.

The lesson: don’t over-specify the destination too early. Shape the path.
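
The sequencing described above can be sketched as an ordered list of stages, where each stage sees only the artifacts produced before it and the word “integration” does not appear until the last step. The stage names, prompt wording, and `run_agent` callable are hypothetical. The talk describes the order, not this code.

```python
from typing import Callable, Dict, List, Tuple

# Each stage is (name, instruction template). A template may reference
# the names of earlier stages, which are filled in with their outputs.
STAGES: List[Tuple[str, str]] = [
    ("discovery",
     "List the files in this project that carry business value, such as login or billing."),
    ("event_design",
     "Given these files:\n{discovery}\nPropose events worth tracking, with a one-line description each."),
    ("implementation",
     "Using this event plan:\n{event_design}\nAdd the integration code."),
]


def run_breadcrumbs(run_agent: Callable[[str], str]) -> Dict[str, str]:
    """Run the stages in order, feeding each stage the artifacts produced so far."""
    artifacts: Dict[str, str] = {}
    for name, template in STAGES:
        prompt = template.format(**artifacts)
        artifacts[name] = run_agent(prompt)
    return artifacts
```

The returned dictionary doubles as the intermediate record: the event plan exists as a stable artifact before any code is written.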

4. The biggest source of agent failure is often human error

At 9:31–12:06, he makes a funny but serious point: humans have context limits too. Teams change prompts, tool definitions, docs, and instructions, then forget contradictions or missing pieces.

PostHog catches this with inference-time interrogation at the stop hook: after every run, they ask the agent what could have been done better to set it up for success. This surfaced missing tool permissions, contradictory tool instructions, and language-mismatched guidance such as JavaScript instructions inside a Python project.
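
To make use of those answers at volume, they need to be aggregated. The sketch below tallies free-text answers into rough buckets by keyword. The bucket names and trigger words are invented for illustration, and a real system might classify answers with a model instead.

```python
from collections import Counter
from typing import Iterable, List

STOP_HOOK_QUESTION = "What could we have done better to set you up for success?"

# Hypothetical buckets and trigger keywords, chosen only for illustration.
BUCKETS = {
    "permissions": ("permission", "not allowed", "denied"),
    "contradiction": ("contradict", "conflict"),
    "language_mismatch": ("javascript", "python", "wrong language"),
}


def bucket_answer(answer: str) -> List[str]:
    """Return every bucket whose keywords appear in the answer, or ['other']."""
    text = answer.lower()
    hits = [name for name, keys in BUCKETS.items() if any(k in text for k in keys)]
    return hits or ["other"]


def tally(answers: Iterable[str]) -> Counter:
    """Count how often each bucket shows up across many runs."""
    counts: Counter = Counter()
    for answer in answers:
        counts.update(bucket_answer(answer))
    return counts
```

A bucket that suddenly grows after a prompt or tool change is a hint that the change introduced a defect in the harness rather than in the model.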

5. Tool permissions need to prevent “successful but creepy” behavior

At 12:06–13:38, Danilo describes an early Wizard version reading .env files because file writes mechanically require reads. That solved the integration task but risked sending sensitive environment contents into cloud inference logs.

They fixed it by locking down reads around env files and giving the agent a narrow tool that could only check whether a key exists and write a new value. The principle: the agent can fulfill the user’s request and still violate trust if tool access is too broad.
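
A narrow tool of that kind might expose just two operations, neither of which ever returns a value from the file. This is a sketch under assumptions: the function names and the simple `KEY=VALUE` parsing are illustrative and do not reflect PostHog's actual tool.

```python
from pathlib import Path
from typing import Optional


def _key_of(line: str) -> Optional[str]:
    """Return the variable name on a KEY=VALUE line, or None for blanks and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None
    name = stripped.split("=", 1)[0].strip()
    if name.startswith("export "):
        name = name[len("export "):].strip()
    return name


def env_has_key(path: Path, key: str) -> bool:
    """Report only whether the key is present. The value never leaves this function."""
    if not path.exists():
        return False
    return any(_key_of(line) == key for line in path.read_text(encoding="utf-8").splitlines())


def env_set_key(path: Path, key: str, value: str) -> None:
    """Add or replace one key, leaving every other line untouched and unreported."""
    if "\n" in value or "\r" in value:
        raise ValueError("value must be a single line")
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if _key_of(line) == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

The tool still reads the file locally in order to edit it, but the agent only ever receives a boolean or nothing, so existing secrets do not enter the model's context or the inference logs.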

6. Prose is becoming a compounding asset

At 14:09–16:15, he argues that code depreciates, but good prose/context can appreciate as models improve. The Wizard’s value lives mostly in markdown, model examples, and sequencing rather than elaborate scaffolding.

The agent metaphor is an octopus: it can wriggle around problems if you give it enough information and sequence that information well. Overconstraining it with code can reduce its ability to adapt.

Best timestamped moments

  • 0:15 — “I’m not afraid of robots because they have already bloodied my nose.” Sets the tone: this is operational learning from painful production experience.
  • 0:45 — The Wizard turns “two hours of misery” into “8 minutes of pseudo entertainment.” Clear framing of the product value.
  • 1:15 — Scale claim: 15,000 people per month run the Wizard and get working integrations.
  • 2:18 — Model rot: models are expensive snapshots of an older web.
  • 3:23 — “RAG is good,” but current context windows make raw fresh markdown extremely effective.
  • 4:25 — Stale models invented APIs and keys; not PostHog’s fault, but PostHog’s problem.
  • 5:25 — “Model airplanes” as thin reference apps with correct integration shape.
  • 6:58 — 15,000 successful integrations can still be a support disaster if every one is structured differently.
  • 7:29 — Breadcrumbing prevents the agent from blasting through the task in a brittle “Claude Code shaped” path.
  • 8:00 — Ask for business-value files before asking for integration code.
  • 11:02 — Stop-hook interrogation: ask the robot user what would have made the run more successful.
  • 12:06 — Security/trust section: agents running on user machines must not read more than necessary.
  • 13:07 — Narrow env-file tool: check key presence and write value, without exposing contents.
  • 14:42 — Code is a depreciating asset; prose/context becomes more valuable as models improve.
  • 15:13 — The Wizard is mostly markdown, not code.
  • 16:46 — Q&A reveals the implementation: generated skill files from a context service, with model airplanes flattened into markdown references.
  • 18:19 — The system uses the Claude Agent SDK wrapped in a CLI, with PostHog covering inference through an LLM gateway after login.

Actionable takeaways

  1. Treat stale model knowledge as a default failure mode. Assume the model does not know your current API.
  2. Serve fresh docs as markdown. Let the agent choose relevant docs and insert them into context.
  3. Maintain reference implementations. Create small “model airplane” apps for each major framework/language/integration shape.
  4. Flatten examples into agent-readable context. PostHog’s Q&A describes generated skill files that include docs plus model airplanes as references.
  5. Breadcrumb the task. Sequence the agent through discovery → event design → implementation instead of asking for the final integration immediately.
  6. Capture intermediate artifacts. Event names and descriptions are written into a small file before implementation, giving the agent a stable plan.
  7. Interrogate every run. At the stop hook, ask: “What could we have done better to set you up for success?” Aggregate those answers to find prompt/tool/doc defects.
  8. Constrain sensitive tools. Replace broad file access with narrow operations, especially around secrets such as .env files.
  9. Invest in prose. Keep high-quality markdown instructions, docs, examples, and skill files as first-class production assets.
  10. Avoid over-scaffolding. Give the agent enough context and constraints to succeed, but leave room for adaptive problem-solving.

Comment insights

The extracted comment set is tiny: one visible top comment.

Agreement / enthusiasm

The only extracted commenter reacts strongly to Danilo’s delivery: “One minute in and I already love this guy!” That suggests the talk’s humor and plainspoken style landed immediately, even before the technical content developed.

Disagreement patterns

No disagreement was present in the extracted comments. There is no comment-side pushback on the architecture, security model, or “prose over code” thesis in the available data.

Practitioner additions

No commenter added additional implementation patterns or field experience. The useful practitioner detail comes from the Q&A instead: PostHog uses generated skill files, a context service, flattened model-airplane markdown, the Claude Agent SDK, a CLI wrapper, and an LLM gateway for inference.

Memorable phrases from comments

  • “One minute in and I already love this guy!”

Pushback / caveats

No comment-derived caveats were extracted. Caveats from the talk itself are important: broad file access can leak secrets, stale docs cause hallucinated APIs, and unconstrained integrations create support burden even when they technically work.

Concrete tools/workflows mentioned by commenters

None in the comments. Concrete tools/workflows mentioned in the talk and Q&A include:

  • PostHog Wizard
  • Cursor as an example of primitive agent-driven integration attempts
  • fresh markdown documentation
  • “model airplane” reference projects
  • generated skill files
  • a context service that flattens examples into markdown
  • Claude Agent SDK
  • CLI wrapper
  • LLM gateway
  • stop-hook run interrogation
  • narrow .env tools for checking key presence and writing values without reading secret contents

My read / why it matters

This is one of the more useful production-agent talks because it avoids the vague “agents are the future” layer and gets into the boring parts that actually make codegen reliable: context freshness, examples, sequencing, feedback, and permissions.

The key inversion is that the valuable artifact is not necessarily more code. It is a maintained body of prose and examples that tells the agent what “good” looks like today. That is especially relevant for fast-moving products where model pretraining will always lag reality.

The most transferable pattern is the stop-hook question. It is cheap, humble, and powerful: ask the agent where your harness failed it. That turns every failed or awkward run into feedback about missing docs, broken permissions, contradictory instructions, or bad sequencing.

The security section also deserves attention. Agent UX can look magical while quietly doing unacceptable things under the hood. A production agent should not merely complete the task; it should complete it in a way that preserves trust.

Screen-level insights

  • No key-frame metadata was available for this video, so screen-level confidence is limited. Claims should be judged mostly from transcript, comments, and external sources.

Verification notes

  • Source/evidence audit: Checked the existing analysis against extracted transcript/comments and available frame metadata. Added missing sections so the public page is not a transcript packet.
  • Transcript/comment/frame fidelity: Timestamped and screen claims should trace to the extraction artifacts under youtube-extract/; comment claims are limited to the extracted top comments.
  • Hallucination/overclaim audit: Treat strong tool/productivity claims as hypotheses unless backed by official docs, reproducible commands, tests, or production metrics.
  • Actionable Insights audit: Existing top recommendations were preserved and expanded to the newer detailed format, with fuller implementation notes, evaluation checks, and cautions where the existing evidence supports elaboration, so users know first experiments, cautions, and validation criteria.
  • Residual uncertainty: This repair pass validates structure and evidence discipline, but some older analyses may still deserve deeper bespoke research before high-stakes decisions.