Agents Don’t Do Standups: Building the Post-Engineer Engineering Org — Mike Spitz, PFF

Channel: AI Engineer

Creator/speaker: Mike Spitz, PFF
Duration: 17:49
Evidence used: extracted transcript/comments, key frames, and external sources listed below.

Actionable Insights

The useful lesson is not “delete Scrum.” It is this: when agents can implement faster than humans can coordinate, redesign the delivery system around specs, lightweight design docs (LDDs), deterministic checks, rapid stakeholder feedback, and QA agents.

Start with a small, reversible pilot: write down the exact input, expected output, owner, and success metric before changing the wider workflow. Treat the first run as an evaluation, not a migration. Capture before/after examples, note where the method saves time or improves quality, and keep the old path available until the new one passes repeated checks.

The evidence is a single case study. PFF reports that two strong engineers using agents delivered a high-output product initiative faster than a larger traditional team, with fewer ceremonies and heavier reliance on specs, LDDs, automated tickets, review gates, and customer feedback.

The main failure mode is overgeneralizing the speaker’s demo beyond the evidence. The talk and comments cover a narrow case, so keep the rollout narrow and require fresh proof before broad adoption.

Workflow to pilot

Pick a narrow, non-critical product slice.
- PFF started with experimentation before moving to a high-traffic product area.
- Choose a slice with clear acceptance criteria, feature flags, and rollback.
Use an agent-interviewed spec.
- Prompt the agent to interview product/design/engineering until ambiguity is low.
- Required spec fields:
  - ☐ User problem.
  - ☐ Non-goals.
  - ☐ Acceptance criteria.
  - ☐ Analytics/events.
  - ☐ Design constraints.
  - ☐ Security/privacy risks.
  - ☐ Rollout/feature flag plan.
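
The required spec fields above can be enforced mechanically before any design work starts. The following is a minimal sketch, not PFF's tooling: the `Spec` class, its field names, and the example draft are all hypothetical, chosen only to mirror the checklist.

```python
from dataclasses import dataclass, fields


@dataclass
class Spec:
    """One field per checklist item; an empty string means 'not yet answered'."""
    user_problem: str = ""
    non_goals: str = ""
    acceptance_criteria: str = ""
    analytics_events: str = ""
    design_constraints: str = ""
    security_privacy_risks: str = ""
    rollout_plan: str = ""


def missing_fields(spec: Spec) -> list[str]:
    """Return the names of checklist fields that are still blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Hypothetical draft: two fields answered, five still open.
draft = Spec(
    user_problem="Users cannot compare two players side by side.",
    acceptance_criteria="Comparison view loads for any two player IDs.",
)
print(missing_fields(draft))
```

A non-empty result is a signal that the agent interview should continue. A check like this only catches blank fields, not vague answers, so human review of the spec still applies.
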
Generate a lightweight design document (LDD).
- Include architecture, files likely touched, API contracts, data model, testing plan, and risk register.
- Review LDD before implementation. This is where humans should catch overengineering and missing assumptions.
Decompose into independent tickets.
- Require the agent to mark dependencies and blockers.
- Good ticket shape:
  - ☐ One independently reviewable change.
  - ☐ Deterministic acceptance criteria.
  - ☐ Test/QA plan.
  - ☐ Owner and risk label.
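
If the agent marks dependencies on each ticket, the tickets can be grouped into waves that are safe to work on in parallel. This sketch uses Python's standard-library `graphlib`; the ticket IDs and the `plan_waves` helper are hypothetical and not taken from the talk.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tickets: each key maps to the set of tickets it depends on.
tickets = {
    "T1-api-contract": set(),
    "T2-data-model": {"T1-api-contract"},
    "T3-ui-component": {"T1-api-contract"},
    "T4-e2e-test": {"T2-data-model", "T3-ui-component"},
}


def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tickets into waves; every ticket in a wave has its dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if tickets block each other in a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(plan_waves(tickets))
```

A `CycleError` here means the decomposition is not made of independent tickets and should go back to the agent for another pass.
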
Replace ritual meetings with feedback loops, not silence.
- PFF replaced daily standups/sprint ceremonies with every-other-day huddles including engineering, product, and design.
- Keep the huddle focused on shipped increments, customer feedback, and blockers.
Create guardrail review gates.
- ☐ Unit tests.
- ☐ E2E tests.
- ☐ Linters/typecheck.
- ☐ PR prerequisites.
- ☐ Feature flags.
- ☐ Agentic review for style/naming/consistency.
- ☐ Human review for system design, security, and product feel.
- CI service-container docs for reproducible dependencies: https://docs.github.com/en/actions/use-cases-and-examples/using-containerized-services/about-service-containers
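
The deterministic gates in this checklist all share one property: a command either exits cleanly or it does not. A minimal sketch of a gate runner built on that assumption follows. The gate names and the placeholder commands are hypothetical; a real repository would substitute its own test, lint, and typecheck commands.

```python
import subprocess
import sys


def run_gates(gates: dict[str, list[str]]) -> dict[str, bool]:
    """Run each gate command; a gate passes only if its exit code is 0."""
    results = {}
    for name, command in gates.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def all_passed(results: dict[str, bool]) -> bool:
    """A PR is mergeable only when every gate passed."""
    return all(results.values())


# Placeholder commands so the sketch runs anywhere; they stand in for the
# project's real test, lint, and typecheck invocations.
demo_gates = {
    "unit_tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
    "lint": [sys.executable, "-c", "raise SystemExit(0)"],
    "typecheck": [sys.executable, "-c", "raise SystemExit(1)"],  # simulated failure
}
demo_results = run_gates(demo_gates)
print(demo_results, all_passed(demo_results))
```

Agentic and human review are deliberately left out of this runner: they are judgment calls, not exit codes, and belong after the deterministic gates have passed.
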
Build a QA agent after staging deploy.
- Input: tickets + acceptance criteria + deployed staging URL.
- Output: pass/fail evidence, screenshots, reproduction steps, proposed fixes.
- Use browser automation or Playwright in CI where possible.
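
The input/output contract of the QA agent can be pinned down as data before any agent is built. The sketch below is an assumption about shape, not the speaker's implementation: `CriterionResult`, `run_qa`, the stub checker, and the staging URL are hypothetical. In practice the `check` callable would wrap browser automation (for example Playwright) and return screenshot paths as evidence.

```python
from dataclasses import dataclass, field
from typing import Callable

# A checker takes (staging_url, criterion) and returns (passed, evidence).
Checker = Callable[[str, str], tuple[bool, list[str]]]


@dataclass
class CriterionResult:
    ticket_id: str
    criterion: str
    passed: bool
    evidence: list[str] = field(default_factory=list)  # screenshots, repro steps


def run_qa(staging_url: str, criteria: dict[str, list[str]], check: Checker) -> list[CriterionResult]:
    """Evaluate every acceptance criterion of every ticket against staging."""
    results = []
    for ticket_id, ticket_criteria in criteria.items():
        for criterion in ticket_criteria:
            passed, evidence = check(staging_url, criterion)
            results.append(CriterionResult(ticket_id, criterion, passed, evidence))
    return results


def failures(results: list[CriterionResult]) -> list[CriterionResult]:
    return [r for r in results if not r.passed]


def stub_check(url: str, criterion: str) -> tuple[bool, list[str]]:
    """Stand-in checker: fails any criterion still marked TODO."""
    return "TODO" not in criterion, [f"checked {criterion!r} against {url}"]


criteria = {
    "T3-ui-component": ["Comparison view renders two players", "TODO: empty state copy"],
    "T4-e2e-test": ["Feature flag off hides the view"],
}
report = run_qa("https://staging.example.com", criteria, stub_check)
print([r.criterion for r in failures(report)])
```

Keeping evidence attached to each result makes the report reviewable by a human, which matters if fix proposals are later automated.
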

Evaluation criteria

Customer quality score / CSAT for the shipped features.
Deployment frequency and rollback rate.
Lead time from spec approved to production.
Escaped defects and security issues.
Rework rate after stakeholder review.
Human hours spent coordinating vs building/reviewing.
Token cost per accepted ticket.
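
Several of these criteria reduce to simple ratios that can be computed the same way before and during the pilot. A minimal sketch, with hypothetical function names and made-up sample numbers used only to show the arithmetic:

```python
from datetime import datetime


def lead_time_days(spec_approved: datetime, in_production: datetime) -> float:
    """Lead time from spec approval to production, in days."""
    return (in_production - spec_approved).total_seconds() / 86400


def rollback_rate(deploys: int, rollbacks: int) -> float:
    """Share of deploys that were rolled back; 0.0 when nothing was deployed."""
    return rollbacks / deploys if deploys else 0.0


def token_cost_per_accepted_ticket(total_token_cost: float, accepted_tickets: int) -> float:
    """Token spend divided by tickets that passed review; infinite if none passed."""
    if accepted_tickets == 0:
        return float("inf")
    return total_token_cost / accepted_tickets


# Made-up sample values, not figures from the talk.
print(lead_time_days(datetime(2025, 1, 1), datetime(2025, 1, 4, 12)))  # 3.5
print(rollback_rate(deploys=40, rollbacks=2))                          # 0.05
print(token_cost_per_accepted_ticket(120.0, 30))                       # 4.0
```

Computing the same functions over the old workflow's history gives the baseline that the case-study numbers lack.
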

Integration cautions

Do not extrapolate from two strong engineers to an entire engineering org without controls.
The talk’s “25x deploys” and “10x output” numbers are case-study metrics, not general laws.
Eliminating standups can hide misalignment unless replaced by real product/design/customer feedback.
Agent-written code can be too large, shortcut-prone, or hard to review; keep PRs small and deterministic.

Core thesis

Spitz argues that once AI agents remove implementation as the bottleneck, traditional engineering rituals designed around human coordination, especially Scrum ceremonies, should be reconsidered. PFF’s case study claims two strong engineers using agents delivered a high-output product initiative faster than a larger traditional team, with fewer ceremonies and heavier reliance on specs, LDDs, automated tickets, review gates, and customer feedback.

The operational pattern is useful. The headline metrics come from a comparison without controls and should be treated as an existence proof, not a benchmark.

Comment insights

The comments provide exactly the caveats the talk needs:

Several viewers praised the talk as clear and pragmatic.
A top question asks how alignment happens without meetings. This is the core risk: product/design alignment does not disappear just because standups do.
Multiple comments joked about the speaker’s many browser tabs; not analytically important, but visually noticed by the audience.
Critical commenters argue Scrum is not the bottleneck, and that a two-engineer “crack team” naturally beats a larger team. This is a valid threat to the case study’s causal interpretation.
A strong negative comment warns of security issues from thousands of lines generated from incomplete docs. The speaker replies that guardrails and stakeholder feedback exist but were not fully covered due to time constraints.
Another comment points out that replacing 15-minute standups with 30–60 minute huddles every other day is not “no meetings.” Correct: it is a shift from status ritual to feedback review.

Practical synthesis: keep the spec/LDD/QA loop; be skeptical of the anti-Scrum rhetoric and measure outcomes.

Deep research

Supporting sources:

Claude Code overview confirms coding agents can work across files/tools, automate development tasks, and integrate with CLI/IDE/web/GitHub workflows. This supports the feasibility of the PFF workflow, though not its metrics. Source: https://docs.anthropic.com/en/docs/claude-code/overview
Claude Code Skills docs support the idea of encoding team-specific engineering patterns into reusable skills. Source: https://docs.anthropic.com/en/docs/claude-code/skills
Claude Code hooks/security docs support the need for guardrails around tool use, permissions, and command approval. Sources: https://docs.anthropic.com/en/docs/claude-code/hooks and https://docs.anthropic.com/en/docs/claude-code/security
GitHub Actions service containers show a standard way to run reproducible integration dependencies in CI, useful for deterministic review gates. Source: https://docs.github.com/en/actions/use-cases-and-examples/using-containerized-services/about-service-containers
Intercom’s “Shipping is your company’s heartbeat” post supports the broader claim that frequent shipping can improve feedback loops when deployment is low-cost and monitored. Source: https://www.intercom.com/blog/shipping-is-your-companys-heartbeat/

Contradicting/limiting evidence:

No external source verifies PFF’s internal 25x deploy, 10x output, 2.7x capacity, or 8.6/10 customer quality numbers.
The case study compares a two-person team of top engineers with a larger team; team size, selection bias, project type, and management attention are confounders.
Agile/Scrum is partly about alignment and feedback, not just standups and status updates. The talk understates the need for human product judgment and overstates the death of rituals.

Verdict

Claim: AI-augmented engineers delivered much faster in the PFF case study.

Verdict: Plausible but not independently proven.
Confidence: Medium.
Practical takeaway: run your own controlled pilot with comparable tasks and outcome metrics.

Claim: Scrum did not survive because rituals designed for humans do not work for agents.

Verdict: Mixed.
Confidence: Medium.
Agree: status ceremonies should change when agents update tickets and implementation cycles shorten.
Disagree/overclaim: alignment, prioritization, and retrospective learning still matter. They may need different rituals, not zero rituals.

Claim: Start with your strongest engineers, the ones with the most system knowledge.

Verdict: Agree.
Confidence: High.
Practical takeaway: agents amplify judgment. Experts are better at writing specs, spotting shortcuts, and building reusable skills.

Claim: Agentic code review should handle style/naming and humans should focus on big picture.

Verdict: Agree with caveats.
Confidence: Medium-high.
Practical takeaway: automate low-emotional, deterministic review feedback; keep humans for architecture, security, product feel, and ambiguous tradeoffs.

Claim: QA agents can eventually self-heal failed acceptance criteria.

Verdict: Mixed / promising but risky.
Confidence: Medium.
Practical takeaway: start by producing evidence-backed QA reports; only auto-create fix PRs for low-risk failures with strong tests.

Screen-level insights

0:37 title slide: “Agents don’t do stand-ups / Building the post-engineering org.” This visual frames the talk as organizational redesign, not just tooling.
3:40 results slide: Shows “Results: AI augmented engineering” with 16w without AI vs 7w actual, 1 engineer freed at week 3.5, 2.7x capacity multiplier, and 8.6/10 average quality across 5 features. This is the central evidence slide, but it is internal/case-study evidence.
5:12 “Scrum didn’t survive” slide: The subtitle says rituals designed for humans do not work for agents. The visual matters because it marks the talk’s shift from productivity metrics to process redesign.
10:18 “Guardrail architecture: agent specific review gates” slide: Shows “Verifiable, deterministic tasks” and examples like unit tests, E2E tests, linters, PR prerequisites. This is the practical corrective to the talk’s risky velocity claims: guardrails are the safety mechanism.
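
As a sanity check on the 3:40 results slide, the calendar speedup implied by the two durations can be computed directly. This sketch uses only the two week counts quoted above; the variable names are illustrative.

```python
weeks_without_ai = 16  # estimate shown on the results slide
weeks_actual = 7       # actual duration shown on the results slide

# Ratio of estimated to actual calendar time.
calendar_speedup = weeks_without_ai / weeks_actual
print(round(calendar_speedup, 2))  # 2.29
```

The slide's 2.7x capacity multiplier is a different quantity from this calendar ratio. Its inputs (for example, the baseline team size) are not stated in this analysis, so it cannot be recomputed here.
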

Visible UI/tools: Google Slides/browser, agent-generated specs/LDDs/tickets/PRs discussed in transcript, feature flags, analytics generation, agentic code review, QA agent, staging deploys, deterministic CI checks.

Verification notes

Verification passes performed:

Source/evidence audit: Cross-checked agent workflow feasibility against Claude Code docs, hooks/security docs, GitHub Actions CI docs, and Intercom’s shipping-feedback argument. External evidence supports methods, not PFF’s private metrics.
Transcript/comment/frame fidelity audit: Matched the analysis to transcript sections on 25x deploys, 10x output, huddles, spec/LDD/ticket flow, guardrails, QA agents, and comments challenging methodology.
Hallucination/overclaim audit: Treated PFF numbers as unverified case-study claims and highlighted confounders: two top engineers, smaller team, project selection, and metric design.
Actionable Insights audit: The top section provides a pilot workflow, checklists, evaluation criteria, commands/source links, and concrete cautions rather than generic “use AI” advice.

Residual uncertainty: there is no public dataset or control group for the case study; comments suggest important methodological concerns that remain unresolved.
