BDD, ADR, PRD, WTF: Capturing Decisions for Humans and AI Alike — Michal Cichra, Safe Intelligence

AI Engineer | 12:49 | Transcript available | Added Jun 7, 1:51 am GMT+8

Actionable Insights

Turn ADRs into enforceable guardrails, not archive docs. Start by adding docs/adr/0001-no-orm-in-templates.md with status, context, decision, consequences, enforced_by, and file_patterns, then wire the rule to a tool such as import-linter or a custom lint check. Cichra’s concrete example bans ORM access from template/rendering layers to prevent hidden N+1 queries; the key move is linking the lint failure back to the ADR so humans and agents learn the reason, not just the prohibition. Evaluate this by deliberately adding a forbidden import and confirming both local hooks and CI reject it with a helpful link.
Keep PRDs lightweight but explicit enough to survive context loss. For each feature, create a short docs/prd/<feature>.md with: problem, goal, user journey, non-goals, affected surfaces, and links to BDD scenarios and ADRs. Cichra argues the document is for agents and for “you six weeks from now”; do not turn it into a massive spec unless the domain requires it. The first test is whether a new engineer or coding agent can answer “why does this flow exist?” without asking the founding engineer.
Use executable BDD for critical journeys where plain specs are too weak. Cichra recommends Cucumber and Gherkin because scenarios are readable and executable. A starting pattern: features/checkout_happy_path.feature with Given/When/Then steps tied to PRD journey IDs, implemented through browser-level tests that cannot import database modules. Evaluate by asking a reviewer to read only the feature file and tell you whether the intended behavior is clear. Caution: Gherkin itself does not prevent spec drift, so pair it with ownership and traceability.
Document UI language as rules agents can see and tests can enforce. Add a design-system page or Storybook-like preview catalog for components: button variants, states, one-primary-button-per-page rule, allowed spacing/color tokens, and forbidden inline styles. Cichra’s point is that agents need visible, reusable patterns just as humans do. A first experiment is to add visual/component previews and a linter that blocks raw hex colors or inline styles outside the component layer; success means AI-authored UI reuses components instead of inventing near-duplicates.
Use one enforcement loop across humans and agents: hooks → CI → focused feedback → retry. Put formatting, type checks, architecture checks, BDD tests, document linting, and duplication checks behind the same commands in pre-commit/pre-push hooks and CI. The transcript emphasizes that agents must use Git to deliver PRs, so failed hooks become the feedback channel. Keep failures concise and link each rule to an ADR/PRD; evaluate by tracking whether rejected commits are fixed without a human re-explaining the rule.
Add a drift audit before trusting the system. The strongest commenter caveat is specification drift: ADRs, PRDs, BDD files, and code can diverge. Add a weekly or PR-level checklist: changed code has matching ADR/PRD/BDD updates; BDD scenarios still exercise the declared journey; deprecated ADRs are marked superseded; agent-generated doc changes are reviewed by a human. Microsoft’s “LLMs Corrupt Your Documents When You Delegate” work and the commenter’s Parnas/Dijkstra references are a useful warning: documents are not truth unless maintained against executable evidence.
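
The PR-level part of the drift audit above can be approximated with a small script. This is a minimal sketch, not something shown in the talk: it assumes the changed paths come from something like `git diff --name-only`, that code lives under a hypothetical `src/` directory, and that documents follow the docs/adr, docs/prd, and features layout suggested in these notes. It only flags code changes that arrive with no document change at all; it cannot tell whether the documents are still correct.

```python
from pathlib import PurePosixPath

# "src" is an assumed code root; the document roots follow the layout in these notes.
CODE_ROOTS = ("src",)
DOC_ROOTS = ("docs/adr", "docs/prd", "features")


def _under(path: str, root: str) -> bool:
    """True when path is inside root, compared component by component."""
    parts = PurePosixPath(path).parts
    root_parts = PurePosixPath(root).parts
    return parts[: len(root_parts)] == root_parts


def drift_warnings(changed_paths: list[str]) -> list[str]:
    """Warn when code changed but no ADR, PRD, or feature file changed with it."""
    code = [p for p in changed_paths if any(_under(p, r) for r in CODE_ROOTS)]
    docs = [p for p in changed_paths if any(_under(p, r) for r in DOC_ROOTS)]
    if code and not docs:
        listed = ", ".join(sorted(code))
        return [
            f"Code changed without ADR/PRD/BDD updates: {listed}. "
            "Confirm no decision or journey document needs a matching change."
        ]
    return []
```

A warning here is a prompt for a human reviewer rather than a hard failure, since many code changes legitimately need no document update.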

Core thesis

AI makes old software-memory problems show up faster. Teams already forgot why flows, features, and architecture exist; LLM agents also lose context, compact conversations, and have no durable memory unless decisions are written down and linked to enforcement. Cichra’s thesis is that ADRs, PRDs, BDD, design systems, hooks, CI, skills, and linters form a practical memory-and-feedback loop for humans and agents.

Big ideas / key insights

ADRs capture architectural intent and enforcement, not just decisions.
PRDs can be small: problem, goal, user journey, and why the feature exists.
BDD is valuable again because it can be both human-readable and executable.
Design systems are part of agent reliability: visual consistency needs documented components, previews, and rules.
Automation should remove low-level review debates: formatting, import boundaries, type checks, doc linting, and architecture rules belong in tools.
“What you cannot find, you cannot enforce”: rules need discoverable docs and machine-checkable hooks.
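
As one illustration of turning a design-system rule into a machine-checkable hook, the sketch below flags raw hex colors and inline styles outside the component layer. It is an assumption-laden example: the `src/components/` location is hypothetical, and the regular expressions are deliberately crude (for instance, an anchor link such as `#feed-link` would be reported as a color), so a real project would more likely rely on its existing linter's rules.

```python
import re

# Matches #rgb, #rgba, #rrggbb, and #rrggbbaa; approximate by design.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})\b")
# Matches style= attributes in markup or JSX.
INLINE_STYLE = re.compile(r"\bstyle\s*=")
# Hypothetical location of the component layer, where raw values are allowed.
COMPONENT_DIR = "src/components/"


def lint_ui_source(path: str, source: str) -> list[str]:
    """Return one finding per rule violation, with file and line number."""
    if path.startswith(COMPONENT_DIR):
        return []
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if HEX_COLOR.search(line):
            findings.append(f"{path}:{lineno}: raw hex color; use a design token")
        if INLINE_STYLE.search(line):
            findings.append(f"{path}:{lineno}: inline style; reuse a component")
    return findings
```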

Best timestamped moments

1:09–1:40 — Limited context is the shared failure mode. The “monkeys and ladder” story is used to explain inherited behavior without rationale. It maps cleanly to both human turnover and LLM context loss.
2:10–3:13 — ADRs with enforcement. The N+1-query example is the most concrete technical section: split layers, return plain shapes from DB reads, ban ORM imports where they create hidden queries, and lint the architecture.
4:16–5:16 — BDD closes a spec-driven gap. Markdown specs can describe intended behavior, but Cucumber/Gherkin scenarios can be reviewed and executed.
6:18–6:48 — UI consistency requires a language. The primary-button example shows how visual rules become reusable components and reviewable previews.
7:18–9:25 — The enforcement loop. Git hooks and CI catch skipped checks, while failures link back to the decision document.
10:26–10:56 — Focused skills and tests. Different skills change loop focus: ADR lookup, PRD lookup, browser-first UI iteration, and coverage/file-change-based test selection.
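
The file-change-based test selection mentioned in the last moment can be sketched as a lookup against a coverage map. The talk does not describe how such a map is built, so the mapping here (test file to the set of source files it exercised) is a hand-written assumption, and the function name is illustrative.

```python
def select_tests(changed_files: set[str], coverage_map: dict[str, set[str]]) -> list[str]:
    """Pick tests that either changed themselves or cover a changed source file."""
    selected = set()
    for test, covered in coverage_map.items():
        if test in changed_files or covered & changed_files:
            selected.add(test)
    return sorted(selected)


# Hypothetical coverage map for illustration only.
EXAMPLE_COVERAGE = {
    "tests/test_checkout.py": {"src/checkout.py", "src/cart.py"},
    "tests/test_search.py": {"src/search.py"},
}
```

Calling `select_tests({"src/cart.py"}, EXAMPLE_COVERAGE)` selects only the checkout test. A stale map silently skips relevant tests, so a full run in CI remains the safety net.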

Practical takeaways / recommended workflow

Create docs/adr, docs/prd, and features directories.
Use ADRs for durable architecture constraints; each ADR states how it is enforced.
Use PRDs for feature intent and user journeys; link PRD journeys to BDD scenarios.
Use Cucumber/Gherkin or equivalent readable executable specs for critical flows.
Add design-system previews and lintable UI rules for components, tokens, and composition.
Wire checks into pre-commit/pre-push hooks and CI with the same commands.
Make every automated failure point to the relevant ADR/PRD/BDD document.
Add a drift review: docs are updated when code changes, and executable checks prove the docs still correspond to behavior.
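
One way to keep hooks and CI on the same commands, with every failure pointing at a document, is a single check runner that both entry points call. The sketch below is hypothetical: the `Check` type and `run_checks` function are names introduced here, and real entries would use the project's actual linter and test commands.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One automated check plus the document that explains its rule."""
    name: str
    command: list[str]
    doc: str  # path to the ADR, PRD, or feature file behind the rule


def run_checks(checks: list[Check]) -> list[str]:
    """Run every check and return one concise failure line per failed check."""
    failures = []
    for check in checks:
        result = subprocess.run(check.command, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{check.name} failed; see {check.doc}")
    return failures
```

A pre-commit hook, a pre-push hook, and the CI job can each call `run_checks` with the same list and exit non-zero when it returns anything, so humans and agents see identical feedback.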

Comment insights

The comments split into three useful clusters. First, several viewers mostly reacted to distracting audio/door-hinge noise, so engagement with the technical content was partly diluted. Second, practitioners liked the framing: one commenter said BDD plus agentic skills and context engineering produced better results than plain spec-driven development, and another distilled the benefit as enabling long sessions to compact indefinitely because important knowledge is externalized. Third, the strongest technical pushback warned about specification drift: ADRs, BDD files, and code can fall out of sync; commenters cited Parnas, Dijkstra, Playwright’s lack of native Gherkin support, and Microsoft work on document corruption as reasons not to assume agent-reviewed docs stay correct.

Deep research on the main claims

Claim: ADRs are a good way to preserve architectural rationale. Support: adr.github.io traces ADR popularity to Michael Nygard’s 2011 “Documenting Architecture Decisions”; Nygard describes ADRs as short text files recording decision context and consequences. Martin Fowler also describes ADRs as short documents capturing decisions relevant to a product or ecosystem. Contradiction/caution: ADRs are only useful if discoverable and maintained; they do not enforce anything by themselves.
Claim: BDD/Cucumber can make specs readable and executable. Support: Cucumber’s documentation says Cucumber reads executable specifications in plain text and validates software against them; the Gherkin reference defines keywords that structure executable specifications. Contradiction/caution: the commenter’s point is fair: nothing in the material reviewed here shows that agents follow Gherkin better than EARS or Playwright-style test descriptions. Playwright can describe browser tests without Gherkin; Cucumber adds collaboration/readability but also glue-code overhead.
Claim: Enforced hooks/CI/linters improve agent adherence. Support: this is consistent with long-standing CI practice and with the transcript’s concrete import-boundary example. Deterministic checks are more reliable than asking a probabilistic model to “remember” architecture. Contradiction/caution: checks cover only what is expressible and maintained; over-broad lint rules can block legitimate refactors or create rule fatigue.
Claim: Externalized docs help agents survive context compaction. Support: the transcript’s logic is plausible: durable files can be re-read after compaction, and agents can follow links from failures back to rationale. Contradiction/caution: Microsoft Research’s 2026 discussion of “LLMs Corrupt Your Documents When You Delegate” warns that delegated document maintenance can silently introduce errors; externalized docs need human review and executable tests.
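
Since the cautions above keep returning to executable evidence, one small traceability check is to confirm that every PRD journey has at least one tagged Gherkin scenario. This is a sketch under assumptions: Gherkin tags (lines starting with `@`) are standard syntax, but the `@prd-...` naming convention and the journey IDs are invented here for illustration.

```python
import re

# Hypothetical tag convention, for example @prd-checkout-j1.
TAG = re.compile(r"@(prd-[\w-]+)")


def journeys_in_feature(feature_text: str) -> set[str]:
    """Collect PRD journey tags from Gherkin tag lines (lines starting with '@')."""
    tags = set()
    for line in feature_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("@"):
            tags.update(TAG.findall(stripped))
    return tags


def uncovered_journeys(prd_journeys: set[str], feature_texts: list[str]) -> set[str]:
    """Return PRD journey IDs that no feature file is tagged with."""
    covered = set()
    for text in feature_texts:
        covered |= journeys_in_feature(text)
    return prd_journeys - covered
```

This only proves a scenario claims to cover a journey, not that its steps still exercise it, so it complements human review rather than replacing it.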

Verdicts on major claims

ADRs + enforcement are valuable for humans and agents — agree, high confidence. Evidence: transcript gives a concrete ORM/template import rule; ADR practice is independently established by Nygard/Fowler/adr.github.io. Practical takeaway: do not stop at prose; connect ADRs to checks.
BDD/Cucumber is “suddenly useful again” for AI workflows — mixed, medium confidence. Cucumber is genuinely readable/executable and can bridge PRDs to tests. What is overclaimed: the material reviewed here does not prove Gherkin is superior to EARS, Playwright-native tests, or other structured acceptance criteria for agents. Practical takeaway: use BDD for high-value journeys where readability matters; do not impose it everywhere.
Design systems remain the way to get consistent UI from agents — agree, high confidence. The claim matches established UI engineering practice. Agents increase the need for reusable components and visual constraints because they otherwise invent variants.
Git hooks/CI/linters can keep agents consistent — agree with caveats, high confidence. Deterministic feedback is strong, but only for rules you can encode. Practical takeaway: automate style, boundaries, and invariants; keep architecture judgment in review.
This solves specification drift — disagree if implied, medium confidence. The talk acknowledges enforcement but does not fully solve doc-code divergence. Practical takeaway: add explicit drift audits and treat docs as versioned artifacts needing review.

Screen-level insights

0:38 frame — speaker identity and context. The slide shows “Michal Cichra,” Spec 27 and Safe Intelligence logos, and “Find me at the booth,” anchoring the talk in an AI engineering conference setting before the acronym-heavy content begins.
3:13 frame — concrete ADR example. The slide titled “ADR: what it looks like” shows a code-style ADR with fields including status: Accepted, enforced_by: import-linter, and file_patterns. The visible decision bans ORM queries in templates to prevent hidden N+1 queries. This matters because it demonstrates the talk’s core idea: docs become useful when tied to enforcement metadata and file scope.
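
To make the slide's idea concrete, here is a minimal sketch of a custom lint check driven by ADR enforcement metadata. The slide names import-linter as the enforcer; this hand-rolled stand-in uses Python's `ast` module instead and does not reproduce import-linter's configuration. The ADR path follows the file name suggested in these notes, while the file patterns and the banned module names (`sqlalchemy`, `django.db`) are assumptions chosen for illustration.

```python
import ast
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class AdrRule:
    """Enforcement metadata mirrored from an ADR (all values are illustrative)."""
    adr_path: str
    file_patterns: tuple[str, ...]
    forbidden_modules: tuple[str, ...]


# Hypothetical rule: patterns and module names are assumptions for this sketch.
NO_ORM_IN_TEMPLATES = AdrRule(
    adr_path="docs/adr/0001-no-orm-in-templates.md",
    file_patterns=("app/templates/*.py", "app/rendering/*.py"),
    forbidden_modules=("sqlalchemy", "django.db"),
)


def imported_modules(source: str) -> list[tuple[int, str]]:
    """Return (line number, module name) for every absolute import, in line order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.extend((node.lineno, alias.name) for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.append((node.lineno, node.module))
    return sorted(found)


def check_file(path: str, source: str, rule: AdrRule) -> list[str]:
    """Return violation messages that link back to the ADR."""
    if not any(fnmatch(path, pattern) for pattern in rule.file_patterns):
        return []
    violations = []
    for lineno, module in imported_modules(source):
        for banned in rule.forbidden_modules:
            if module == banned or module.startswith(banned + "."):
                violations.append(
                    f"{path}:{lineno}: import of '{module}' is not allowed here; "
                    f"see {rule.adr_path} for the reason"
                )
    return violations
```

The message carries the ADR path so that a human or an agent hitting the failure can read the rationale, which is the link between prohibition and reason that the talk emphasizes.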

My read / why it matters

This is one of the more practical agent-coding talks because it avoids claiming that better prompts alone solve consistency. The durable pattern is: write down intent, make it discoverable, encode what can be encoded, and force both humans and agents through the same feedback loop. The weak point is drift: every extra artifact is another thing that can lie. The best implementation is therefore small, enforceable, and audited.

Verification notes

Source/evidence audit: checked transcript, comments, frame descriptions, Cucumber docs, adr.github.io/Nygard/Fowler references, and Microsoft document-corruption discussion.
Transcript/comment/frame fidelity audit: claims about ADRs, PRDs, BDD, design systems, hooks, CI, import boundaries, context compaction, and drift are tied to transcript timestamps or comments; screen claims are limited to visible frame content.
Hallucination/overclaim audit: softened unproven claims about Gherkin superiority and agent self-review; marked specification drift as unresolved.
Actionable Insights audit: top bullets include first steps, tools/links, evaluation criteria, and cautions; no bullet depends only on unsupported transcript enthusiasm.
Residual uncertainty: no full slide deck or repo for Safe Intelligence Spec 27 was verified here; external claims about Microsoft’s paper are based on search-result summaries and Microsoft/blog references, not a full paper review.