How we solved Context Management in Agents — Sally-Ann Delucia

AI Engineer · 16:16 · Added May 18, 4:40 pm GMT+8

Actionable Insights

Separate conversational state from durable memory: Treat raw chat and trace history as evidence, not memory. Store durable facts, plans, tool results, and user-visible decisions in structured state that can be retrieved independently. Common agent architectures already separate scratchpads, plans, memory stores, retrieved evidence, and current-turn context, which matches the talk’s “context versus memory” point. One commenter notes that all compression is lossy, and that is the important caveat: there is no magic compression layer, only explicit tradeoffs about what evidence to preserve. Roll this out as a small, reversible pilot: write down the exact input, expected output, owner, and success metric before changing the wider workflow. Treat the first run as an evaluation, not a migration: capture before/after examples, note where the method saves time or improves quality, and keep the old path available until the new one passes repeated checks. The main failure mode is overgeneralizing the demo beyond the evidence: if the talk or comments only showed a narrow case, keep the rollout narrow and require fresh proof before broad adoption.
Design context selection as product UX: Before optimizing tokens, decide what the agent should remember, forget, ask about, or surface to the user. Alex failed when follow-up questions lost the prior referent; that is a UX failure, not just a token-limit failure. Delucia’s framing is that context management is not only an engineering optimization: it controls the user experience and the agent’s ability to reason over prior work.
Avoid naive truncation and generic summarization as the only controls: Truncation broke follow-ups, and summarization was inconsistent. Build policy-driven context packs instead: current objective, relevant traces/spans, active plan, prior user constraints, and explicit omissions. This targets the loop Alex hit, where the same traces and spans it needed to analyze also overwhelmed the context window.
Use scoped subagents for trace-heavy work: For long trace/span analysis, send bounded slices to specialist agents or tools, then return structured findings. Measure whether the parent gets enough evidence to answer follow-ups. This only works if the handoff schema preserves evidence and uncertainty; otherwise it just hides lost context behind another model call.
Add eval cases for context continuity: Ask a question, ask a follow-up using a pronoun or prior label, then ask for pattern analysis across traces. Fail the build if the agent forgets the referent or invents missing context. The last step matters because of scale: one trace is already large, and pattern analysis across traces multiplies the context.
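
A continuity check of this kind could be sketched as follows. This is a minimal illustration, not the Alex team's eval harness: the `ask(message) -> str` interface, the `Turn` and `StubAgent` names, and the scripted messages are all assumptions made for the example. It only covers the forgotten-referent case; detecting invented context would need a separate check.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """One scripted user message plus the terms the reply must mention."""
    message: str
    must_mention: List[str] = field(default_factory=list)


def run_continuity_case(ask: Callable[[str], str], turns: List[Turn]) -> List[str]:
    """Play the turns in order and return one failure message per missing term."""
    failures = []
    for index, turn in enumerate(turns):
        reply = ask(turn.message).lower()
        for term in turn.must_mention:
            if term.lower() not in reply:
                failures.append(f"turn {index}: reply did not mention {term!r}")
    return failures


class StubAgent:
    """Hypothetical stand-in for a real agent; it optionally remembers a label."""

    def __init__(self, remember: bool = True):
        self.remember = remember
        self.last_label = None

    def ask(self, message: str) -> str:
        if "input B" in message:
            self.last_label = "input B" if self.remember else None
            return "input B is one of the inputs recorded in this trace."
        if self.last_label:
            return f"It ({self.last_label}) also appears in other spans."
        return "I am not sure what 'it' refers to."


# Scripted case: a direct question, then a follow-up that relies on a pronoun.
CASE = [
    Turn("What is input B in this trace?", must_mention=["input B"]),
    Turn("Where else does it appear?", must_mention=["input B"]),
]
```

In a CI job, a non-empty list returned by `run_continuity_case` would fail the build. Swapping `StubAgent(remember=False)` in for the agent shows the failure the talk describes: the follow-up loses its referent.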

Core thesis

Delucia argues that context engineering is the central product problem in serious agent systems: the agent must remember what matters and forget what does not. Her team’s Alex agent ran into the classic loop where the same traces/spans it needed to analyze also overwhelmed the context window. The talk’s most useful framing is that context management is not only an engineering optimization; it controls the user experience and the agent’s ability to reason over prior work.

Comment insights

The comments are mixed. Positive viewers appreciated the presentation, but several expected more novelty and said the advice restated known context-engineering problems. The most useful practitioner comment described a home-lab cluster of narrow agents communicating through an email server, which supports the talk’s subagent/scope theme. Another comment notes that all compression is lossy; that is the important caveat: there is no magic compression layer, only explicit tradeoffs about what evidence to preserve.

Deep research

Supporting sources and concepts:

Andrej Karpathy’s “context engineering” framing supports the move from prompt-only thinking to explicit context curation.
Common agent architectures already separate scratchpads, plans, memory stores, retrieved evidence, and current turn context; this matches the talk’s “context versus memory” point.
Observability/tracing systems such as Arize/Phoenix, OpenTelemetry traces, and LangSmith-style runs show why raw spans become too large for direct LLM ingestion.
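
The “context versus memory” separation can be sketched as a small state object. This is an illustrative assumption, not an architecture described in the talk: the `AgentState` class and its field names are hypothetical. The point it shows is that raw history can be trimmed without touching durable facts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentState:
    """Keeps raw history (evidence) apart from durable facts and the active plan."""
    raw_history: List[str] = field(default_factory=list)  # evidence, safe to trim
    facts: Dict[str, str] = field(default_factory=dict)   # durable memory
    plan: List[str] = field(default_factory=list)         # active plan steps

    def record_turn(self, text: str) -> None:
        """Append a raw turn; this is evidence, not memory."""
        self.raw_history.append(text)

    def remember(self, key: str, value: str) -> None:
        """Promote a fact into durable memory under an explicit key."""
        self.facts[key] = value

    def trim_history(self, keep_last: int) -> None:
        """Drop old raw turns; durable facts and the plan are untouched."""
        self.raw_history = self.raw_history[-keep_last:] if keep_last > 0 else []

    def recall(self, key: str) -> str:
        """Look up a durable fact; returns an empty string if it is unknown."""
        return self.facts.get(key, "")
```

Because `recall` reads from `facts` rather than from `raw_history`, a label such as "input B" survives a trim as long as it was promoted with `remember` first. Deciding what gets promoted is the lossy step, and it is explicit here rather than hidden in a summarizer.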

Limiting evidence:

The talk evidence is mainly product experience from Alex, not a public benchmark.
The extracted comments correctly warn that summarization/compression can lose critical details.
“Use subagents” only works if the handoff schema preserves evidence and uncertainty; otherwise it just hides lost context behind another model call.
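
One way to make the handoff caveat concrete is a schema that refuses findings without evidence. The sketch below is an assumption for illustration: the `Finding`, `Handoff`, and `validate_handoff` names, the span-id strings, and the 0-to-1 confidence scale are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    """One claim from a subagent, tied to the spans that support it."""
    claim: str
    evidence_span_ids: List[str]
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)


@dataclass
class Handoff:
    """What a subagent returns to the parent agent."""
    task: str
    findings: List[Finding] = field(default_factory=list)
    omitted: List[str] = field(default_factory=list)  # what was not inspected


def validate_handoff(handoff: Handoff) -> List[str]:
    """Return a list of problems; an empty list means the handoff is usable."""
    problems = []
    for index, finding in enumerate(handoff.findings):
        if not finding.evidence_span_ids:
            problems.append(f"finding {index} has no evidence span ids")
        if not 0.0 <= finding.confidence <= 1.0:
            problems.append(f"finding {index} has confidence outside [0, 1]")
    return problems
```

The `omitted` field is the part most easily skipped: it lets the parent tell the difference between "the subagent found nothing" and "the subagent did not look", which is what keeps lost context from being hidden behind the extra model call.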

Verdict

Context management is a product problem: Agree, high confidence. User follow-ups and agent continuity depend on what the system chooses to carry forward.
Naive truncation is insufficient: Agree, high confidence. The transcript gives direct failure examples where the agent forgot what “input B” referred to.
Summarization alone solves the issue: Disagree. The talk itself says summarization was inconsistent and uncontrolled.
Subagents help escape context limits: Mixed, medium confidence. They can help if tasks are decomposed and outputs are structured; they can also lose evidence.
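
The truncation verdict can be illustrated with a toy comparison between keeping the last N turns and building a small context pack that pins referents and records what was left out. Everything here is a sketch under assumptions: the function names, the pack fields, and the sample history are made up for the example and do not come from Alex.

```python
from typing import Dict, List


def truncate(history: List[str], max_items: int) -> List[str]:
    """Naive control: keep only the most recent items."""
    return history[-max_items:] if max_items > 0 else []


def build_context_pack(
    objective: str,
    history: List[str],
    pinned: Dict[str, str],
    max_recent: int,
) -> Dict[str, object]:
    """Policy-driven pack: objective, pinned referents, recent turns, omission count."""
    recent = history[-max_recent:] if max_recent > 0 else []
    return {
        "objective": objective,
        "pinned": dict(pinned),                        # referents that must survive
        "recent": recent,
        "omitted_turns": len(history) - len(recent),   # explicit omission marker
    }


# Made-up conversation in which the last question depends on an early label.
HISTORY = [
    "user: compare input A and input B",
    "agent: input B has more retrieved documents",
    "user: show the slowest span",
    "agent: the slowest span is the retrieval call",
    "user: why does it have more documents?",
]
```

With `truncate(HISTORY, 2)`, no remaining turn mentions "input B", so the final question has no referent. A pack built with `pinned={"input B": "label from the first turn"}` keeps the referent and states how many turns were dropped, so the agent can ask rather than guess.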

Screen-level insights

0:07 title/opening: Establishes the talk is about context windows, Alex, and lessons from building an agent for close to a year.
1:08 Alex product screenshot: Shows Alex as an AI harness with planning, 40-plus skills, prompt optimization, data generation, data augmentation, and annotations. This matters because the context problem comes from a real multi-workflow product surface.
3:08 trace/span framing: The nearby transcript describes prompts, metadata, user interactions, and large trace data. This is the key scale pressure: one trace is already large; pattern analysis across traces multiplies context.
4:10–4:40 vicious loop: The slide/transcript describes Alex analyzing traces, spans growing, context limits being hit, and needing a strategy. This is the concrete failure mode the analysis should preserve.

Verification notes

Verification passes performed: source/evidence audit against the extracted transcript and comments; fidelity audit tying claims to the trace/context examples; hallucination audit to avoid claiming benchmarks or proprietary implementation details; Actionable Insights audit to convert the talk into a concrete context-management workflow. Residual uncertainty: the extraction does not include full product internals or measured eval results for Alex.

Actionable Insights audit: expanded to the newer detailed format with fuller implementation notes, evaluation checks, and cautions where the existing evidence supports elaboration.