Company Intelligence ·Apr 3, 2026 ·10 min read

Context Engineering: The Tactical Playbook (2026 Patterns)

Q: What is the most important context engineering pattern?

Progressive disclosure and scratchpad persistence are the two most-adopted. Skipping either usually produces token-cost problems by month 3 of production.

Q: How do I budget context?

Start with explicit percentages: 10% system, 15% tools, 50% retrieved content, 20% recent history, 5% scratchpad summary. Adjust from there based on benchmarks.

Q: Why does prompt caching matter?

It cuts repeated-prefix cost by ~80% for stable components. For a system serving 1M API calls per month, this can be the difference between a profitable product and an unprofitable one.

Six production patterns from Anthropic, Cognition, Manus, Windsurf, and Microsoft that separate teams that ship reliable agents from teams that don’t.

TL;DR

Context engineering is the discipline; this post is the playbook. Six production patterns separate teams that ship reliable agents from teams that don’t: (1) progressive disclosure for skills and tools, (2) scratchpad persistence outside the context window, (3) session-bounded compression, (4) per-agent scope isolation, (5) context-aware caching, (6) explicit context budgeting. Every pattern is documented in production code from Anthropic, Cognition, Manus, Windsurf, or Microsoft. This post is the tactical how-to - assuming you already know what context engineering is.

Pattern 1 - Progressive disclosure

Problem: loading all available tools and skills into context wastes tokens on things the agent won’t use.

Pattern: multi-level loading. At Level 0 the agent sees a catalog of available capabilities (names + one-line descriptions, ~3,000 tokens). At Level 1, when the agent decides to use a specific capability, it loads the full content.

Production example: Hermes Agent’s skill system. Skills in `~/.hermes/skills/` are Markdown documents. Level 0 catalog is presented to the agent; individual skills load on demand.

Anti-pattern: loading every skill into every prompt, turning the system prompt into a 50K-token dump.

Pattern 2 - Scratchpad persistence outside the context window

Problem: long-running agents hit context-window limits and lose important state.

Pattern: save information outside the window so the agent can reference it later. The scratchpad can be a file, a database record, or a runtime state field.

The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan.

Anthropic - Multi-Agent Researcher

Production example 2: Manus treats the filesystem as infinite memory - agents write intermediate results to files and load only summaries back into context.

Pattern 3 - Session-bounded compression

Problem: accumulated conversation history grows linearly and compounds over multi-hour sessions.

Pattern: periodically compress older conversation turns into summaries while keeping recent turns verbatim. The compression threshold is tunable - smaller windows compress more aggressively.

Production example: ChatGPT’s memory system. Separate memory store of user facts; loads only pertinent memories into context for each turn.

Production example 2: Cognition’s Devin engineering fix - enabled 1M-token context beta but capped actual usage at 200K to prevent "context anxiety" where the model senses the window limit and takes shortcuts.

Pattern 4 - Per-agent scope isolation

Problem: in a multi-agent fleet, every agent seeing the full context pollutes each agent’s reasoning with irrelevant detail.

Pattern: each agent gets a scoped slice of the shared context based on its role. Research agent sees research context. Writer agent sees writing context. Reviewer agent sees review criteria.

Production example: Genios’s per-agent scoping primitive. Every `/v1/context` call carries an `agent_id`, and the returned context bundle is filtered to what that agent has permission and need to see.

Anti-pattern: dumping the full shared graph into every agent’s context. Drives token cost up and signal-to-noise ratio down.

Pattern 5 - Context-aware caching

Problem: many context components (system prompts, tool descriptions, stable facts) don’t change between calls - but get re-sent every time.

Pattern: use prompt caching (Anthropic’s `cache_control: {"type":"ephemeral"}` or OpenAI’s equivalent) for stable prefixes. Cache system prompts, tool schemas, and long stable context sections.

Production example: Genios’s extraction pipeline caches the extraction system prompt across batched signals. Cost reduction: ~80% on the cached prefix.

Pattern 6 - Explicit context budgeting

Problem: teams optimize prompts instead of optimizing what-goes-in-the-window. The window fills up with whatever retrieval returned.

Pattern: define a token budget per category - system prompt (fixed X%), tool descriptions (Y%), retrieved context (Z%), scratchpad (W%), recent turns (V%). Retrieval is explicitly limited.

Production example: Genios’s three context tiers (Short <= 800 tokens, Medium <= 1,800 tokens, Long <= 2,400 tokens). The caller picks the tier based on the task; retrieval fills the tier budget with the highest-scoring content.

The meta-pattern - memory vs context discipline

Every pattern above depends on one decision: what lives in memory vs what lives in context. Galileo’s framing: "Memory is unlimited in size, cheap to store. Context is limited in size, expensive per token." (Galileo)

The discipline

Treating them the same is the root cause of most context-engineering failures. Persist everything in memory. Curate carefully into context. This is the discipline.

What is the most important context engineering pattern?

Progressive disclosure and scratchpad persistence are the two most-adopted. Skipping either usually produces token-cost problems by month 3 of production.

How do I budget context?

Start with explicit percentages: 10% system, 15% tools, 50% retrieved content, 20% recent history, 5% scratchpad summary. Adjust from there based on benchmarks.

Why does prompt caching matter?

It cuts repeated-prefix cost by ~80% for stable components. For a system serving 1M API calls per month, this can be the difference between a profitable product and an unprofitable one.

Book a call More writing