Context Engineering: The Discipline That Replaced Prompt Engineering
Why prompt engineering stopped being enough, the four core strategies (write - select - compress - isolate), and the difference between memory and context.
Context engineering is the practice of dynamically assembling everything an LLM sees at inference time - instructions, retrieved facts, tools, memory, state - so the model can complete a task. The term was popularized by Andrej Karpathy and Shopify CEO Tobi Lutke in June 2025 and has now replaced "prompt engineering" as the standard vocabulary for building production LLM applications. The four core strategies - write, select, compress, isolate - are now a shared framework across LangChain, Anthropic, Windsurf, and Manus. This post explains what context engineering is, why prompt engineering alone stopped working, and what it means for the memory and agent-infrastructure stack in 2026.
The definition, in Karpathy’s words
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
Tobi Lutke, a few days earlier: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
Why prompt engineering stopped being enough
In 2022-2024, "prompt engineering" meant crafting the one-shot instruction that made GPT-3.5 or early GPT-4 produce a useful answer. Write a clever system prompt. Give it a few-shot example. Hope for the best.
This worked when tasks were one-turn question-answering. It fell apart the instant tasks became multi-step and agentic. Galileo’s deep dive makes the shift explicit: "A typical agent task requires around 50 tool calls, accumulating massive contexts that can make or break system performance." (Galileo)
When an agent makes 50 tool calls in a single trajectory, each call appends its request and result to the context window, so the window grows with every step. The problem stops being "write the right prompt" and becomes "manage the window." That is a systems problem, not a wordsmithing problem.
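To make the growth concrete, here is a back-of-the-envelope sketch. The numbers (a 3,000-token starting prompt and 2,000 tokens added per tool call) are illustrative assumptions, not figures from Galileo or any vendor, and `context_growth` is a hypothetical helper.

```python
def context_growth(num_calls: int, tokens_per_call: int, base_tokens: int) -> list[int]:
    """Return the context size (in tokens) the model reads at each step."""
    sizes = []
    total = base_tokens
    for _ in range(num_calls):
        sizes.append(total)        # what the model reads on this step
        total += tokens_per_call   # the call's result is appended to history
    return sizes


# Assumed numbers, chosen only for illustration.
sizes = context_growth(num_calls=50, tokens_per_call=2_000, base_tokens=3_000)
final_window = sizes[-1]       # window size on the last step
total_processed = sum(sizes)   # tokens read across the whole trajectory
print(final_window, total_processed)
```

Under these assumptions the window grows linearly, but the total number of tokens the model has to read across the trajectory grows quadratically with the number of calls, which is why unmanaged history gets expensive quickly.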
The four core strategies of context engineering
LangChain’s framework, which has become the de facto standard, identifies four operations (LangChain Blog):
1. Write - save information outside the context window so the agent can reference it later
The canonical pattern is the scratchpad. Anthropic’s multi-agent researcher uses this: "The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan." A scratchpad can be implemented as a file, as a tool call that writes to external storage, or as a field in the agent's runtime state.
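A minimal sketch of a file-backed scratchpad follows. The `Scratchpad` class, its method names, and the JSON file layout are all assumptions made for illustration; this is not Anthropic's implementation.

```python
import json
import tempfile
from pathlib import Path


class Scratchpad:
    """File-backed notes that survive context truncation (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def write(self, key: str, value: str) -> None:
        notes = self._load()
        notes[key] = value
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def read(self, key: str, default: str = "") -> str:
        return self._load().get(key, default)

    def _load(self) -> dict[str, str]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))


workdir = Path(tempfile.mkdtemp())
pad = Scratchpad(workdir / "scratchpad.json")
pad.write("plan", "1) survey sources 2) compare claims 3) draft summary")

# Later, after the window has been truncated, a fresh object reloads the plan
# from disk instead of relying on anything still being in context.
recovered = Scratchpad(workdir / "scratchpad.json").read("plan")
```

The point of the design is that the plan lives outside the window, so truncation cannot lose it.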
2. Select - pull relevant context into the window when needed
This is where memory layers live. ChatGPT maintains a separate memory store of user facts and retrieves relevant memories based on conversation similarity. Cursor gives users and agents explicit control over which files get loaded. The selection mechanism depends on the store: tool-based scratchpads are read via tool calls; memory systems are read via retrieval queries.
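Similarity-based selection can be sketched as follows. Production systems typically use learned embeddings; the bag-of-words cosine here is a stand-in chosen so the example runs with only the standard library, and `select_memories` is a hypothetical function, not ChatGPT's or Cursor's API.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_memories(query: str, memories: list[str], k: int = 2) -> list[str]:
    """Return up to k memories with non-zero similarity, best first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(m)), m) for m in memories]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked if score > 0][:k]


memories = [
    "User prefers Python for data analysis scripts",
    "User lives in Toronto",
    "User is allergic to peanuts",
]
selected = select_memories("Write a Python script to analyze sales data", memories, k=1)
```

Only the selected memory would be loaded into the window; the other two stay in the store. Note that word overlap is easily fooled by common words such as "to", which is one reason real systems use embeddings or add other signals.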
3. Compress - summarize or restructure accumulated history to fit more into the window
Manus treats the file system as infinite memory: agents write intermediate results to files and load only summaries into context. Full content remains accessible via file paths, achieving high compression with recoverability.
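The file-offloading idea can be sketched like this. It is an assumption-laden illustration, not Manus's code: `offload` is a hypothetical helper, and the "summary" is just a truncated preview where a real system might call a model to summarize.

```python
import tempfile
from pathlib import Path


def offload(result: str, name: str, workdir: Path, preview_chars: int = 80) -> str:
    """Write the full result to disk and return a short entry for the context."""
    path = workdir / f"{name}.txt"
    path.write_text(result, encoding="utf-8")
    # Assumption: a truncated preview stands in for a real summary.
    preview = result[:preview_chars].replace("\n", " ")
    return f"[{name}] {len(result)} chars saved to {path} | preview: {preview}"


workdir = Path(tempfile.mkdtemp())
full_result = "Quarterly revenue table\n" + "row,value\n" * 500

# Only this short entry goes into the context window.
entry = offload(full_result, "revenue_q3", workdir)

# The full content stays recoverable through the file path.
restored = (workdir / "revenue_q3.txt").read_text(encoding="utf-8")
```

The context entry carries the path, so the agent can reread the full file if a later step needs the detail.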
4. Isolate - give different agents access to different slices of context
Multi-agent systems are a form of isolation. The research agent sees research context. The writer agent sees writing context. The reviewer agent sees review criteria. Shared memory at the organizational layer keeps them coherent; isolated context at the operational layer keeps them focused.
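One way to express this split is to tag each piece of context with the roles allowed to see it. The sketch below assumes a hypothetical `ContextItem` type and a reserved `"shared"` scope for organization-wide material; neither comes from a specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    text: str
    scopes: frozenset[str]  # agent roles allowed to see this item


def slice_for(role: str, items: list[ContextItem]) -> list[str]:
    """Return the items visible to one agent: its own plus anything shared."""
    return [
        item.text
        for item in items
        if role in item.scopes or "shared" in item.scopes
    ]


items = [
    ContextItem("Company style guide: plain language", frozenset({"shared"})),
    ContextItem("Source A: 2024 survey results", frozenset({"research"})),
    ContextItem("Draft outline v2", frozenset({"writer"})),
    ContextItem("Review checklist: cite every claim", frozenset({"reviewer"})),
]

research_view = slice_for("research", items)
writer_view = slice_for("writer", items)
```

Each agent gets the shared item plus its own slice, and nothing from the other agents' working sets.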
Memory vs context - the distinction that determines your architecture
This is one of the most misunderstood distinctions in agent design. Galileo’s framing is clean: "Memory is your agent’s long-term storage. It is the persistent information stored externally that survives beyond individual interactions. It’s unlimited in size, cheap to store, but requires explicit retrieval to be useful."
Translation: memory is the hard drive. Context is the RAM. Memory doesn’t directly influence the model unless actively loaded into context. The retrieval step - moving data from memory into context - is where most systems fail.
Windsurf’s engineering team ran into exactly this. As Galileo summarizes their experience: "Simple embedding search breaks down as memory grows. They evolved to a multi-technique approach combining semantic search, keyword matching and graph traversal. Each method handles different types of query." (Galileo)
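The source does not say how Windsurf merges results from its different retrievers. One common way to combine several ranked lists is reciprocal rank fusion (RRF), sketched below as a swapped-in technique. The three input rankings are made-up examples standing in for the output of semantic, keyword, and graph retrievers.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three retrievers for the same query.
semantic = ["auth.py", "session.py", "README.md"]
keyword = ["login_handler.py", "auth.py"]
graph = ["auth.py", "login_handler.py", "session.py"]

fused = reciprocal_rank_fusion([semantic, keyword, graph])
```

Items that several retrievers agree on rise to the top, while an item found by only one method still makes the list. That matches the claim that each method handles a different type of query.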
The production examples - how ChatGPT, Claude Code, and Manus actually do it
- ChatGPT: separate memory store of user facts and preferences. Retrieves relevant memories based on conversation similarity. Loads only pertinent memories into context for each turn.
- Claude Code: uses working memory (context) for active task state, with project files as persistent memory. No vector database. Uses grep, file-tree traversal, and explicit file reads.
- Manus: file system as infinite memory. Agents write intermediate results to files and load only summaries into context. High compression + full recoverability.
The pattern is consistent: external persistent storage + intelligent selection + compression + per-agent isolation.
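Whatever the storage and selection mechanisms, the final step is packing the chosen material into a bounded window. The sketch below is a greedy assembler under stated assumptions: `estimate_tokens` uses a crude four-characters-per-token guess, the relevance scores are supplied by the caller, and none of the names reflect a specific product's API.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble_context(instructions: str, candidates: list[tuple[float, str]], budget: int) -> str:
    """Greedily pack the highest-scoring candidates that still fit the budget."""
    parts = [instructions]
    used = estimate_tokens(instructions)
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; try smaller items
        parts.append(text)
        used += cost
    return "\n\n".join(parts)


candidates = [
    (0.9, "Customer is on the Pro plan."),
    (0.8, "Full ticket history: " + "..." * 100),
    (0.4, "Customer prefers email."),
]
context = assemble_context("You are a support agent.", candidates, budget=20)
```

With a 20-token budget the long ticket history is skipped even though it scores well, which is the case where a compressed summary would be loaded instead.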
The benchmarks that measure this
LongMemEval (arXiv 2410.10813) evaluates five context-engineering-adjacent capabilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. State-of-the-art commercial systems reach only 30-70% accuracy, even in a setting much simpler than the full LongMemEval-S benchmark. The gap between the benchmark ceiling and real-world performance is where context engineering lives.
LoCoMo (Maharana et al., 2024) is the older benchmark, now criticized for relying on 32K-context-era assumptions. Hindsight’s AMB criticism lands: "Both datasets come from an era of 32K context windows, when fitting a long conversation into a single model call wasn’t possible." With million-token context windows now available, a naive "dump everything into context" approach scores competitively on these older benchmarks - which means the benchmarks themselves are now measuring the wrong thing.
What "context engineering" means for the memory-layer market
Context engineering reframes the memory market. The question is no longer "which memory layer has the best retrieval?" It is "which memory layer is easiest to integrate into a context-engineering pipeline?"
That changes what matters:
- Rich schemas beat flat key-value stores. Agents need typed facts with evidence, confidence, and provenance - not just strings.
- Proactive push beats reactive pull. If the agent has to know what to ask for, the memory layer is losing context-engineering battles it could win.
- Per-agent scoping is table stakes. The isolation strategy requires the memory layer to serve different slices to different agents from the same underlying graph.
- Audit trails matter more than benchmark scores. Context that can be traced back to a specific signal with a timestamp is context that can be governed. Anonymous embeddings cannot.
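The list above can be made concrete with a small data model. This is a hypothetical schema for illustration only: the `Fact` fields, the `facts_for` helper, and the sample records are assumptions, not the schema of any particular memory product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    confidence: float           # 0.0 to 1.0
    source: str                 # provenance: where the fact was observed
    observed_at: datetime       # timestamp for the audit trail
    visible_to: frozenset[str]  # per-agent scoping

    def render(self) -> str:
        stamp = self.observed_at.strftime("%Y-%m-%d")
        return (
            f"{self.subject} {self.predicate} {self.value} "
            f"(confidence {self.confidence:.2f}, source: {self.source}, {stamp})"
        )


def facts_for(agent: str, facts: list[Fact], min_confidence: float = 0.5) -> list[str]:
    """Render the facts one agent may see, dropping low-confidence ones."""
    return [
        f.render()
        for f in facts
        if agent in f.visible_to and f.confidence >= min_confidence
    ]


seen = datetime(2026, 1, 15, tzinfo=timezone.utc)
facts = [
    Fact("Acme Corp", "renewal_date", "2026-03-01", 0.9, "crm-export", seen,
         frozenset({"sales", "support"})),
    Fact("Acme Corp", "champion", "unknown", 0.3, "email-thread", seen,
         frozenset({"sales"})),
]
sales_view = facts_for("sales", facts)
```

Because every rendered line carries its source and date, anything that reaches the model can be traced back to where it came from, which a bare string or an anonymous embedding cannot offer.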
GeniOS is built as context-engineering infrastructure. Rich typed schemas (not flat strings). Proactive push via webhooks and SSE (not just reactive pull). Per-agent scoping as a first-class primitive. A full audit trail on WORM storage, backed by S3 Object Lock. The rest of the memory market will get here. The ones that get here first are the ones that treated context engineering as an architectural principle, not a talking point.
What is context engineering?
The discipline of dynamically filling an LLM’s context window with the right information at each step of an agent’s trajectory. Popularized by Andrej Karpathy and Tobi Lutke in June 2025.
How is context engineering different from prompt engineering?
Prompt engineering is writing a single good instruction. Context engineering is assembling the entire information environment an LLM sees - instructions, retrieved facts, tools, memory, state - at every step.
What are the four core strategies of context engineering?
Write (save outside the window), Select (pull in when needed), Compress (summarize accumulated history), Isolate (give each agent its own slice).
What is the difference between memory and context?
Memory is persistent external storage. Context is the active content inside the LLM’s window for a specific inference call. Memory doesn’t influence the model unless actively loaded into context. The retrieval step is where most systems fail.
What benchmarks measure context engineering?
LongMemEval and LoCoMo are the standard benchmarks. Both have known limitations - LoCoMo was designed for 32K-context-era systems; LongMemEval covers conversation but not agentic multi-tool workflows. Vectorize’s Hindsight AMB is the 2026 attempt to fix this.
What is a Context Layer for AI agents?
A Context Layer for AI agents is the infrastructure layer responsible for assembling, scoring, and injecting the right organizational context into each agent call. It sits above the Memory Layer (which stores facts) and below the agent runtime (which executes tasks). GeniOS is a production-grade Context Layer.