Agentic Infrastructure · Apr 6, 2026 · 11 min read

Harness Engineering: The Discipline That’s Eating Prompt Engineering

Harness Engineering: OpenAI named it, Martin Fowler formalized it, Anthropic and Microsoft adopted it. The two-part framework - guides + sensors - and what to build first.

TL;DR

Harness engineering is the practice of designing the entire environment an AI agent operates in: the feedforward guides that steer it before it acts, the feedback sensors that validate its output after it acts, and the feedback loops that close the gap between the two. The term was popularized by OpenAI’s Codex team in February 2026, formalized on martinfowler.com through Birgitta Böckeler’s "guides + sensors" framework, and adopted across Anthropic, Datadog, Microsoft, and Milvus. On SWE-bench, the same model’s score varies by 20-30 percentage points across harness implementations, which suggests the model is no longer the main bottleneck.

The definitional shift

"We estimate that we built this in about 1/10th the time it would have taken to write the code by hand. Humans steer. Agents execute. We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude."

OpenAI - Harness Engineering, Feb 11, 2026

The shift is this: prompt engineering asked "what do I say to the model?" Context engineering asked "what does the model see?" Harness engineering asks "what environment does the agent operate in?"

A harness is everything around the model: tools, knowledge sources, validation logic, architectural constraints, feedback loops, memory, and lifecycle management. Prompt engineering and context engineering are components inside the harness, which contains and orchestrates them alongside every other agent subsystem. (Miraflow)

For the model selection layer above the harness, see Best LLM for AI Agents in 2026.

The two-part framework - guides and sensors

Birgitta Böckeler’s framework (martinfowler.com, 2026) is the canonical taxonomy.

Guides (feedforward controls)

They steer the agent before it acts. Examples:

`AGENTS.md` / `CLAUDE.md` files documenting project norms.
System prompts with architectural constraints.
Coding conventions and style rules.
Pre-execution plans and specifications.
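
As a rough illustration of the feedforward side, the sketch below gathers whichever guide files exist at a project root and folds them, together with a list of architectural constraints, into a single system prompt. The function names (`load_guides`, `build_system_prompt`) and the prompt layout are assumptions made for this example; they are not taken from any of the cited frameworks.

```python
from pathlib import Path

# Guide file names mentioned in the article; everything else here is hypothetical.
GUIDE_FILENAMES = ("AGENTS.md", "CLAUDE.md")


def load_guides(project_root: str) -> str:
    """Concatenate whichever guide files exist at the project root."""
    root = Path(project_root)
    sections = []
    for name in GUIDE_FILENAMES:
        path = root / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


def build_system_prompt(project_root: str, constraints: list[str]) -> str:
    """Feedforward control: everything assembled here reaches the agent before it acts."""
    parts = ["You are working in this repository. Follow the project guides."]
    guides = load_guides(project_root)
    if guides:
        parts.append(guides)
    if constraints:
        bullet_list = "\n".join(f"- {c}" for c in constraints)
        parts.append("## Architectural constraints\n" + bullet_list)
    return "\n\n".join(parts)
```

Keeping the guides in files rather than hard-coding them in the prompt means a new rule can be added by editing the repository, without touching the harness code.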

Sensors (feedback controls)

They observe and validate the agent’s behavior after it acts. Examples:

Linters and type checkers.
Test suites (unit, integration, e2e).
Output parsers.
Evaluation loops (LLM-as-judge, trajectory analysis).
Production telemetry.
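
A computational sensor can be as simple as a command whose exit code says pass or fail. The following is a minimal sketch under that assumption: `SensorResult`, `run_sensor`, `run_sensors`, and `failure_report` are hypothetical names, and the commands a team would actually register (a linter, a type checker, a test runner) depend on the project.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SensorResult:
    name: str
    passed: bool
    output: str


def run_sensor(name: str, command: list[str], timeout_s: float = 300.0) -> SensorResult:
    """Run one check; a zero exit code counts as a pass."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except FileNotFoundError:
        return SensorResult(name, False, f"command not found: {command[0]}")
    except subprocess.TimeoutExpired:
        return SensorResult(name, False, f"timed out after {timeout_s}s")
    return SensorResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def run_sensors(sensors: dict[str, list[str]]) -> list[SensorResult]:
    """Run every registered sensor in order and collect the results."""
    return [run_sensor(name, command) for name, command in sensors.items()]


def failure_report(results: list[SensorResult]) -> str:
    """Format failing sensors as text the agent can read on its next turn."""
    failed = [r for r in results if not r.passed]
    return "\n\n".join(f"[{r.name}] FAILED\n{r.output}" for r in failed)
```

In a real project the dictionary passed to `run_sensors` might map `"lint"` to something like `["ruff", "check", "."]` and `"types"` to `["mypy", "."]`; those particular tools are examples only. The important part is that the failure text flows back to the agent instead of stopping at a human.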

A harness "acts like a cybernetic governor, combining feed-forward and feedback to regulate the codebase towards its desired state." (martinfowler.com)

The evidence for harness primacy

SWE-bench variance. The same model’s scores differ by 20-30 percentage points across harness implementations. Teams treating model choice as the primary reliability variable are measuring the wrong thing. (MorphLLM)
Microsoft Azure SRE agent. Shifted from 100+ bespoke tools and a prescriptive prompt to a filesystem-based context system. Letting the agent use `readfile`, `grep`, `find`, and shell outperformed specialized tooling - "Intent Met" score rose from 45% to 75% on novel incidents. (Miraflow)
Anthropic three-agent harness. A solo agent cost $9 and produced broken output. The three-agent harness (Planner/Generator/Evaluator) cost $200 and produced functional output: roughly 22x the cost for a working product.
Microsoft SRE deployment. Handled 35,000+ production incidents, dropping time-to-mitigation from 40.5 hours to 3 minutes.

The six components of a modern harness

01 Guide files - `AGENTS.md`, `CLAUDE.md` at the project root, documenting architecture, commands, conventions.
02 Tool interface - clean, well-typed tool schemas. Naming and error surfaces matter. (See Anthropic’s "Writing Effective Tools for Agents.")
03 Context system - memory layer, retrieval, summarization, progressive disclosure.
04 Sensors - linters, type checkers, tests, parsers, evaluators.
05 Sandbox - Docker, SSH, Modal - network-isolated execution. Not optional after Opus 4.6 inferred it was being evaluated and decrypted the answer key on BrowseComp.
06 Observability - distributed tracing, trajectory capture, failure attribution.

Skip any of these and the agent becomes unreliable in a different dimension.
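
To make component 02 concrete, here is a sketch of one file-reading tool with a typed result and a deliberately small error surface. The names (`ToolResult`, `read_file`, the error codes) are hypothetical and do not come from any vendor's tool specification; the point is only that each failure carries a stable code and a hint about what to try next, and that the tool refuses to leave its workspace.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    content: str = ""
    error_code: str = ""  # stable, machine-readable identifier
    hint: str = ""        # what the agent could try next


def read_file(workspace: str, relative_path: str, max_bytes: int = 20_000) -> ToolResult:
    """Read a text file, refusing any path that resolves outside the workspace."""
    root = Path(workspace).resolve()
    target = (root / relative_path).resolve()
    if target != root and root not in target.parents:
        return ToolResult(
            ok=False,
            error_code="outside_workspace",
            hint="Use a path relative to the workspace root.",
        )
    if not target.is_file():
        return ToolResult(
            ok=False,
            error_code="not_found",
            hint="List the directory first to check the file name.",
        )
    # Truncate large files so one call cannot flood the context window.
    data = target.read_bytes()[:max_bytes]
    return ToolResult(ok=True, content=data.decode("utf-8", errors="replace"))
```

A path check like this complements the sandbox in component 05 but does not replace it: it limits what this one tool will read, while network and process isolation still have to come from the execution environment.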

The three starting points for a new harness

From Miraflow’s practical guide:

01 Create a guide file (`CLAUDE.md` or `AGENTS.md`) at the project root, documenting structure, build commands, and rules. Add a new rule every time the agent repeats a mistake.
02 Wire up computational sensors - pre-commit hooks running linters and type checkers on every change.
03 Close the feedback loop - the agent runs tests after making changes and attempts fixes before declaring success.
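
The third step can be sketched as a bounded retry loop. In this minimal example, `attempt_change` stands in for whatever invokes the agent and `run_checks` for whatever runs the sensors; both are hypothetical callables supplied by the caller, and the default of three attempts is an arbitrary choice for illustration.

```python
from typing import Callable, Optional


def run_with_feedback(
    attempt_change: Callable[[Optional[str]], None],
    run_checks: Callable[[], Optional[str]],
    max_attempts: int = 3,
) -> bool:
    """Return True only if the checks pass within max_attempts."""
    failure: Optional[str] = None
    for _ in range(max_attempts):
        # The agent edits the code; on retries it sees the previous failure text.
        attempt_change(failure)
        # None means every sensor passed; otherwise the string describes the failure.
        failure = run_checks()
        if failure is None:
            return True
    # Out of attempts: report failure instead of letting the agent declare success.
    return False
```

The attempt cap matters as much as the loop itself: without it, an agent that cannot fix a failing test will keep spending tokens, and the harness has no point at which to escalate to a human.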

"You don’t need to build every mechanism at once; these three starting points typically produce the fastest practical return."

Where GeniOS fits inside a harness

A GeniOS-powered agent uses the context layer of the harness differently from the traditional setup. Instead of a flat retrieval step, the agent has a typed graph of organizational facts (Section A), proactive push of recommendations when the graph changes (Section B), and a full audit trail of what the agent saw, why, and with what confidence. The harness still wraps the agent with guides, sensors, a sandbox, and observability. GeniOS replaces only the retrieval-and-memory slice of the harness with this graph-based layer.

What is harness engineering?

The practice of designing the entire environment around an AI agent - guides (feedforward), sensors (feedback), context systems, sandboxes, observability - with the goal of making the agent reliable in production.

Who coined harness engineering?

The term was popularized by OpenAI’s February 2026 "Harness Engineering" blog post, with the guides/sensors framework formalized by Birgitta Böckeler on martinfowler.com.

Is harness engineering the same as prompt engineering?

No. Prompt engineering is a single component inside the harness. The harness includes tools, context, memory, validation, sandbox, and observability.

What is Harness Engineering for AI agents?

Harness Engineering is the discipline of building the execution scaffold around an AI model - the context system, tool interface, memory layer, retrieval pipeline, and evaluation harness - that determines agent reliability in production. SWE-Bench Pro research shows 22+ point performance variance across harnesses on identical models.
