The Next AI Bottleneck Is Not the Model. It Is the Harness.

Contents

What a harness actually is

The three timescales of agent operation

The three hard problems

Why one-shot evaluation misses the point

What this means for builders

Frequently Asked Questions

For the past several years, the dominant story of AI progress has been a story of scale. Bigger models, more parameters, more data, more compute. The narrative implicitly promised that the path to more useful AI ran through better foundation models, full stop.

For agentic AI, that story is now incomplete. A new paper from UC Berkeley argues that the next major bottleneck in agentic systems is not the model itself but the structured execution layer wrapped around it: the harness. Once a foundation model is embedded into tools, terminals, browsers, repositories, memory stores, and external services, its behavior is no longer determined by the model alone. It is determined by the system the model lives inside.

This reframing matters more than it might sound. Almost everyone evaluating AI evaluates the model. The actual product, when an agent runs for hours or days against real systems, is the model and the harness together. If the harness is broken, a smarter model does not save you.

What a harness actually is

An agentic system breaks down into six interacting components, and they show up in roughly the same shape across almost every production harness.

The reasoning substrate is the foundation model itself, or sometimes several models working in concert. This is the part of the system everyone already thinks about.

The memory store holds long-term information the agent needs across sessions: user preferences, project context, prior decisions, accumulated notes.

The context constructor builds and cleans up what actually gets sent to the model on any given turn. It decides what makes it into the context window and what stays out.

The skill-routing layer decides which capability the agent should invoke for each substep of a task.

The orchestration loop manages the sequence of operations, coordinating how the context constructor draws from memory to feed the model on each step.

The verification and governance layer acts as a gatekeeper. It manages permissions, audit trails, and rollbacks, and ensures that intermediate reasoning and external actions are checked before they affect the live environment or get written back to persistent memory.
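One way to see how the six components fit together is a minimal Python sketch. Everything in it is hypothetical: the Harness class, its field names, and the toy "upper" skill are illustration only and do not describe how Claude Code, OpenClaw, or CheetahClaws is actually built.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical wiring of the six components, for illustration only."""
    model: Callable[[str], str]                      # reasoning substrate
    skills: dict[str, Callable[[str], str]]          # targets of the skill-routing layer
    verify: Callable[[str, str], bool]               # verification and governance check
    memory: list[str] = field(default_factory=list)  # memory store
    audit_log: list[str] = field(default_factory=list)

    def build_context(self, task: str, limit: int = 3) -> str:
        # Context constructor: send only the most recent notes plus the task.
        return "\n".join(self.memory[-limit:] + [task])

    def route(self, decision: str) -> Callable[[str], str]:
        # Skill routing: the model's reply is assumed to name a known skill.
        return self.skills[decision]

    def step(self, task: str) -> str | None:
        # One pass of the orchestration loop: context, model, skill, check.
        decision = self.model(self.build_context(task))
        result = self.route(decision)(task)
        if not self.verify(task, result):
            self.audit_log.append(f"rejected {decision}")
            return None  # unverified results never reach memory
        self.audit_log.append(f"accepted {decision}")
        self.memory.append(f"{task} -> {result}")
        return result


# Toy usage: a stub "model" that always picks the hypothetical "upper" skill.
harness = Harness(
    model=lambda context: "upper",
    skills={"upper": str.upper},
    verify=lambda task, result: result == task.upper(),
)
print(harness.step("deploy"))
```

The point of the sketch is the ordering: the result is written to memory only after the verification call passes, and every decision leaves an audit entry either way.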

Different harnesses are built for different purposes. Claude Code is a vendor-built coding tool. OpenClaw operates as a multi-channel personal assistant. CheetahClaws serves as an open-source reference implementation for research. The structural differences between them come from deployment priorities (enterprise reliability versus open-source reproducibility) rather than from differences in the underlying model.

The three timescales of agent operation

A useful way to think about a harness is in terms of three distinct timescales the agent operates on simultaneously.

Prompts operate at the shortest horizon. They define the immediate goal and tell the model what to focus on right now. They are precise but fragile. A carefully crafted prompt that works beautifully on one task often transfers badly to the next.

Skills operate at the task level. They are reusable execution patterns, predefined routines for searching a database, editing a configuration file, or summarizing a document. Skills make individual tasks reliable. They also introduce a new class of problem: when an agent has dozens of skills available, choosing which one to invoke for an ambiguous problem, and chaining several of them together, becomes a real engineering challenge in itself.

Memory operates at the longitudinal layer. It preserves facts across sessions, allowing an agent to remember a user’s preferences or a codebase’s architecture weeks after the first interaction. But memory is also where the hardest failures live. Information stored months ago can decay, become contaminated by later writes, or be applied too broadly to situations it does not actually fit.

💡 Key Insight

A reliable agent is one whose prompts, skills, and memory all do the right thing at the right timescale. Most failure modes in production agents are not model failures. They are failures at one of these three layers, or in the seams between them.

The three hard problems

The Berkeley analysis names three engineering problems that the next generation of agent systems has to solve. The framings are useful because each one turns conventional intuition on its head.

Context: governance, not capacity

The intuitive response to context limitations has been to expand the window. Million-token contexts are now common. The paper argues that capacity is the wrong frame entirely.

“The hard problem of context is not capacity, but governance.” Stuffing a million tokens of mixed, conflicting, or irrelevant material into a context window does not give the model more to work with. It gives the model more to get lost in. The condition the paper names is “context rot,” and it shows up in real deployments as a model that has all the information it needs but cannot find any of it under the noise.

The implication for harness design is that context assembly must function as a strict selection policy aimed at minimum sufficient context, the smallest amount of information that actually enables the task, rather than maximum included context.
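A selection policy of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's method: the Candidate type, the relevance scores (assumed to come from some retriever or ranker), the token counts, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    relevance: float  # assumed score from a retriever or ranker, 0.0 to 1.0
    tokens: int       # assumed token count for this item


def select_context(candidates, budget_tokens, min_relevance=0.5):
    """Keep the most relevant items that clear a threshold and fit the budget."""
    chosen = []
    used = 0
    for item in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if item.relevance < min_relevance:
            break  # every remaining item is less relevant still
        if used + item.tokens > budget_tokens:
            continue  # skip items that would overflow the budget
        chosen.append(item)
        used += item.tokens
    return chosen
```

The relevance threshold is what separates this from simple packing: an item that fits the budget but is only weakly related to the task is still left out.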

Production systems already work this way. A recent architectural analysis of Claude Code described a five-tier compaction system that includes routines like “micro-compact” for cleaning up old tool results and “context collapse” for summarizing long dialogue spans. When a tool emits a massive text output, such as an endless server error log, the system avoids flooding the model. The full file is written to local disk and only an 8-kilobyte preview is supplied to the model. The model is forced to behave the way a human developer would: scan the top of the log first, dig deeper only if necessary.
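The disk-plus-preview pattern described above can be approximated as follows. This is a sketch, not Claude Code's implementation: the function name offload_tool_output, the file naming, and the wording of the truncation note are assumptions. Only the 8-kilobyte preview size comes from the description above.

```python
import tempfile
from pathlib import Path

PREVIEW_BYTES = 8 * 1024  # the 8-kilobyte preview size mentioned above


def offload_tool_output(output: str, directory: Path) -> tuple[str, Path]:
    """Write the full output to disk and return a bounded preview plus its path."""
    data = output.encode("utf-8")
    with tempfile.NamedTemporaryFile(
        dir=directory, prefix="tool_output_", suffix=".log", delete=False
    ) as handle:
        handle.write(data)
        path = Path(handle.name)
    if len(data) <= PREVIEW_BYTES:
        return output, path  # small outputs pass through unchanged
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    preview = data[:PREVIEW_BYTES].decode("utf-8", errors="ignore")
    note = f"\n[truncated: {len(data)} bytes total, full output at {path}]"
    return preview + note, path
```

Including the path in the preview matters: it lets the model request a deeper read of the file only when the top of the log is not enough.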

Memory: trust, not storage

For memory, the conventional question is how much to store and where. The paper rephrases the problem entirely. The difficulty is not how much you store. It is whether what you store can be trusted at the moment it is used.

The failure mode that matters here is the stale-but-confident agent. Semantic search retrieves a high-ranking memory entry, such as a note about how a codebase is structured, and the agent acts on it. But the codebase was refactored yesterday. The memory is technically still there, still ranks as relevant in embedding space, and is completely wrong in the world.

The defense is what the paper calls “skeptical memory.” A memory file like MEMORY.md is treated not as ground truth but as an unverified pointer, a hint. Before the agent takes any destructive action based on what the memory claims, it verifies the claim against the live environment. The memory entry is a starting point for inquiry, not a substitute for inquiry.
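In code, skeptical memory amounts to a guard in front of any destructive step. The sketch below assumes a deliberately simple kind of claim ("this file contains this marker string"). The MemoryClaim type and both function names are hypothetical, and a real harness would need richer checks than a substring match.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MemoryClaim:
    """A remembered assertion that a given file contains a given marker string."""
    path: str
    expected_marker: str


def claim_still_holds(claim: MemoryClaim, repo_root: Path) -> bool:
    # Check the live filesystem rather than trusting the stored note.
    target = repo_root / claim.path
    if not target.is_file():
        return False
    return claim.expected_marker in target.read_text(encoding="utf-8")


def delete_if_verified(claim: MemoryClaim, repo_root: Path) -> bool:
    """Run the destructive step only when the memory matches the environment."""
    if not claim_still_holds(claim, repo_root):
        return False  # stale memory: the caller should re-inspect instead
    (repo_root / claim.path).unlink()
    return True
```

A False return here is not an error. It is the signal that the memory entry has drifted from the environment and should be re-examined before anything else relies on it.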

Long-term memory hygiene matters too. Production systems run background processes during idle time (autoDream is the name used in one such implementation) to resolve contradictions between entries, compress accumulated insights, and bound memory growth before it degrades the agent’s behavior.

Skills: routing and checking, not having them

For skills, the standard ambition has been to give the agent more of them. The paper reframes once more. The difficulty is not having a library of capabilities. It is knowing which one to invoke for a given subproblem, and verifying that what it returned actually does the job.

The failure here is the confident-but-unchecked output. A specialized routing branch produces a highly plausible answer that no one verified. In a multi-agent configuration, this is the classic failure mode of sub-agent orchestration. Each sub-agent produces something coherent. Nothing guarantees that any of it is correct.

The fix is to tie skill selection directly to explicit post-condition checks. A skill is not just “the model invoked the right tool.” It is “the output of the right tool was verified to do what was claimed.”
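One way to make that binding concrete is to store the check alongside the skill, so the skill cannot be invoked without it. The following Python sketch is illustrative only: the Skill type, the invoke helper, and the toy number-sorting skill are all hypothetical names, not an API from the paper or from any harness.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str
    run: Callable[[str], str]
    postcondition: Callable[[str, str], bool]  # (input, output) -> acceptable?


class PostconditionFailed(Exception):
    """Raised when a skill's output does not pass its own check."""


def invoke(skill: Skill, payload: str) -> str:
    """Run a skill and return its output only if the postcondition holds."""
    output = skill.run(payload)
    if not skill.postcondition(payload, output):
        raise PostconditionFailed(f"{skill.name}: output failed its check")
    return output


def is_sorted_permutation(payload: str, output: str) -> bool:
    # Accept only a JSON list with the same numbers as the input, in order.
    try:
        before, after = json.loads(payload), json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(after, list) and after == sorted(before)


sort_numbers = Skill(
    name="sort_numbers",
    run=lambda payload: json.dumps(sorted(json.loads(payload))),
    postcondition=is_sorted_permutation,
)
print(invoke(sort_numbers, "[3, 1, 2]"))
```

Raising an exception, rather than returning the unchecked output with a warning, keeps a plausible-looking but wrong result from flowing into the next step of a chain.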

→ What this means

Each of these three reframings replaces a quantitative target (more context, more memory, more skills) with a qualitative discipline (governance, trust, verification). The shift is what separates a working demo from a production system.

Why one-shot evaluation misses the point

Almost every benchmark for AI agents reports a single number: did the agent complete the task? That number, the paper argues, masks what actually matters for deployment.

An agent that brute-forces its way to a correct answer through a hundred retries has technically succeeded on the benchmark. In production, it has burned roughly a hundred times the tokens of a clean run, may well have hit the API rate limit, and has consumed an hour of compute for what should have taken a minute. The success metric tells you none of this.

Realistic evaluation has to integrate process metrics: trajectory hygiene, verification overhead, and context efficiency over extended runs. It has to measure whether the agent’s behavior is improving or degrading over repeated use, because memory and accumulated context can quietly turn a working agent into a confidently wrong one.
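A few of these process metrics can be computed from a simple step log. The sketch below is an assumption-laden illustration: the Step record, its fields, and the particular metrics chosen (total tokens, retry count, share of tokens spent on verification) are hypothetical stand-ins for whatever a real evaluation would define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tokens: int
    is_retry: bool = False
    is_verification: bool = False


def process_metrics(steps: list[Step], succeeded: bool) -> dict[str, float]:
    """Summarize how a run went, not just whether it succeeded."""
    total_tokens = sum(s.tokens for s in steps)
    verification_tokens = sum(s.tokens for s in steps if s.is_verification)
    share = verification_tokens / total_tokens if total_tokens else 0.0
    return {
        "success": float(succeeded),
        "total_tokens": float(total_tokens),
        "retries": float(sum(s.is_retry for s in steps)),
        "verification_share": share,
    }


# Two runs with the same success score and very different trajectories.
clean_run = [Step(1000), Step(200, is_verification=True)]
brute_force_run = [Step(1000, is_retry=True)] * 99 + [Step(1000)]
print(process_metrics(clean_run, succeeded=True))
print(process_metrics(brute_force_run, succeeded=True))
```

Tracking the same dictionary across repeated runs of one agent is a simple way to check whether its behavior is improving or degrading with use.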

The point is sharper for multi-agent configurations. Parallel processing is impressive in a demo. Genuine collaboration requires a standardized communication layer for state sharing, contradiction spotting, and uncertainty reporting, none of which a one-shot benchmark exposes.

💡 Key Insight

The agent that wins a benchmark on success rate is not always the better agent to deploy. A single-number success metric hides token cost, retry behavior, latency, and the question of whether the system gets better or worse with use. Process metrics are what actually predict production behavior.

What this means for builders

Several practical implications follow.

If you are evaluating an agent, evaluating the model alone is no longer sufficient. The harness around it is doing as much work as the model, and possibly more. Two harnesses running the same foundation model will produce different agents.

If you are building an agent, the engineering work that pays back most is rarely in the prompt. It is in the layers that govern context, verify memory, and check skill output. The model gets the credit when the agent succeeds. The harness deserves a meaningful share of it.

If you are buying agentic AI as a customer, ask what the verification and governance layer actually does. Ask what gets persisted, what gets updated, and what leaves an unalterable audit trail. Without those standards, “learning agents” risk becoming opaque accumulations of prompts, notes, and heuristics rather than reliable adaptive systems.

The paper’s broader argument is that the labs and teams that solve harness scaling, alongside model scaling, will define the next phase of agentic AI. They will not necessarily be the ones with the smartest model. They will be the ones whose systems remember accurately, route reliably, verify systematically, and leave a record of what they did.

Frequently Asked Questions

What is an AI harness?

The harness is the structured execution layer that wraps a foundation model and translates its outputs into real-world behavior. It typically includes six components: a reasoning substrate (the model), a memory store, a context constructor, a skill-routing layer, an orchestration loop, and a verification-and-governance layer. Together they determine what an agent can actually do in production.

Why is the harness the next bottleneck, rather than the model itself?

Because an agent’s behavior in production is determined by the entire system, not just the model. Two agents running the same foundation model can behave very differently depending on how their harness governs context, verifies memory, and checks skill outputs. As foundation models become more capable, the harness becomes the limiting factor on whether that capability actually translates into reliable agent behavior.

What is “context rot” and how do you avoid it?

Context rot is what happens when a context window becomes long enough that the model struggles to find the relevant information amid the noise. Large context windows do not solve it on their own; filling them with mixed or irrelevant material makes it worse. The fix is treating context assembly as a strict selection policy that includes only the minimum sufficient information for the task at hand. Production harnesses use techniques like layered compaction and writing large outputs to disk while supplying short previews to the model.

What is the “stale-but-confident” memory problem?

It is when an agent retrieves a memory that is high-ranking in semantic search but no longer true in the live environment, and acts on it without checking. The fix is skeptical memory: treating memory entries as unverified hints to be confirmed against the live system before any destructive action is taken.

Why is one-shot evaluation insufficient for agents?

Because it measures only whether the task was completed, not how. An agent that succeeds via a hundred retries has the same success score as one that succeeds cleanly, even though the first one has burned far more tokens and time. Realistic evaluation requires process metrics across extended runs, including efficiency, verification overhead, and whether the agent’s behavior improves or degrades with repeated use.