How much of a frontier model's benchmark score is the model, and how much is the wrapper around it? Across 64 same-model, different-wrapper pairs on 9 agentic benchmarks, the median gap is 16 percentage points.
Sam Donahue · May 11, 2026
A pair is one base model run on one benchmark under two different agentic frameworks. SWE-Agent versus Agentless on GPT-4o. HAL Generalist versus Claude Code on Claude Opus 4.5. HF Open Deep Research versus HAL Generalist on Claude Sonnet 4.5. Same model, different wrapper, different score. The gap belongs to the wrapper.
Most pairs come from HAL, Princeton's Holistic Agent Leaderboard. HAL runs each model under multiple frameworks on standardised infrastructure across six benchmarks: GAIA (web research), SWE-bench Verified Mini (coding), CORE-Bench Hard (scientific reproducibility), TAU-bench Airline (customer service), Online Mind2Web (web tasks), and ScienceAgentBench. Standardised infrastructure makes the harness Δ on HAL unusually clean. The rest come from swe-bench/experiments (Verified, Lite, Multimodal) and the ARC Prize board.
The aggregate median is 15.6 pp. The per-benchmark spread is wider.
Each bar is the median of |delta_pp| across that benchmark's same-model pairs. SWE-bench Verified Mini has the most pairs (n = 14) and the widest spread; Online Mind2Web (n = 6) the narrowest.
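The per-benchmark bar can be sketched in a few lines. This is a minimal illustration, not the published script: the row layout, the benchmark and model names, and every score below are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (benchmark, model, score_a_pct, score_b_pct).
# All values are made up for illustration; they are not from the dataset.
pairs = [
    ("bench_x", "model_1", 40.0, 62.0),
    ("bench_x", "model_2", 55.0, 45.0),
    ("bench_x", "model_3", 30.0, 61.0),
    ("bench_y", "model_1", 38.0, 42.0),
    ("bench_y", "model_2", 50.0, 44.0),
]

def median_abs_delta(rows):
    """Return {benchmark: median of |score_a - score_b|} in percentage points."""
    by_benchmark = defaultdict(list)
    for benchmark, _model, score_a, score_b in rows:
        by_benchmark[benchmark].append(abs(score_a - score_b))
    return {b: median(deltas) for b, deltas in by_benchmark.items()}

bars = median_abs_delta(pairs)
print(bars)  # {'bench_x': 22.0, 'bench_y': 5.0}
```

Taking the absolute value first means the bar ignores which wrapper won; it only measures how far apart the two scores are.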
Bar order is the first clue to the mechanism. The top benchmarks (SWE-bench Verified Mini, GAIA, TAU-bench Airline) host both purpose-built scaffolds and generic agents on the same leaderboard. The bottom benchmarks (Online Mind2Web, SWE-bench Verified) host scaffolds that have converged. The harness effect is large when the leaderboard is unsettled and shrinks as the design space converges.
A purpose-built scaffold encodes three things a generic agent does not. A tool surface fitted to the benchmark: file-edit and pytest tools for code, DOM and click tools for web. An agent loop tuned to that task class's failure modes: re-plan after a failed test, re-locate after a navigation change. And a system prompt that already knows what the eval looks like.
A generic agent has none of that. It has to discover the right tool, the right loop, and the right framing inside the eval. The harness Δ is the cost of that discovery. Web tasks have converged on Browser-Use and SeeAct, so the gap is small. Code tasks have not: SWE-Agent, Agentless, OpenHands, ACoder, and Refact.ai encode different bets, and those bets produce different scores.
If specialised scaffolds reliably beat generic agents, every dot should sit above the y = x diagonal.
The subset is 36 pairs where one side is benchmark-specific (SWE-Agent, TAU-bench Tool Calling, Claude Code, CORE-Agent, SAB Self-Debug) and the other is generic (HAL Generalist, HF Open Deep Research, RAG, direct prompting). Pairs where both sides are specialised or both generic are excluded; they don't test the question.
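The diagonal test amounts to counting which side of y = x each pair falls on. A minimal sketch, assuming each pair is stored as a (generic score, specialised score) tuple; the points and the `tol` parameter are hypothetical.

```python
# Hypothetical (generic_score_pct, specialised_score_pct) points.
# Values are illustrative only, not taken from the 36-pair subset.
points = [(20.0, 45.0), (35.0, 35.0), (50.0, 62.0), (41.0, 38.0)]

def diagonal_split(points, tol=0.0):
    """Count points above, on, and below the y = x diagonal.

    A point is 'on' the diagonal when the two scores differ by at most `tol` pp.
    """
    above = sum(1 for generic, special in points if special - generic > tol)
    on = sum(1 for generic, special in points if abs(special - generic) <= tol)
    below = len(points) - above - on
    return above, on, below

print(diagonal_split(points))  # (2, 1, 1)
```

A non-zero `tol` would treat near-ties as on-diagonal, which may be the safer reading given run-to-run noise.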
On the benchmarks where the comparison is possible, the harness layer accounts for 30–50% of the score.
Both on-diagonal dots are Online Mind2Web, the most converged benchmark. Where design space is still open (code, customer service, scientific reproducibility), the specialised wrapper does 30–50% of the work. A model-only baseline does not represent what the same model can do under a real scaffold.
Anthropic builds both Claude and Claude Code. If integration is a real moat, the same Claude should beat itself under a third-party wrapper.
CORE-Bench Hard (scientific code reproducibility, 45 tasks) is the only public benchmark we found that runs the same Claude under both Claude Code and a third-party baseline (CORE-Agent). HAL has reported this for three Claude generations. Claude Code wins by 18–36 pp in all three.
The premium grows with the model. Opus 4.5 gains 35.6 pp, Sonnet 4 gains 17.8 pp. Newer Claudes get more from Claude Code, not less. OpenAI hasn't published a Codex-CLI-vs-other-harness comparison on the same OpenAI model on CORE-Bench; Google has published informal Jules-vs-Gemini-CLI numbers but no rigorous public benchmark in this format. What is in the public record on CORE-Bench is an 18–36 pp swing from one wrapper choice, invisible in raw model benchmarks.
Harness cost spans $1 to $1,600 per task on HAL. If you pay 1,000× more, do you get 1,000× more Δ?
43 of 64 pairs have a published COST (USD) figure for both harnesses. If price predicted performance, we would see a clear upward trend. We see scatter. The most efficient pair is DeepSeek V3 on ScienceAgentBench: +14.7 pp for $2.09 per task, or 7.0 pp per dollar. The least efficient is Browser-Use on Online Mind2Web with Claude models: $1,150–$1,577 per task for 2–10 pp.
Five of the top six pairs by pp-per-dollar are TAU-bench Tool Calling vs HAL Generalist Agent on customer-service tasks. Design choices that score higher also cost less, the opposite of what catalog pricing would predict. Specialisation wins twice: quality and price.
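The pp-per-dollar figures are plain division, which a short sketch makes explicit. The inputs are the numbers quoted above; the function name is ours, and pairing the extremes of the Browser-Use ranges to get bounds is an assumption, since the text does not say which cost goes with which gain.

```python
def pp_per_dollar(delta_pp, cost_usd):
    """Percentage points gained per dollar of per-task cost."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return delta_pp / cost_usd

# DeepSeek V3 on ScienceAgentBench: +14.7 pp for $2.09 per task.
best = pp_per_dollar(14.7, 2.09)
print(round(best, 1))  # 7.0

# Browser-Use on Online Mind2Web: 2-10 pp for $1,150-$1,577 per task.
# Assumption: bounds come from pairing the extremes of the two ranges.
worst_low = pp_per_dollar(2, 1577)
worst_high = pp_per_dollar(10, 1150)
print(worst_high < 0.01)  # True
```

Even on the most favourable pairing, the Browser-Use pairs return under a hundredth of a percentage point per dollar, roughly three orders of magnitude below the DeepSeek V3 pair.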
Two natural objections both predict a shrinking harness Δ. (a) Smarter models internalise the work scaffolds do. (b) Scaffold design matures and closes the headroom on base models. Both predict the wrong sign.
AAII is Artificial Analysis's composite index across reasoning, coding, math, and tool-use evals: a single number scoring frontier models from ~10 (GPT-4o-mini class) to ~60 (GPT-5.5 Pro class). We matched 44 of 64 pairs directly. The 20 unmatched are mostly older Claude 4 generations, now superseded.
If smarter models needed less help, we would see a negative slope. We see a faintly positive one. To a first approximation, the harness premium does not shrink with model capability over the range we sampled. A GPT-5.5-class model under the right scaffold beats itself under the wrong one by roughly the same margin a Claude 3-class model would.
Each dot is anchored to its base model's release date, not the harness submission date. Data spans Mar 2023 (GPT-4 1106) to Oct 2025 (Claude Sonnet 4.5). Newer models are over-represented because they have more leaderboard submissions.
The convergence story (scaffolds hit diminishing returns as base models improve) predicts a falling slope. We get a rising one. Scaffold research is outpacing model research on agentic tasks. Caveat: most new scaffolds in our dataset (SWE-Agent, OpenHands, Refact.ai, EPAM AI/Run, ACoder) are code-specific, where the design space is widest. Whether the trend holds outside code is the open question.
Both layers improved, but the harness pulled away faster. Pool the 64 pairs across nine benchmarks, smooth the medians, and the gap between vanilla and best-harness scores widened from +21 pp in late 2023 to +34 pp by late 2025.
The vanilla scaffold's median rose from ~5% (best a 2023 model could do in a bare agent loop) to ~42% (best from a late-2025 Claude in the same loop). The best-harness median rose from ~26% to ~76%. The wrapper kept adding more on top, not less.
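The two gap figures follow directly from the four medians. A quick check using the approximate values quoted above (the dictionary names are ours):

```python
# Approximate smoothed medians quoted in the text, in percent.
vanilla = {"late 2023": 5, "late 2025": 42}
best_harness = {"late 2023": 26, "late 2025": 76}

# Gap between best-harness and vanilla medians, in percentage points.
gap_pp = {period: best_harness[period] - vanilla[period] for period in vanilla}
widening_pp = gap_pp["late 2025"] - gap_pp["late 2023"]

print(gap_pp)       # {'late 2023': 21, 'late 2025': 34}
print(widening_pp)  # 13
```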
That is the pooled view. Decompose per benchmark and the picture sharpens: for each eval, fit two OLS slopes over release date and plot each benchmark as one dot in slope-vs-slope space.
Three benchmarks have CIs that don't cross the diagonal. CORE-Bench Hard (harness +6.0 pp/mo vs vanilla +3.2) and GAIA (harness +4.2 vs vanilla +0.3) sit clearly above. The wrapper is improving faster than the underlying model on both. SWE-bench Lite sits clearly below. Model gains are outpacing the harness layer there. The other six cluster on or near the diagonal with crosses that overlap it, meaning we can't distinguish their gaps from zero at current n.
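The slope-vs-slope comparison can be sketched with a hand-rolled least-squares slope. This is an illustration under stated assumptions, not the published analysis: the time axis is months since the first release, the scores are hypothetical, and the confidence intervals that decide whether a dot clears the diagonal are omitted.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

# Hypothetical benchmark: months since first model release, and the score (%)
# each model reached under the vanilla and best-harness scaffolds.
months = [0, 6, 12, 18]
vanilla_scores = [10.0, 13.0, 16.0, 19.0]
harness_scores = [30.0, 36.0, 42.0, 48.0]

vanilla_slope = ols_slope(months, vanilla_scores)   # pp per month
harness_slope = ols_slope(months, harness_scores)   # pp per month

# Above the diagonal means the harness line is rising faster than the vanilla one.
above_diagonal = harness_slope > vanilla_slope
print(vanilla_slope, harness_slope, above_diagonal)  # 0.5 1.0 True
```

With only 4 to 14 points per benchmark, the slope estimate alone says little; the interval around it is what separates the three clear cases from the six that overlap the diagonal.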
A reading that fits the spread: where the scaffold's design space is largely solved, base-model gains dominate; where it isn't, the scaffold keeps pulling ahead. SWE-bench is the most-studied agentic eval in history, with more than two years of harness iteration since its release in late 2023. CORE-Bench is newer, less crowded, and its best entry is Claude Code, which Anthropic redesigned for research replication six months ago. GAIA's top scaffold is the HAL Generalist Agent, similarly young. Whether the pattern survives more releases is the open question.
Two caveats. N is small (4 to 14 per benchmark); the wide error crosses on most dots make this visible. And CORE-Bench's "harness winning" result is Anthropic's home turf. Same lab makes both the model and the wrapper, so the slope can read as scaffold research outpacing model research, or as one vertically integrated lab outpacing the field.
Five claims follow from the data above. Each is anchored to a specific pair. None is a recommendation.
Model labs have raised more than $200B cumulatively. Harness-layer companies have raised a small fraction of that. The harness accounts for 30–50% of score on agentic benchmarks where the comparison is possible.
On CORE-Bench Hard, Claude Code beats the third-party CORE-Agent baseline by 18–36 pp across the three Claude generations HAL has tested with Claude Code. OpenAI and Google have not published equivalent same-model harness comparisons in this format. Model-only leaderboards miss this layer entirely.
Cursor's last round was ~$29B (Nov 2025); reporting in April 2026 put it in talks at ~$50B. Cognition closed at ~$10B (Sept 2025) and is reportedly in talks near $25B. Legal, clinical, finance, and accounting agents have no public same-model leaderboard yet.
If a third-party harness explains 30%+ of the score and anyone can buy it, the model API is a commodity input. Cohere, Together, and Mistral La Plateforme are the most exposed. OpenRouter and Portkey are picks-and-shovels on the same trend.
SWE-Agent, OpenHands, Aider, and Browser-Use are open-source frameworks that have repeatedly anchored top SWE-bench submissions. Star and contributor velocity on those repos lead the enterprise-product cycle by months. Synopticon's GitHub feed tracks them.
Three recent papers measure the same phenomenon at different layers of the agent stack. Together they corroborate our finding, sharpen one of the implications, and quantify a limitation.
A skill is not a harness. A harness wraps the model and runs the agent loop. A skill is a composable module (a markdown file plus optional scripts) that the harness loads at runtime to specialise for a task. The harness sits one layer above the model; the skill sits one layer below the harness.
SkillsBench (Feb 2026) measures the lift from curated skills across seven model × harness configurations on 84 tasks across 11 domains. Their average lift is +16.2 pp. Our dataset reports a median +15.6 pp at the layer above. Two independent groups, two adjacent layers, the same magnitude.
| Layer | What it adds | Paper | Effect |
|---|---|---|---|
| Harness | Agent loop that wraps the model | This piece (64 pairs) | +15.6 pp median |
| Skill | Composable module the harness loads at runtime | SkillsBench (7 configs × 84 tasks) | +16.2 pp average |
The agent stack compounds. Each layer contributes roughly 15 pp on top of the model on the kinds of tasks both papers measure.
SkillsBench reports that Claude Code shows the most consistent skill uplift (+13.9 to +23.3 pp), and that Codex CLI "frequently neglects provided Skills". Gemini CLI also uses skills reliably (+13.6 to +17.4 pp) in their tests. So the cleaner reading is: Claude Code and Gemini CLI consume the skill layer; Codex CLI does not. The Anthropic premium claim is narrower than "best on every front". It is best in the public same-model harness comparison we have.
SoK: Agentic Skills (Feb 2026) adds two pieces of nuance to the skill-layer picture. First, self-generated skills (skills an agent writes for itself) change performance by −1.3 pp on average, a net loss. Curation matters. Second, per-domain variance is large: healthcare skills add +51.9 pp, manufacturing +41.9 pp, software engineering only +4.5 pp. The harness Δ also varies by benchmark (our Online Mind2Web median is 8 pp vs SWE-bench Verified Mini's 31 pp). Both layers reward domain-specific design effort.
SWE-Skills-Bench (Mar 2026) tests whether skills help on real-world software engineering rather than agentic benchmarks. The answer is mostly no. Of 49 curated skills tested on Claude Haiku 4.5 + Claude Code, 39 produced zero pass-rate improvement and the average gain was +1.2%. Three skills hurt performance. The leaderboard Δ shrinks roughly an order of magnitude on in-the-wild tasks.
OAgents (ICML 2025) dissects agent design components on GAIA and BrowseComp. Their finding: "the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs." Our dataset is a single snapshot per pair. Individual entries carry run-to-run noise. The medians and IQRs in this piece are less noisy than any single row.
Four caveats, in order of how much they should worry you.
A pair is two entries on the same public leaderboard where (1) the base model is identical after normalising version strings, (2) the benchmark is identical, and (3) the agentic framework differs. Reasoning-effort, sample-count, and skill-toggle changes are excluded; they are not framework changes.
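The three-part definition maps onto a group-and-combine step. The sketch below is one possible reading, not the published script: the entry fields, the `normalise_model` rule, and all names and scores are hypothetical, and the exclusions for reasoning-effort, sample-count, and skill-toggle changes are not modelled.

```python
from collections import defaultdict
from itertools import combinations

def normalise_model(name):
    """Hypothetical normaliser: lowercase and unify separators in version strings."""
    return name.strip().lower().replace("_", "-").replace(" ", "-")

def build_pairs(entries):
    """Pair entries sharing (normalised model, benchmark) but differing in framework."""
    groups = defaultdict(list)
    for entry in entries:
        key = (normalise_model(entry["model"]), entry["benchmark"])
        groups[key].append(entry)

    pairs = []
    for (model, benchmark), rows in groups.items():
        for a, b in combinations(rows, 2):
            if a["framework"] != b["framework"]:
                pairs.append({
                    "model": model,
                    "benchmark": benchmark,
                    "frameworks": (a["framework"], b["framework"]),
                    "delta_pp": abs(a["score"] - b["score"]),
                })
    return pairs

# Hypothetical leaderboard entries; scores are illustrative only.
entries = [
    {"model": "Model-A v1", "benchmark": "bench_x", "framework": "scaffold_a", "score": 30.0},
    {"model": "model_a v1", "benchmark": "bench_x", "framework": "scaffold_b", "score": 42.5},
    {"model": "model-a-v1", "benchmark": "bench_y", "framework": "scaffold_a", "score": 50.0},
]

pairs = build_pairs(entries)
print(len(pairs), pairs[0]["delta_pp"])  # 1 12.5
```

The third entry forms no pair because nothing else on `bench_y` shares its model, which mirrors how single-entry benchmarks drop out of the dataset.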
Sources: HAL leaderboards (Princeton, six benchmarks, scraped via Playwright); swe-bench/experiments repository (Verified, Lite, Multimodal splits, scored as n_resolved / n_total, then lowest- vs highest-scoring system per model); ARC Prize leaderboard. Cost-efficiency: 43 / 64 pairs where HAL publishes COST (USD) for both harnesses. AAII source: Artificial Analysis snapshot 2026-05-09 (44 / 64 pairs matched).
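The swe-bench/experiments scoring rule (resolved over total, then lowest versus highest system per model) can be sketched as follows. The tuple layout, system names, and counts are hypothetical.

```python
def swe_bench_pair(systems):
    """Score each system as n_resolved / n_total and return the extremes.

    `systems` is a list of (system_name, n_resolved, n_total) for ONE base model.
    Returns (lowest, highest, delta_pp), where lowest and highest are
    (system_name, score_pct) tuples.
    """
    scored = [(name, 100.0 * resolved / total) for name, resolved, total in systems]
    lowest = min(scored, key=lambda item: item[1])
    highest = max(scored, key=lambda item: item[1])
    return lowest, highest, highest[1] - lowest[1]

# Hypothetical submissions for one model on a 500-task split.
systems = [
    ("system_a", 100, 500),
    ("system_b", 150, 500),
    ("system_c", 125, 500),
]

lowest, highest, delta_pp = swe_bench_pair(systems)
print(lowest, highest, delta_pp)  # ('system_a', 20.0) ('system_b', 30.0) 10.0
```

Taking the extremes yields the largest gap available for each model, so these pairs are an upper bound on the harness Δ for that model rather than a typical value.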
Headline numbers are over all 64 pairs: median |Δ| 15.6 pp; mean 18.6 pp; p90 37.3 pp; max 48 pp. 70% of pairs see ≥10 pp, 39% see ≥20 pp.
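For readers reproducing the headline statistics, here is a minimal sketch. The delta values are hypothetical, and the percentile uses linear interpolation between ranks, which is an assumption; the piece does not state which percentile method was used, and different methods give slightly different p90 values at small n.

```python
from statistics import mean, median

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (0 <= q <= 100)."""
    ordered = sorted(values)
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])

def share_at_least(values, threshold):
    """Fraction of values greater than or equal to `threshold`."""
    return sum(1 for v in values if v >= threshold) / len(values)

# Hypothetical |delta| values in percentage points, not the real 64 pairs.
deltas = [2, 5, 8, 10, 12, 15, 18, 22, 30, 40]

summary = {
    "median": median(deltas),
    "mean": mean(deltas),
    "p90": percentile(deltas, 90),
    "max": max(deltas),
    "share_ge_10": share_at_least(deltas, 10),
    "share_ge_20": share_at_least(deltas, 20),
}
print(summary)
```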
Run-to-run noise. Each pair is a single snapshot. OAgents (ICML 2025) reports significant variance on the same model + agent + benchmark combination across re-runs. Medians and IQRs in this piece are less noisy than any single row.
Excluded from the dataset. OSWorld (113 entries parsed but no qualifying pairs, since variation comes from step-budget changes); Cybench (single entry per model); Anthropic / OpenAI internal evals (not public); ARC-AGI-2 (only one same-model pair available, dropped to avoid distorting per-benchmark statistics).
Skill vs harness. We measure framework-vs-framework effects at the harness layer. SkillsBench measures skill-vs-no-skill effects one layer below.
Scripts: research/harness-effect/scripts/. Dataset: research/harness-effect/data/harness_pairs.csv.