Harness Effect · Agentic Benchmarks · May 2026

The Harness Moves the Score

How much of a frontier model's benchmark score is the model, and how much is the harness around it? To find out, we collected 64 same-model, different-harness pairs from public leaderboards across 9 agentic benchmarks. The median gap is ~16 percentage points.

Synopticon Research · May 11, 2026

Figure: Two horses, both labelled Claude Opus 4.5, race on a steeplechase course. The left horse pulls a sulky labelled Generic Agent and posts a 42% benchmark score. The right horse, ridden by a jockey in Claude Code colours, jumps GAIA, SWE-Bench, and CORE-Bench fences and posts 78%. Caption: Same model, different harness. The benchmark score swings by tens of points.

On CORE-Bench Hard, Claude Opus 4.5 scores 42% with Princeton's baseline CORE-Agent and 78% with Anthropic's Claude Code. The model and the tasks are identical; the harness around the model (its tools, its agent loop, its system prompt) is not. We collected 63 more same-model pairs from public leaderboards across 9 agentic benchmarks. The gap belongs to the harness.

Median harness Δ 15.6 pp, max 48 pp, on the same model

Each dot is one model evaluated under two different agentic frameworks. Per-benchmark medians marked with I-beams.

Sources: HAL leaderboards · swe-bench/experiments.

Where the pairs come from

Most pairs come from HAL, Princeton's Holistic Agent Leaderboard, which runs each model under multiple frameworks on standardised infrastructure across six benchmarks: GAIA (web research), SWE-bench Verified Mini (coding), CORE-Bench Hard (scientific reproducibility), TAU-bench Airline (customer service), Online Mind2Web (web tasks), and ScienceAgentBench. The rest come from swe-bench/experiments (Verified, Lite, Multimodal). The ARC Prize board was also scanned, but its single same-model pair was dropped (see Methodology).

How big is it, and where?

The aggregate median is 15.6 pp. The per-benchmark spread is wider.

The gap varies by benchmark, biggest where a purpose-built scaffold faces a generic agent

Median |Δ| pp per benchmark. SWE-bench Mini and GAIA pit specialised scaffolds against HAL Generalist; web-task benchmarks have converged on a small number of designs.

Sources: HAL leaderboards · swe-bench/experiments.

Each bar is the median of |delta_pp| across that benchmark's same-model pairs. SWE-bench Verified Mini has the most pairs (n = 14) and the widest spread; Online Mind2Web (n = 6) the narrowest.
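
The per-benchmark bar heights can be reproduced with a few lines of grouping. The sketch below is a minimal illustration, not the project's script: the row layout and the benchmark names `bench_A` and `bench_B` are assumptions, and the scores are placeholders rather than rows from the dataset.

```python
from collections import defaultdict
from statistics import median

# Illustrative rows only: (benchmark, score_a_pct, score_b_pct).
# These values are placeholders, not rows from the dataset.
pairs = [
    ("bench_A", 10.0, 40.0),
    ("bench_A", 55.0, 35.0),
    ("bench_A", 20.0, 30.0),
    ("bench_B", 50.0, 54.0),
    ("bench_B", 61.0, 53.0),
]

def median_abs_delta(rows):
    """Median of |score_b - score_a| in percentage points, per benchmark."""
    deltas = defaultdict(list)
    for benchmark, score_a, score_b in rows:
        # The sign is dropped: a bar measures the size of the swing,
        # not which harness came out ahead.
        deltas[benchmark].append(abs(score_b - score_a))
    return {benchmark: median(values) for benchmark, values in deltas.items()}

per_benchmark = median_abs_delta(pairs)
```

Taking the absolute value before the median is what lets pairs with opposite winners sit in the same bar.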

The harness Δ is biggest where the leaderboard is still being figured out and smallest where it has settled on a winning design. The top benchmarks (SWE-bench Verified Mini, GAIA, TAU-bench Airline) host both purpose-built scaffolds and generic agents competing on the same model; different design choices produce different scores. The bottom benchmarks (Online Mind2Web, SWE-bench Verified) host scaffolds that have converged on similar designs, so swapping the harness barely moves the score.

Why the gap exists

A purpose-built scaffold encodes three things a generic agent does not. A tool surface fitted to the benchmark: file-edit and pytest tools for code, DOM and click tools for web. An agent loop tuned to that task class's failure modes: re-plan after a failed test, re-locate after a navigation change. And a system prompt that frames the task (role, output format, how to call the tools) so the model spends compute on solving rather than discovering the harness.

A generic agent has none of that. It has to discover the right tool, the right loop, and the right framing inside the eval. The harness Δ is the cost of that discovery. Web tasks have converged on Browser-Use and SeeAct, so the gap is small. Code tasks have not: SWE-Agent, Agentless, OpenHands, ACoder, and Refact.ai encode different bets, and those bets produce different scores.
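
The three ingredients described above (tool surface, agent loop, system prompt) can be pictured as a small data structure plus a loop. This is a toy sketch under stated assumptions: the `Harness` class, the `name:argument` action format, and `stub_model` are all hypothetical and do not mirror any real scaffold's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Harness:
    """Hypothetical container for the three ingredients of a scaffold."""
    system_prompt: str                      # frames role, output format, tool use
    tools: Dict[str, Callable[[str], str]]  # tool surface fitted to the task
    max_steps: int = 5                      # step budget for the agent loop

def run(harness: Harness, model: Callable[[List[str]], str], task: str) -> str:
    """Minimal agent loop: the model either calls a tool or returns an answer."""
    transcript = [harness.system_prompt, task]
    for _ in range(harness.max_steps):
        action = model(transcript)          # e.g. "pytest:tests/" or "final:done"
        name, _, arg = action.partition(":")
        if name == "final":
            return arg
        if name not in harness.tools:
            # A generic agent pays this cost: finding out which tools exist.
            transcript.append(f"error: unknown tool {name!r}")
            continue
        transcript.append(harness.tools[name](arg))
    return "no answer"

def stub_model(transcript: List[str]) -> str:
    # Stand-in for an LLM call: run the tests once, then finish.
    if any("passed" in line for line in transcript):
        return "final:patched"
    return "pytest:tests/"

code_harness = Harness(
    system_prompt="You are a coding agent. Call tools as name:argument.",
    tools={"pytest": lambda path: f"{path}: 3 passed"},
)
result = run(code_harness, stub_model, "Fix the failing test.")
```

In this framing, swapping the harness means swapping `system_prompt`, `tools`, and the loop policy while `model` stays fixed, which is the comparison each pair in the dataset makes.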

Specialised vs generic: do scaffolds beat agents?

If specialised scaffolds reliably beat generic agents, every dot should sit above the y = x diagonal.

y > x: the specialised scaffold wins, on the same model

36 same-model pairs. X-axis is the score with a generic agent; y-axis is the score with a benchmark-specific scaffold. 34 of 36 dots sit above the diagonal.

Source: HAL leaderboards (GAIA, SWE-bench Verified Mini, TAU-bench Airline, CORE-Bench Hard, ScienceAgentBench).

The subset is 36 pairs where one side is benchmark-specific (SWE-Agent, TAU-bench Tool Calling, Claude Code, CORE-Agent, SAB Self-Debug) and the other is generic (HAL Generalist, HF Open Deep Research, RAG, direct prompting). Pairs where both sides are specialised or both generic are excluded; they don't test the question.
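
The subset rule can be written as a small filter. The sketch below assumes each pair is stored as two `(harness, score)` tuples; that layout is ours, and only the o4-mini Low scores are taken from this piece. The other scores are placeholders.

```python
# Harness groupings as listed in the paragraph above.
SPECIALISED = {"SWE-Agent", "TAU-bench Tool Calling", "Claude Code",
               "CORE-Agent", "SAB Self-Debug"}
GENERIC = {"HAL Generalist", "HF Open Deep Research", "RAG", "direct prompting"}

def to_xy(pair):
    """Return (generic_score, specialised_score), or None if the pair is not mixed."""
    (h1, s1), (h2, s2) = pair
    if h1 in GENERIC and h2 in SPECIALISED:
        return (s1, s2)
    if h1 in SPECIALISED and h2 in GENERIC:
        return (s2, s1)
    return None  # both specialised or both generic: excluded

pairs = [
    (("HAL Generalist", 6.0), ("SWE-Agent", 54.0)),  # o4-mini Low figures quoted in this piece
    (("SWE-Agent", 50.0), ("Claude Code", 52.0)),    # placeholder: both specialised, excluded
    (("SAB Self-Debug", 28.0), ("RAG", 30.0)),       # placeholder: generic side scores higher
]

points = [xy for xy in map(to_xy, pairs) if xy is not None]
above = sum(1 for x, y in points if y > x)  # dots above the y = x diagonal
```

Orienting every pair as (generic, specialised) before plotting is what makes "above the diagonal" mean "the specialised scaffold wins".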

On the benchmarks where the comparison is possible, the harness layer accounts for 30–50% of the score.

Both on-diagonal dots are Online Mind2Web, the most converged benchmark. Where design space is still open (code, customer service, scientific reproducibility), the specialised harness does 30–50% of the work. A model-only baseline does not represent what the same model can do under a real scaffold.

Does owning both layers pay off? Anthropic, the test case.

Anthropic builds both Claude and Claude Code. If integration is a real moat, the same Claude should beat itself under a third-party harness. Anthropic is the only frontier lab whose same-model harness premium is in the public record in a format we can test; the OpenAI and Google equivalents come later.

CORE-Bench Hard (scientific code reproducibility, 45 tasks) is the public benchmark that runs the same Claude under both Claude Code and a third-party baseline (CORE-Agent). HAL has reported this for three Claude generations. Claude Code wins by 18–36 pp in all three.

Claude Code adds 18–36 pp over a generic agent on the same Claude model

CORE-Bench Hard. Each row is one Claude generation; only the harness changes.

Source: HAL CORE-Bench Hard. Same model + Claude Code (Anthropic) vs same model + CORE-Agent (Princeton).

The premium grows with the model. Opus 4.5 gains 35.6 pp, Sonnet 4 gains 17.8 pp. Newer Claudes get more from Claude Code, not less. That is an 18–36 pp swing from a single harness choice, and it is invisible in raw model benchmarks. OpenAI has not published Codex-CLI-vs-other-harness on the same OpenAI model; Google has published informal Jules-vs-Gemini-CLI numbers but nothing in this format. The integration-moat hypothesis is tested on one lab so far; the other two await data.

Are expensive harnesses worth it?

Harness cost spans $1 to $1,600 per task on HAL. If you pay 1,000× more, do you get 1,000× more Δ?

Cost efficiency: harness Δ vs USD per task, log-scale x

43 same-model pairs from HAL with cost data on both sides. Cheap pairs ($1–$10) often match or beat expensive ones ($100–$1,500). Cost and Δ are weakly correlated.

Source: HAL leaderboards, COST (USD) column. 43 of 64 pairs have published cost for both harnesses.

If price predicted performance we would see a clear upward trend. We see scatter. The most efficient pair is DeepSeek V3 on ScienceAgentBench: +14.7 pp for $2.09 per task, or 7.0 pp per dollar. The least efficient is Browser-Use on Online Mind2Web with Claude models: $1,150–$1,577 per task for 2–10 pp.
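
The efficiency figure is a plain ratio, and the arithmetic can be checked directly. In the sketch below the function name is ours, and pairing the ends of the quoted Browser-Use ranges to get bounds is an assumption; the text does not say which lift goes with which cost.

```python
def pp_per_dollar(delta_pp: float, cost_usd_per_task: float) -> float:
    """Harness lift in percentage points per dollar of per-task cost."""
    if cost_usd_per_task <= 0:
        raise ValueError("cost must be positive")
    return delta_pp / cost_usd_per_task

# DeepSeek V3 on ScienceAgentBench, figures quoted above.
best = pp_per_dollar(14.7, 2.09)

# Browser-Use on Online Mind2Web: bounds built from the quoted ranges.
# Assumption: which lift belongs to which cost is not stated in the text.
worst_low = pp_per_dollar(2.0, 1577.0)    # smallest lift at the highest cost
worst_high = pp_per_dollar(10.0, 1150.0)  # largest lift at the lowest cost

print(f"best: {best:.1f} pp per dollar")
print(f"worst: {worst_low:.4f} to {worst_high:.4f} pp per dollar")
```

Even the generous bound for the expensive pairs comes out below 0.01 pp per dollar, roughly three orders of magnitude under the DeepSeek V3 pair.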

Five of the top six pairs by pp-per-dollar are TAU-bench Tool Calling vs HAL Generalist Agent on customer-service tasks. Design choices that score higher also cost less, the opposite of what catalog pricing would predict. Specialisation wins twice: quality and price.

Is the effect fading?

Two natural objections both predict a shrinking harness Δ. (a) Smarter models internalise the work scaffolds do. (b) Scaffold design matures and closes the headroom on base models. Both predict the wrong sign.

Smarter models do not get a smaller harness lift

38 pairs where the base model has a published Artificial Analysis Intelligence Index (AAII). Slope of the OLS fit is +0.49 pp per AAII point.

AAII source: Artificial Analysis snapshot 2026-05-09. Harness Δ from this dataset (38 of 64 pairs matched).

AAII is Artificial Analysis's composite index across reasoning, coding, math, and tool-use evals: a single number scoring frontier models from ~10 (GPT-4o-mini class) to ~60 (GPT-5.5 Pro class). We matched 38 of 64 pairs directly. The 26 unmatched are mostly older Claude 4 generations, now superseded.

If smarter models needed less help, we would see a negative slope. We see a faintly positive one. Over the range we sampled, the harness premium does not shrink as model capability rises. A GPT-5.5-class model under the right scaffold beats itself under the wrong one by roughly the same margin a Claude 3-class model would.
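
For readers who want to see what the sign test amounts to, an ordinary least-squares slope needs only the standard library. The data below are placeholders chosen for illustration, not the 38 matched pairs, and `ols_slope` is our own helper name.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more matched points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Placeholder data: AAII on x, harness delta in pp on y.
aaii = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
delta_pp = [12.0, 18.0, 14.0, 22.0, 19.0, 25.0]

slope = ols_slope(aaii, delta_pp)  # pp of harness delta per AAII point
```

A negative `slope` would support the "smarter models need less help" objection; a slope at or above zero does not.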

The gap is widening, not narrowing

All 64 pairs by base-model release date. Slope of the OLS fit is +2.6 pp per year: scaffolds pull away from base models, not converge.

Release dates from Artificial Analysis and model release notes.

Each dot is anchored to its base model's release date, not the harness submission date. Data spans Mar 2023 (GPT-4 1106) to Oct 2025 (Claude Sonnet 4.5). Newer models are over-represented because they have more leaderboard submissions.

The convergence story (scaffolds hit diminishing returns as base models improve) predicts a falling slope. We get a rising one. Scaffold research is outpacing model research on agentic tasks. Caveat: most new scaffolds in our dataset (SWE-Agent, OpenHands, Refact.ai, EPAM AI/Run, ACoder) are code-specific, where the design space is widest. Whether the trend holds outside code is the open question.

How much is the gap widening, and where?

The previous section showed the gap is not closing. The pooled view quantifies how much it has grown, and the per-benchmark view shows where. Smoothed-median trends across 64 pairs and nine benchmarks put the gap between vanilla and best-harness scores at +20 pp in late 2023, widening to +23 pp by late 2025.

The premium grew from +20 pp to +23 pp

Every pair as a dot. LOWESS-smoothed median trends (60% span) pooled across nine benchmarks. The shaded band is the gap: extra score the harness adds.

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Smoothing: LOWESS, 60% span over day-precise release dates.

The vanilla scaffold's median rose from ~5% (best a 2023 model could do in a bare agent loop) to ~42% (best from a late-2025 Claude in the same loop). The best-harness median rose from ~26% to ~76%. The harness kept adding more on top, not less.

That is the pooled view. Decompose per benchmark and the picture sharpens: for each eval, fit two OLS slopes over release date and plot each benchmark as one dot in slope-vs-slope space.

Two benchmarks where the harness is winning, one where the model is

Each dot is one benchmark; dot size scales with sample size. Bold labels mark benchmarks whose slope gap is statistically distinguishable from zero; dimmed labels are noise at current n.

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Bootstrap CIs computed on this dataset (n = 4 to 14 per benchmark).

Three benchmarks have CIs that don't cross the diagonal. CORE-Bench Hard (harness +6.0 pp/mo, vanilla +3.2) and GAIA (harness +4.2, vanilla +0.3) sit above the diagonal; the harness is winning on both. SWE-bench Lite sits below; the model is. The other six cluster on the diagonal with overlapping CIs; their gaps are indistinguishable from zero at n = 4–14.
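
"CIs that don't cross the diagonal" is equivalent to a confidence interval on the slope gap (harness slope minus vanilla slope) that excludes zero. The sketch below shows one way to get such an interval, assuming a percentile bootstrap over pairs; the exact bootstrap variant used for the chart is not stated here, and the rows are placeholders.

```python
import random

def slope(xs, ys):
    """OLS slope of y on x, or None when every x is identical."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None  # degenerate resample: every point shares one date
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def slope_gap(rows):
    """rows: (months_since_start, vanilla_pct, harness_pct) per pair."""
    xs = [r[0] for r in rows]
    s_vanilla = slope(xs, [r[1] for r in rows])
    s_harness = slope(xs, [r[2] for r in rows])
    return None if s_vanilla is None else s_harness - s_vanilla

def bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the slope gap, resampling whole pairs."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        gap = slope_gap(sample)
        if gap is not None:
            gaps.append(gap)
    gaps.sort()
    low = gaps[int((alpha / 2) * len(gaps))]
    high = gaps[int((1 - alpha / 2) * len(gaps)) - 1]
    return low, high

# Placeholder benchmark with five pairs, not data from this piece.
rows = [(0, 10, 20), (6, 12, 40), (12, 13, 62), (18, 15, 80), (24, 16, 101)]
ci_low, ci_high = bootstrap_ci(rows)
harness_winning = ci_low > 0  # whole interval above zero
```

With n between 4 and 14, resampled slopes swing widely, which is why most benchmarks end up with intervals that straddle zero.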

Where the scaffold's design space is largely solved, base-model gains dominate; where it isn't, the scaffold keeps pulling ahead. SWE-bench is the most-studied agentic eval, with roughly two and a half years of harness iteration since its late-2023 release. CORE-Bench is newer, less crowded, and its best entry is Claude Code, which Anthropic redesigned for research replication six months ago. GAIA's top scaffold is the HAL Generalist Agent, similarly young. Whether the pattern survives more releases is the open question.

Two caveats. First, n is small (4 to 14 per benchmark); the wide error crosses on most dots make this visible. Second, CORE-Bench's "harness winning" result is Anthropic's home turf. The same lab makes both the model and the harness, so the slope can read as scaffold research outpacing model research, or as one vertically integrated lab outpacing the field.

In the literature

Three recent papers measure the same phenomenon at different layers of the agent stack. Together they corroborate our finding, sharpen one of the implications, and quantify a limitation.

The convergent finding

A skill is not a harness. A harness wraps the model and runs the agent loop. A skill is a composable module (a markdown file plus optional scripts) that the harness loads at runtime to specialise for a task. The harness sits one layer above the model; the skill sits one layer below the harness.

SkillsBench (Feb 2026) measures the lift from curated skills across seven model × harness configurations on 84 tasks across 11 domains. Their average lift is +16.2 pp. Our dataset reports a median of +15.6 pp at the layer above. Two independent groups, two adjacent layers, the same magnitude.

Layer	What it adds	Paper	Effect
Harness	Agent loop that wraps the model	This piece (64 pairs)	+15.6 pp median
Skill	Composable module the harness loads at runtime	SkillsBench (7 configs × 84 tasks)	+16.2 pp average

The agent stack compounds. Each layer contributes roughly 15 pp on top of the model on the kinds of tasks both papers measure.

Anthropic's edge extends down a layer

SkillsBench reports that Claude Code shows the most consistent skill uplift (+13.9 to +23.3 pp), and that Codex CLI "frequently neglects provided Skills". Gemini CLI also uses skills reliably (+13.6 to +17.4 pp) in their tests. So the cleaner reading is: Claude Code and Gemini CLI consume the skill layer; Codex CLI does not. The Anthropic premium claim is narrower than "best on every front". It is best in the public same-model harness comparison we have.

SoK: Agentic Skills (Feb 2026) adds two pieces of nuance. First, self-generated skills (skills an agent writes for itself) lower performance by an average of 1.3 pp. Curation matters. Second, per-domain variance is large: healthcare skills add +51.9 pp, manufacturing +41.9 pp, software engineering only +4.5 pp. The harness Δ also varies by benchmark (our Online Mind2Web median is 8 pp vs SWE-bench Verified Mini's 31 pp). Both layers reward domain-specific design effort.

The counter-finding

SWE-Skills-Bench (Mar 2026) tests whether skills help on real-world software engineering rather than agentic benchmarks. The answer is mostly no. Of 49 curated skills tested on Claude Haiku 4.5 + Claude Code, 39 produced zero pass-rate improvement and the average gain was +1.2 pp. Three skills hurt performance. Set against SkillsBench's +16.2 pp, the skill lift shrinks by roughly an order of magnitude on in-the-wild tasks.

The reproducibility caveat

OAgents (ICML 2025) dissects agent design components on GAIA and BrowseComp. Their finding: "the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs." Our dataset is a single snapshot per pair. Individual entries carry run-to-run noise. The medians and IQRs in this piece are less noisy than any single row.

Implications

Five claims follow from the data above. Each is anchored to a specific pair. None is a recommendation.

01 · Pricing

The harness layer is undercapitalised relative to its share of score.

Model labs have raised more than $200B cumulatively. Harness-layer companies have raised a small fraction of that. The harness accounts for 30–50% of score on agentic benchmarks where the comparison is possible.

Anchor: o4-mini Low swings from 6% (HAL Generalist) to 54% (SWE-Agent) on SWE-bench Verified Mini. +48 pp on the same model.

02 · Integration

Anthropic's same-model premium is in the public record; competitors' is not (yet).

On CORE-Bench Hard, Claude Code beats a generic CORE-Agent by 18–36 pp across the three Claude generations HAL has tested with Claude Code. OpenAI and Google have not published equivalent same-model harness comparisons in this format. Model-only leaderboards miss this layer entirely.

Anchor: Opus 4.5 + Claude Code 77.8% vs Opus 4.5 + CORE-Agent 42.2%.

03 · Verticals

Code scaffolds are crowded; other verticals have neither a benchmark nor a winner.

The market has priced the harness layer in code. Cursor's last round was ~$29B (Nov 2025), with April 2026 reporting putting it in talks at ~$50B; Cognition closed at ~$10B (Sept 2025) and is reportedly in talks near $25B. The pricing reflects the same asymmetry the data shows: a public leaderboard with a measurable Δ on every release. Legal, clinical, finance, and accounting agents have no such leaderboard, so the Δ is unobserved and the corresponding pricing isn't there.

Anchor: TAU-bench Tool Calling beats HAL Generalist on customer-service tasks by 14–38 pp across 9 models, in a vertical where the leaderboard does exist.

04 · API margins

Pure-API providers without a harness story are commoditised.

If a third-party harness explains 30%+ of the score and anyone can buy it, the model API is a commodity input. Cohere, Together, and Mistral La Plateforme are the most exposed. OpenRouter and Portkey are picks-and-shovels on the same trend.

Anchor: o4-mini Low scores 6% with HAL Generalist and 54% with SWE-Agent.

05 · Open source

The top harnesses are open-source; the proprietary ones build on them.

SWE-Agent, OpenHands, Aider, and Browser-Use are open-source frameworks that have repeatedly anchored top SWE-bench submissions. The proprietary scaffolds at the top of the same boards (Live-SWE-agent, Augment, and others) share open-source design heritage. The frontier design lineage is in public repos, which means the data and the code behind a top score are both observable.

Anchor: The SWE-bench Verified top board mixes proprietary scaffolds (Live-SWE-agent, Augment) with open frameworks; both lineages trace to open-source designs.

Risks to the thesis

Four caveats, in order of how much they undercut the thesis.

Benchmark gaming. SWE-Skills-Bench (Mar 2026) reports that curated skills add only +1.2 pp on real-world software engineering, against +16.2 pp under SkillsBench's benchmark conditions. The harness Δ likely shrinks an order of magnitude in the wild.
Selection bias. HAL and SWE-bench leaderboards over-represent submissions optimised for the leaderboard. Production deployments use less-tuned scaffolds. The harness Δ is an upper bound, not an average.
Distribution can override technical advantage. IDE penetration, default settings, and enterprise sales let model labs absorb harness value back. Microsoft + Copilot is the precedent.
Eventual convergence. If frontier models internalise agent loops natively, the harness Δ shrinks. o3 and Claude Sonnet 4.5+ trend this way. We have not seen the gap close in 18 months, but it could.

Methodology

A pair is two entries on the same public leaderboard where (1) the base model is identical after normalising version strings, (2) the benchmark is identical, and (3) the agentic framework differs. Reasoning-effort, sample-count, and skill-toggle changes are excluded; they are not framework changes.
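
The three conditions translate into a group-and-compare step. The sketch below is an illustration under assumptions: the row fields, the `normalise_model` rule, and the choice to emit every framework combination are ours, and only the o4-mini Low scores come from this piece.

```python
import re
from collections import defaultdict
from itertools import combinations

def normalise_model(name: str) -> str:
    """Hypothetical normaliser: lowercase and collapse separators to '-'."""
    return re.sub(r"[\s_\-]+", "-", name.strip().lower())

def build_pairs(rows):
    """rows: dicts with benchmark, model, framework, score_pct."""
    groups = defaultdict(list)
    for row in rows:
        # Conditions (1) and (2): same normalised model, same benchmark.
        groups[(row["benchmark"], normalise_model(row["model"]))].append(row)
    pairs = []
    for (benchmark, model), entries in groups.items():
        for a, b in combinations(entries, 2):
            # Condition (3): the agentic framework must differ.
            if a["framework"] != b["framework"]:
                delta = abs(a["score_pct"] - b["score_pct"])
                pairs.append((benchmark, model, a["framework"], b["framework"], delta))
    return pairs

rows = [
    {"benchmark": "SWE-bench Verified Mini", "model": "o4-mini Low",
     "framework": "HAL Generalist", "score_pct": 6.0},
    {"benchmark": "SWE-bench Verified Mini", "model": "o4-mini_low",
     "framework": "SWE-Agent", "score_pct": 54.0},
    # Placeholder row: a different reasoning effort is a different model string.
    {"benchmark": "SWE-bench Verified Mini", "model": "o4-mini High",
     "framework": "SWE-Agent", "score_pct": 50.0},
]
pairs = build_pairs(rows)
```

Keeping the reasoning-effort label inside the model string is one simple way to stop effort changes from being counted as framework changes.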

Sources: HAL leaderboards (Princeton, six benchmarks, scraped via Playwright); swe-bench/experiments repository (Verified, Lite, Multimodal splits, scored as n_resolved / n_total, then lowest- vs highest-scoring system per model); ARC Prize leaderboard. Cost-efficiency: 43 / 64 pairs where HAL publishes COST (USD) for both harnesses. AAII source: Artificial Analysis snapshot 2026-05-09 (38 / 64 pairs matched).

Numerical notes & what's excluded

Headline numbers are over all 64 pairs: median |Δ| 15.6 pp; mean 18.6 pp; p90 37.3 pp; max 48 pp. 70% of pairs see ≥10 pp, 39% see ≥20 pp.
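
The headline statistics are standard summaries of the 64 absolute deltas. The sketch below shows one way to compute them; the linear-interpolation percentile is an assumption (other p90 conventions give slightly different values), and the input list is a placeholder, not the dataset.

```python
from statistics import mean, median

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def summarise(abs_deltas):
    """Headline summary of absolute harness deltas, in percentage points."""
    n = len(abs_deltas)
    return {
        "median": median(abs_deltas),
        "mean": mean(abs_deltas),
        "p90": percentile(abs_deltas, 90),
        "max": max(abs_deltas),
        "share_ge_10": sum(d >= 10 for d in abs_deltas) / n,
        "share_ge_20": sum(d >= 20 for d in abs_deltas) / n,
    }

# Placeholder deltas in pp, not the 64 pairs.
stats = summarise([2.0, 8.0, 12.0, 16.0, 22.0, 48.0])
```

Running `summarise` over the `delta_pp` column of the pairs file would be the natural check against the figures quoted above, assuming the column holds signed or absolute deltas in pp.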

Run-to-run noise. Each pair is a single snapshot. OAgents (ICML 2025) reports significant variance on the same model + agent + benchmark combination across re-runs. Medians and IQRs in this piece are less noisy than any single row.

Excluded from the dataset. OSWorld (113 entries parsed but no qualifying pairs, since variation comes from step-budget changes); Cybench (single entry per model); Anthropic / OpenAI internal evals (not public); ARC-AGI-2 (only one same-model pair available, dropped to avoid distorting per-benchmark statistics).

Skill vs harness. We measure framework-vs-framework effects at the harness layer. SkillsBench measures skill-vs-no-skill effects one layer below.

Scripts: research/harness-effect/scripts/. Dataset: research/harness-effect/data/harness_pairs.csv.