Two horses both labelled Claude Opus 4.5 on a steeplechase: one pulling a Generic Agent sulky, the other ridden by a Claude Code jockey jumping GAIA, SWE-Bench, and CORE-Bench fences.
May 2026 Benchmarks ~9 min read

The Harness Moves the Score

Claude Opus 4.5 on CORE-Bench Hard scores 42% with a generic agent and 78% with Claude Code. Same model, same benchmark. We collected 64 same-model, different-wrapper pairs from HAL, swe-bench/experiments, and the ARC Prize board: median harness Δ is 15.6 pp, 90th percentile 39 pp, max 48 pp. The wrapper often does more work than the model release notes admit. Independent confirmation arrives from SkillsBench (+16.2 pp at the skill layer below), and the gap widens as models get smarter and scaffolds mature, not narrows.

Read the analysis →
Public-market backdoor to the AI frontier — two cows labeled Anthropic and OpenAI feeding milk into labelled cans for the public shareholders.
April 2026 Markets ~8 min read

How Public Markets Already Own the AI Frontier

Anthropic and OpenAI are private — but their cap tables sit on six US-listed names. We map who owns what, how much of each headline number is real cash vs. accounting markup, and where the most concentrated public-market exposure to either lab actually lives. Includes both ownership cards and the four caveats every reader should know.

Read the analysis →
A bestiary of METR task complexity across frontier models — pedestals ascending toward a fire-breathing dragon labeled MYTHOS
April 2026 Benchmarks ~7 min read

How Long Can Claude Mythos Work Alone?

METR's autonomous-task benchmark hasn't scored Claude Mythos yet, but every other frontier model has both a METR time horizon and an IRT (composite benchmark ability) score. We fit four regression families across 19 model points and used the piecewise prediction to land Mythos at roughly 15 hours of unsupervised work. Under the trend acceleration implied by Anthropic's Opus 4 → 4.5 → 4.6 cadence, that figure climbs toward 30. Includes the full regression panel, leave-one-out cross-validation, and a parallel forecast for Meta's Muse Spark.

Read the analysis →
Composer 2 vs Claude — Roman-colosseum illustration of Cursor's Kimi-K2.5-backed model facing Claude with pricing tablets
March 2026 Adoption ~8 min read

Moonshot Built the Engine. Cursor Sold the Car.

On March 19, Cursor shipped Composer 2 without naming the base. A developer found Kimi K2.5 underneath within 24 hours. The omission was the story: Moonshot spent $8.8M training K2.5 and moved coding benchmarks by an average of 26.7 points; Cursor spent $26–35M wrapping it and moved them by 5.8. The model layer is where capability lives. The app layer is where it gets paid.

Read the analysis →