All published research
Long-form analysis of the AI capability race — model releases, benchmark breakthroughs, enterprise routing shifts, and what the data underneath actually says.
The Harness Moves the Score
Claude Opus 4.5 on CORE-Bench Hard scores 42% with a generic agent and 78% with Claude Code. Same model, same benchmark. We collected 64 same-model, different-wrapper pairs from HAL, swe-bench/experiments, and the ARC Prize board: median harness Δ is 15.6 pp, 90th percentile 39 pp, max 48 pp. The wrapper often does more work than the model release notes admit. Independent confirmation arrives from SkillsBench (+16.2 pp at the skill layer below), and the gap widens as models get smarter and scaffolds mature, not narrows.
Read the analysis →
How Public Markets Already Own the AI Frontier
Anthropic and OpenAI are private — but their cap tables sit on six US-listed names. We map who owns what, how much of each headline number is real cash vs. accounting markup, and where the most concentrated public-market exposure to either lab actually lives. Includes both ownership cards and the four caveats every reader should know.
Read the analysis →
How Long Can Claude Mythos Work Alone?
METR's autonomous-task benchmark hasn't scored Claude Mythos yet, but every other frontier model has both a METR time horizon and an IRT (composite benchmark ability) score. We fit four regression families across 19 model points and used the piecewise prediction to land Mythos at roughly 15 hours of unsupervised work. Under the trend acceleration implied by Anthropic's Opus 4 → 4.5 → 4.6 cadence, that figure climbs toward 30. Includes the full regression panel, leave-one-out cross-validation, and a parallel forecast for Meta's Muse Spark.
Read the analysis →
Moonshot Built the Engine. Cursor Sold the Car.
On March 19, Cursor shipped Composer 2 without naming the base. A developer found Kimi K2.5 underneath within 24 hours. The omission was the story: Moonshot spent $8.8M training K2.5 and moved coding benchmarks by an average of 26.7 points; Cursor spent $26–35M wrapping it and moved them by 5.8. The model layer is where capability lives. The app layer is where it gets paid.
Read the analysis →