AI Market Intelligence · March 2026

Moonshot Built the Engine. Cursor Sold the Car.

On March 19, Cursor shipped its new coding model on top of a Chinese open-weight base. The pricing gap, and what it says about the API business model, is the real story.

March 20, 2026 · Traced in real time through 6 independent data feeds

Composer 2 vs Claude: the build-vs-buy face-off, with pricing tablets in the foreground.
$8.8M · Training Kimi K2.5 (Moonshot)
$35M · Wrapping it (Cursor)
$2B+ · Selling it (Cursor ARR)

The discovery

Cursor shipped Composer 2 without naming the base. A developer found Kimi K2.5 underneath within 24 hours. The discovery turned a launch into a value-capture story.

Cursor launched Composer 2 on March 19, 2026 to its 1M+ daily active users. The blog post credited "continued pre-training of a base model, combined with reinforcement learning" but did not name it. Within 24 hours, a developer intercepted the model ID in Cursor's API responses (kimi-k2p5-rl-0317-s515-fast) and identified the base as Kimi K2.5, an open-weight model from Moonshot AI.

Moonshot spent $8.8M training Kimi K2.5 and moved coding benchmarks by an average of 26.7 points. Cursor spent an estimated $26–35M of post-training on top and moved them by 5.8. The capability lives in the base; the revenue accrues to the wrapper.


Is Composer 2 the cheapest?

Composer 2 is cheaper than Claude, but it isn't the cheapest, and the landscape is more crowded than it looks.

API pricing across the coding model landscape
Per million tokens, sorted by input price. Log scale. Composer 2 / Kimi in navy; all other models in gray.

Composer 2's standard tier ($0.50/$2.50 per million input/output tokens) is not the default; the "fast" variant at $1.50/$7.50 is. At that price it costs half as much as Sonnet 4.6 ($3/$15), not the one-tenth that launch headlines suggested by comparing the standard tier to Claude Opus. DeepSeek V3.2 ($0.28/$0.42) already undercuts both Composer 2 tiers with comparable coding scores.
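The price comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using only the list prices quoted in this article; the 75/25 input/output traffic mix is an assumption for illustration, not a measured figure, and the function names are hypothetical.

```python
# Prices in USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "Composer 2 (standard)": (0.50, 2.50),
    "Composer 2 (fast)": (1.50, 7.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek V3.2": (0.28, 0.42),
}


def blended_price(model: str, input_share: float = 0.75) -> float:
    """Blended $/M tokens. input_share is an assumed traffic mix, not a measured one."""
    inp, out = PRICES[model]
    return input_share * inp + (1.0 - input_share) * out


def price_ratio(expensive: str, cheap: str, input_share: float = 0.75) -> float:
    """How many times more the first model costs than the second at the same mix."""
    return blended_price(expensive, input_share) / blended_price(cheap, input_share)


if __name__ == "__main__":
    for other in ("Claude Sonnet 4.6", "Composer 2 (standard)", "DeepSeek V3.2"):
        ratio = price_ratio(other, "Composer 2 (fast)")
        print(f"{other} vs Composer 2 (fast): {ratio:.2f}x")
```

Because Sonnet 4.6 is exactly twice the fast tier on both input and output, the 2× ratio holds for any traffic mix. The other ratios shift with the assumed input share.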

The more relevant question isn't "is Composer 2 cheaper than Claude?" It's "what else can you get at the same price, and how does it compare?"

Composer 2 trades blows with the expensive models on coding benchmarks that resist contamination (Terminal-Bench 2.0, SWE-bench Multilingual, LiveCodeBench v6, Next.js Evals).

Competitive on coding benchmarks, despite the price gap
Higher is better. Sorted by score within each panel. Composer 2 and Kimi models in blue.
K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveats: K2.5 benchmarks are self-reported from model card/paper, not yet on public SWE-bench or Aider leaderboards. SWE-bench Verified excluded: OpenAI retired it in February 2026 citing contamination.

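Composer 2 beats Claude Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 59.3) and loses on SWE-bench Multilingual (73.7 vs 77.5). Cursor's own table below lists slightly different Opus 4.6 scores (58.0 and 77.8), so read the margins as approximate. The raw K2.5 base, before Cursor's fine-tuning, leads on LiveCodeBench v6 (85.0 vs 82.2).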

Cursor also publishes scores on CursorBench, their proprietary internal evaluation suite. It uses real user sessions sourced via "Cursor Blame" (tracing committed code to agent requests), with tasks averaging 352 lines across 8 files, substantially larger than SWE-bench tasks.

Model | CursorBench | Terminal-Bench 2.0 | SWE-bench ML
Composer 2 | 61.3 | 61.7 | 73.7
Claude Opus 4.6 | 58.2 | 58.0 | 77.8
Composer 1.5 | 44.2 | 47.9 | 65.9
Composer 1 | 38.0 | 40.0 | 56.9

Source: cursor.com/blog/cursorbench. CursorBench is proprietary; only Cursor can run it.

How much does the harness add?

An agent harness is the scaffolding around the model (file access, tool calls, context retrieval) that turns raw token generation into useful coding work. Most benchmarks score the model and harness together, hiding which is doing the work. If the harness moves scores as much as the model does, Cursor's wrapper is its own value layer, separate from the post-training. Vercel's Next.js Evals (open source on GitHub) helps separate the two by scoring each model-agent pair with and without project documentation (AGENTS.md):

Model | Agent | Baseline | With AGENTS.md
GPT 5.3 Codex | Codex | 86% | 100%
GPT 5.4 | Codex | 86% | 95%
Composer 2 | Cursor | 76% | 95%
Gemini 3.1 Pro | Gemini CLI | 76% | 100%
Claude Opus 4.6 | Claude Code | 71% | 100%
Claude Sonnet 4.6 | Claude Code | 67% | 100%
Kimi K2.5 | OpenCode | 19% | 52%

Source: nextjs.org/evals, github.com/vercel/next-evals-oss

K2.5 scores 19% at baseline; Composer 2 scores 76%. That +57-point gap is far larger than the +0.7 on SWE-bench ML, but the two runs use different agents (OpenCode vs Cursor), so it reflects Cursor's RL and Cursor's harness together rather than RL alone. With documentation (AGENTS.md), Claude reaches 100% while Composer 2 reaches 95%. The agent harness and context retrieval may matter as much as the model.
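To make the two kinds of comparison explicit, the sketch below recomputes them from the Next.js Evals table above. The documentation uplift holds model and agent fixed; the baseline gap between two rows does not, which is why it cannot be attributed to the model alone. Function names are hypothetical and the scores are copied from this article, not pulled from the eval repository.

```python
# (model, agent, baseline %, with AGENTS.md %), copied from the table above.
RESULTS = [
    ("GPT 5.3 Codex", "Codex", 86, 100),
    ("GPT 5.4", "Codex", 86, 95),
    ("Composer 2", "Cursor", 76, 95),
    ("Gemini 3.1 Pro", "Gemini CLI", 76, 100),
    ("Claude Opus 4.6", "Claude Code", 71, 100),
    ("Claude Sonnet 4.6", "Claude Code", 67, 100),
    ("Kimi K2.5", "OpenCode", 19, 52),
]


def docs_uplift(model: str) -> int:
    """Points gained when AGENTS.md is added, same model and same agent."""
    for name, _agent, baseline, with_docs in RESULTS:
        if name == model:
            return with_docs - baseline
    raise KeyError(model)


def baseline_gap(model_a: str, model_b: str) -> int:
    """Baseline difference between two rows; confounded if their agents differ."""
    scores = {name: baseline for name, _agent, baseline, _docs in RESULTS}
    return scores[model_a] - scores[model_b]


if __name__ == "__main__":
    for name, agent, baseline, with_docs in RESULTS:
        uplift = with_docs - baseline
        print(f"{name:18} ({agent:11}) baseline {baseline:3d}%  +{uplift} pts with docs")
    print("Composer 2 vs K2.5 baseline gap:", baseline_gap("Composer 2", "Kimi K2.5"))
```

On these numbers, documentation alone adds between 9 and 33 points depending on the row, which is the same order of magnitude as most of the model-to-model gaps.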

For developers, Composer 2 is the better default for scaffolding, CRUD endpoints, and routine refactors: 95% of Claude's score on the documented Next.js evals at half the price. Claude is still worth paying for on architecture decisions, concurrency bugs, and anything safety-critical, where the 5-point gap is the difference between shipping and shipping a regression. Pick per task, not per project.


From open-weight base to production agent

Cursor didn't just wrap Kimi K2.5. By its own account, it put roughly 3–4× the base model's compute into continued pre-training and RL.

Aman Sanger's tweet disclosed the key details: Cursor evaluated multiple base models on perplexity, chose K2.5, then applied "continued pre-training and high-compute RL (a 4× scale-up)."

K2.5 is a substantial release, not an incremental update: a full re-pretrain with expanded context (tech report).

Moonshot trained K2.5 for ~$8.8M. Cursor then ran "4× the compute" on top. If the 4× includes the base, Cursor's incremental spend is about $26M; if it is 4× on top of the base, about $35M. That puts the total for Composer 2 at $35–44M.

Stage | Compute | Est. cost | How derived | Who paid
K2.5 pre-training | 3.84×10^24 FLOPs | ~$8.8M | Operation counting from tech report: 8 × 32B × 15T | Moonshot AI
Cursor CPT + RL (applied to K2.5) | ~1.0–1.5×10^25 FLOPs (≈ 3–4× the K2.5 base) | ~$26–35M | Sanger's "4×" claim applied to the $8.8M base. Reading B: 4× includes the base → Cursor adds ~$26M. Reading A: 4× on top of the base → Cursor adds ~$35M. | Cursor via Fireworks AI
Total Composer 2 | ~1.4–1.9×10^25 FLOPs (≈ 4–5× the K2.5 base) | ~$35–44M | $8.8M base + Cursor's scale-up. Reading B: $8.8M + $26M = $35M. Reading A: $8.8M + $35M = $44M. | Moonshot ~20–25% / Cursor ~75–80%
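The two readings of the "4×" claim are easy to mix up, so here is a minimal sketch of the arithmetic behind the table above. It takes the article's $8.8M base estimate and the 8 × 32B × 15T operation count as given and treats dollar cost as proportional to compute, which is an assumption; the function names are hypothetical.

```python
BASE_COST_M = 8.8  # Moonshot's estimated K2.5 training cost in $M (from this article)
SCALE = 4.0        # the "4x scale-up" from Sanger's tweet


def k25_pretraining_flops(ops: float = 8, active_params: float = 32e9,
                          tokens: float = 15e12) -> float:
    """Operation count used in the table: 8 x 32B active parameters x 15T tokens."""
    return ops * active_params * tokens


def reading_a(base: float = BASE_COST_M, scale: float = SCALE) -> tuple[float, float]:
    """4x ON TOP of the base: returns (Cursor's added spend, total)."""
    cursor = scale * base
    return cursor, base + cursor


def reading_b(base: float = BASE_COST_M, scale: float = SCALE) -> tuple[float, float]:
    """4x INCLUDES the base: returns (Cursor's added spend, total)."""
    total = scale * base
    return total - base, total


if __name__ == "__main__":
    print(f"K2.5 pre-training: {k25_pretraining_flops():.2e} FLOPs")
    for label, (cursor, total) in (("A", reading_a()), ("B", reading_b())):
        share = BASE_COST_M / total
        print(f"Reading {label}: Cursor adds ${cursor:.1f}M, total ${total:.1f}M, "
              f"Moonshot share {share:.0%}")
```

Reading A gives $35.2M added and $44.0M total; Reading B gives $26.4M added and $35.2M total, which is where the article's rounded $26–35M and $35–44M ranges come from.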

For context, Composer 2's $35–44M sits below Llama 3.1 405B (~$53M est.), an order of magnitude below GPT-4.5 (~$340M est.) and Grok-4 (~$388M est.), and above DeepSeek V3 ($5.6M reported). These are final-run costs only.

The base is substitutable in theory, but switching means re-running the $26–35M RL pipeline on a new foundation with no guarantee the recipe transfers. Interchangeability is real but expensive.

With confirmed benchmarks for all three stages (K2, K2.5, and Composer 2), we can measure what each step contributed on the benchmarks where every stage has a score:

Each step in the pipeline adds measurable performance
Confirmed scores only. K2 → K2.5 (Moonshot's multimodal retrain) → Composer 2 (Cursor's RL). Opus 4.6 for reference.
K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).

Does Composer 2 pay for itself?

At $2B+ ARR, Cursor's gross margin hinges on the model mix. Users pick (or Auto picks for them); the share running through Composer 2 is the margin lever.

Cursor reportedly surpassed $2B in ARR in early 2026. Users can pick any model inside Cursor (Claude, GPT, Composer 2, Gemini), but Cursor's cost per token varies sharply by choice: Composer 2 runs through Fireworks at a fraction of what Cursor pays Anthropic for Claude. If inference consumes 50% of revenue (a common ratio for AI-native products), that's a ~$1B/year line item whose breakdown by model decides margin.

At fast-tier pricing of $1.50/$7.50 (the default), Composer 2's per-token cost is half of Sonnet 4.6's. The mix scenarios below show how that gap compounds:

How model mix would affect margin (thought experiment)
Illustrative scenarios at $2B ARR. Sonnet 4.6 stands in for any premium third-party model Cursor passes through; Composer 2 fast for Cursor-served traffic. Dashed line = 70% SaaS benchmark.
Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.

Cursor's long-term gross margin now comes down to a single curve: how much of the usage behind its $2B in ARR runs through Composer 2 instead of third-party models like Sonnet, Opus, or GPT-5.4. Each percentage point of traffic that shifts to Composer 2 fast roughly halves the per-token cost on that traffic. Under the 50%-of-revenue inference assumption, 50% adoption gets margin to ~63% and 80% reaches the 70% SaaS benchmark. The strategic bet isn't on the cost of one model. It's on user behavior bending toward the model Cursor controls.

We don't know Cursor's actual cost structure. These estimates assume 50% of $2B ARR goes to inference, a common ratio for AI-native products but unverified for Cursor. Fireworks' margin (estimated 30–50% gross) is also embedded in Composer 2's pricing.
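The mix scenarios above are consistent with a simple linear model, sketched here. This is a reconstruction that reproduces the ~63% and 70% figures, not necessarily the model behind the chart. It assumes that the 50% inference share describes an all-premium mix, that Composer 2 traffic costs half as much per token, and that inference is the only cost of revenue. All parameter names are hypothetical.

```python
def gross_margin(composer_share: float,
                 inference_share_of_revenue: float = 0.50,
                 composer_cost_ratio: float = 0.50) -> float:
    """Illustrative gross margin as a function of the Composer 2 traffic share.

    inference_share_of_revenue: inference cost / revenue when ALL traffic runs
        on a premium third-party model (assumed 50%, unverified for Cursor).
    composer_cost_ratio: Composer 2 cost per token relative to the premium
        model (assumed 0.5, from the fast-tier vs Sonnet 4.6 list prices).
    """
    if not 0.0 <= composer_share <= 1.0:
        raise ValueError("composer_share must be between 0 and 1")
    # Cost of the blended mix relative to an all-premium mix.
    relative_cost = (1.0 - composer_share) + composer_share * composer_cost_ratio
    return 1.0 - inference_share_of_revenue * relative_cost


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.50, 0.80, 1.0):
        print(f"Composer 2 share {share:4.0%} -> gross margin {gross_margin(share):.1%}")
```

Under these assumptions margin moves from 50% with no Composer 2 traffic to 75% with all of it, so every 10 points of share is worth about 2.5 points of margin. Changing either default shifts the whole line, which is why the unverified 50% inference ratio matters so much.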


Who captures the value?

Moonshot built the model. Cursor built the product. The value-capture gap runs the wrong direction.

Two companies, two roles in the same product:

Metric | Moonshot AI (built Kimi K2.5) | Cursor (built Composer 2 on top)
Valuation | ~$18B (target, Series D) | $29.3B
ARR / Revenue | ~$500M (est.); 20 days post-K2.5 > all of 2025 | $2B+; doubling every 2–3 months
K2.5 investment | ~$8.8M; pre-trained the base model | ~$26–35M; CPT + RL on top of K2.5

Sources: Moonshot: 36Kr, KR-Asia, TechCrunch · Cursor: TechCrunch, Stripe, DevGraphiq. Moonshot revenue is approximate. Cursor headcount estimated.

At first glance, this looks like Cursor is winning. But look at who moves the needle on capabilities:

Benchmark | Moonshot: K2 → K2.5 ($8.8M) | Cursor: K2.5 → Composer 2 ($26–35M)
Terminal-Bench 2.0 | +23.0 pts | +10.9 pts
SWE-bench Multilingual | +25.7 pts | +0.7 pts
LiveCodeBench v6 | +31.3 pts | no data
Avg gain per benchmark | +26.7 pts | +5.8 pts
Cost per point gained | $330K / pt | $4.5–6.0M / pt

Moonshot is 14–18× more efficient at producing benchmark gains. Some of that is diminishing returns (47→73 is easier than 73→74), but it also reflects the fundamental asymmetry: base model training is the hard, underpaid work.
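The cost-per-point figures and the 14–18× ratio follow directly from the table above, as this short sketch shows. It uses only numbers from this article. Note that Cursor's average covers two benchmarks and Moonshot's covers three, because no LiveCodeBench v6 score is available for Composer 2.

```python
# Benchmark gains in points, copied from the table above.
MOONSHOT_GAINS = {
    "Terminal-Bench 2.0": 23.0,
    "SWE-bench Multilingual": 25.7,
    "LiveCodeBench v6": 31.3,
}
CURSOR_GAINS = {
    "Terminal-Bench 2.0": 10.9,
    "SWE-bench Multilingual": 0.7,
}

MOONSHOT_COST_M = 8.8              # $M, article estimate
CURSOR_COST_RANGE_M = (26.0, 35.0)  # $M, low and high readings of the "4x" claim


def avg_gain(gains: dict[str, float]) -> float:
    """Mean gain across the benchmarks that have data."""
    return sum(gains.values()) / len(gains)


def cost_per_point(cost_m: float, gains: dict[str, float]) -> float:
    """Estimated spend in $M per average benchmark point gained."""
    return cost_m / avg_gain(gains)


if __name__ == "__main__":
    moonshot = cost_per_point(MOONSHOT_COST_M, MOONSHOT_GAINS)
    print(f"Moonshot: {avg_gain(MOONSHOT_GAINS):.1f} pts avg, ${moonshot * 1000:.0f}K per pt")
    for cost in CURSOR_COST_RANGE_M:
        cursor = cost_per_point(cost, CURSOR_GAINS)
        print(f"Cursor at ${cost:.0f}M: ${cursor:.1f}M per pt, "
              f"{cursor / moonshot:.1f}x Moonshot's cost per point")
```

The ratio is sensitive to the 0.7-point SWE-bench Multilingual gain: a benchmark where Composer 2 moved more would pull Cursor's cost per point down sharply.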

Both companies are winning, on different units. Moonshot's open-weight release of K2.5 triggered a 20-day revenue surge that exceeded all of 2025. Open-weighting was the distribution strategy, not charity. Cursor turned $26–35M of RL into $2B+ of ARR; the wrapper captures the recurring rent. The 14–18× efficiency advantage and the 4× revenue advantage are both real; they sit at different layers of the stack.

The durable question is which layer defends its share. Nathan Lambert (Interconnects) frames the squeeze on Cursor's side: "post-training got more popular because there was more low-hanging fruit. A lot of that potential has been realized." If the wrapper is a thinning layer (and Cursor's +5.8-point average gain vs Moonshot's +26.7 suggests it is), the rent defense gets harder over time. The squeeze on Moonshot's side is the inverse: if labs move up the stack and ship their own application surface (Moonshot already has the Kimi consumer app), the wrapper's distribution moat narrows. Whether Cursor remains the best at picking open-weight bases and bolting RL on top, or whether the labs reclaim the user, is the bet investors are pricing.


Who's exposed?

The model layer is becoming a commodity. The question is who captures the value that used to sit there.

Who | Implication | Signal
Cursor / AI-native apps | Margin expansion, reduced vendor lock-in, model optionality | Positive
Inference providers (Fireworks, Together) | Growing demand as apps shift from API to hosted open-weight | Positive
Anthropic / OpenAI API revenue | Revenue concentration risk if top customers can switch at will | Watch
Open-weight labs (Moonshot, DeepSeek) | Ecosystem adoption, but limited direct monetization | Mixed

The Cursor switch isn't an isolated event. It's the first high-profile instance of a pattern that will repeat: AI-native companies evaluating open-weight bases, applying proprietary fine-tuning, and serving through specialized inference providers, cutting the frontier lab's API out of the loop.

The question for Anthropic and OpenAI isn't whether their models are better. On most benchmarks, they still are. The question is whether "better" justifies 2–5× the price, and for how much longer.


What third parties confirm

Most numbers in this piece come from Cursor, Moonshot, or Fireworks. Here's the subset that independent sources back up.

Signal | Data | Independent source
Developer adoption | Cursor at 18% usage (vs Copilot ~42%) | Stack Overflow 2025, JetBrains 2025
Revenue trajectory | $100M → $500M → $1B → $2B+ ARR (Jan '25 → Feb '26) | Stripe case study, TechCrunch
Enterprise usage | Salesforce: 20K engineers, >90% usage rate | Pragmatic Engineer
Web traffic | cursor.com: #14 in AI tools, #3,004 globally (Oct 2025) | SimilarWeb
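As a rough sanity check on the revenue trajectory row, the average doubling time can be backed out from the endpoints. This sketch assumes a 13-month span for "Jan '25 → Feb '26" and smooth exponential growth between $100M and $2B; both are simplifications, and the function name is hypothetical.

```python
import math


def months_per_doubling(start_arr: float, end_arr: float, months: float) -> float:
    """Average months per doubling, assuming smooth exponential growth."""
    if start_arr <= 0 or end_arr <= start_arr or months <= 0:
        raise ValueError("need 0 < start_arr < end_arr and months > 0")
    doublings = math.log2(end_arr / start_arr)
    return months / doublings


if __name__ == "__main__":
    # $100M -> $2B ARR over an assumed 13 months (Jan 2025 to Feb 2026).
    pace = months_per_doubling(100e6, 2e9, 13)
    print(f"Average doubling time: {pace:.1f} months")
```

On these endpoints the average works out to roughly three months per doubling, the slow end of the 2–3 month range quoted earlier. Individual stretches of the trajectory may have been faster.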

What we can't measure, but can infer: Cursor doesn't publish the model mix, but the pricing structure tilts it sharply toward Composer 2. Auto mode and Composer 2 are unmetered on paid plans; Claude and GPT draw from a separate API credit pool that runs down. Most users default to the unmetered path, and developer surveys consistently describe Claude as the escape hatch for hard tasks, not the everyday default. The exact ratio is private; the direction is determined by pricing, and it's the lever behind the margin math in "Does Composer 2 pay for itself?"