AI Market Intelligence · March 2026

Moonshot Built the Engine. Cursor Sold the Car.

On March 19, Cursor shipped its new coding model on top of a Chinese open-weight base. The pricing gap, and what it says about the API business model, is the real story.

March 20, 2026 · Traced in real-time through 6 independent data feeds

Illustration: Composer 2 vs Claude, the build-vs-buy face-off, with pricing tablets in the foreground.

$8.8M: Training Kimi K2.5 (Moonshot)
$35M: Wrapping it (Cursor)
$2B+: Selling it (Cursor ARR)

The discovery

Cursor shipped Composer 2 without naming the base. A developer found Kimi K2.5 underneath within 24 hours. The discovery turned a launch into a value-capture story.

Cursor launched Composer 2 on March 19, 2026 to its 1M+ daily active users. The blog post credited "continued pre-training of a base model, combined with reinforcement learning" but did not name it. Within 24 hours, a developer intercepted the model ID in Cursor's API responses (kimi-k2p5-rl-0317-s515-fast) and identified the base as Kimi K2.5, an open-weight model from Moonshot AI.

Mar 19
Cursor ships Composer 2. Blog credits a generic "base model." No mention of Kimi.
Mar 19–20
Model ID discovered. Moonshot's head of pretraining publicly accuses Cursor of license violation.
Mar 20
Cursor responds. Co-founder Aman Sanger: "It was a miss to not mention the Kimi base in our blog. We'll fix that."
Mar 20
Resolution. Moonshot congratulates Cursor. Fireworks AI commercial agreement confirmed.

Moonshot spent $8.8M training Kimi K2.5 and moved coding benchmarks by an average of 26.7 points. Cursor spent an estimated $26–35M of post-training on top and moved them by 5.8. The capability lives in the base; the revenue accrues to the wrapper.

Is Composer 2 the cheapest?

Composer 2 is cheaper than Claude, but it isn't the cheapest, and the field is more crowded than it looks.

API pricing across coding models

Per million tokens, sorted by input price. Log scale. Composer 2 / Kimi in navy; all other models in gray.

OpenAI: openai.com/api/pricing · Anthropic: docs.anthropic.com · DeepSeek: api-docs.deepseek.com · Fireworks: fireworks.ai/kimi · Cursor: cursor.com/docs/models-and-pricing · Google: ai.google.dev

Composer 2's standard tier ($0.50/$2.50) is not the default; the "fast" variant at $1.50/$7.50 is. At that price, it costs half as much as Sonnet 4.6 ($3/$15), not the 10× gap launch headlines suggested by comparing the standard tier to Claude Opus. DeepSeek V3.2 ($0.28/$0.42) already undercuts both Composer 2 tiers, with comparable coding scores.
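As a quick check on those ratios, the sketch below recomputes them from the list prices quoted above. The 75/25 input-to-output token split is an illustrative assumption, not a Cursor figure; because the Composer 2 and Sonnet price pairs all share the same 1:5 input-to-output ratio, the split does not change their relative cost.

```python
# List prices in USD per million tokens (input, output), as quoted in the text.
PRICES = {
    "Composer 2 standard": (0.50, 2.50),
    "Composer 2 fast": (1.50, 7.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek V3.2": (0.28, 0.42),
}


def blended_price(model: str, input_share: float = 0.75) -> float:
    """Average price per million tokens for a given input/output mix.

    input_share is a hypothetical workload parameter (assumed 75% input tokens).
    """
    input_price, output_price = PRICES[model]
    return input_share * input_price + (1.0 - input_share) * output_price


def price_ratio(model_a: str, model_b: str, input_share: float = 0.75) -> float:
    """How many times more expensive model_a is than model_b."""
    return blended_price(model_a, input_share) / blended_price(model_b, input_share)


if __name__ == "__main__":
    for cheaper in ("Composer 2 fast", "Composer 2 standard", "DeepSeek V3.2"):
        ratio = price_ratio("Claude Sonnet 4.6", cheaper)
        print(f"Sonnet 4.6 vs {cheaper}: {ratio:.1f}x")
```

On these list prices, Sonnet 4.6 comes out at 2× the fast tier and 6× the standard tier regardless of the token mix.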

The more relevant question isn't "is Composer 2 cheaper than Claude?" It's "what else can you get at the same price, and how does it compare?"

Composer 2 trades blows with the expensive models on coding benchmarks that resist contamination (Terminal-Bench 2.0, SWE-bench Multilingual, LiveCodeBench v6, Next.js Evals).

Competitive on coding benchmarks, despite the price gap

Higher is better. Sorted by score within each panel. Composer 2 and Kimi models in blue.

K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveats: K2.5 benchmarks are self-reported from model card/paper, not yet on public SWE-bench or Aider leaderboards. SWE-bench Verified excluded: OpenAI retired it in February 2026 citing contamination.

Composer 2 beats Claude Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 59.3). It loses on SWE-bench Multilingual (73.7 vs 77.5). The raw K2.5 base, before Cursor's fine-tuning, leads on LiveCodeBench v6 (85.0 vs 82.2).

Cursor also publishes scores on CursorBench, their proprietary internal evaluation suite. It uses real user sessions sourced via "Cursor Blame" (tracing committed code to agent requests), with tasks averaging 352 lines across 8 files, substantially larger than SWE-bench tasks.

Model	CursorBench	Terminal-Bench 2.0	SWE-bench ML
Composer 2	61.3	61.7	73.7
Claude Opus 4.6	58.2	58.0	77.8
Composer 1.5	44.2	47.9	65.9
Composer 1	38.0	40.0	56.9

Source: cursor.com/blog/cursorbench. CursorBench is proprietary; only Cursor can run it. How CursorBench works →

How much does the harness add?

An agent harness is the scaffolding around the model (file access, tool calls, context retrieval) that turns raw token generation into useful coding work. Most benchmarks score the model and harness together, hiding which is doing the work. If the harness moves scores as much as the model does, Cursor's wrapper is its own value layer, separate from the post-training. Vercel's Next.js Evals (OSS on GitHub) isolates the two:

Model	Agent	Baseline	With AGENTS.md
GPT 5.3 Codex	Codex	86%	100%
GPT 5.4	Codex	86%	95%
Composer 2	Cursor	76%	95%
Gemini 3.1 Pro	Gemini CLI	76%	100%
Claude Opus 4.6	Claude Code	71%	100%
Claude Sonnet 4.6	Claude Code	67%	100%
Kimi K2.5	OpenCode	19%	52%

Source: nextjs.org/evals, github.com/vercel/next-evals-oss

K2.5 scores 19% baseline; Composer 2 scores 76%. That's a +57-point gap, far larger than the +0.7 on SWE-bench ML, though the two rows also run in different agents (OpenCode vs Cursor), so the gain reflects Cursor's RL and its harness together. With documentation (AGENTS.md), Claude reaches 100% while Composer 2 reaches 95%. The agent harness and context retrieval may matter as much as the model.
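The arithmetic behind those comparisons, using only the percentages in the table above, can be sketched as follows. The dictionary and function names are illustrative.

```python
# (agent, baseline %, with AGENTS.md %) copied from the Next.js Evals table.
NEXTJS_EVALS = {
    "GPT 5.3 Codex": ("Codex", 86, 100),
    "GPT 5.4": ("Codex", 86, 95),
    "Composer 2": ("Cursor", 76, 95),
    "Gemini 3.1 Pro": ("Gemini CLI", 76, 100),
    "Claude Opus 4.6": ("Claude Code", 71, 100),
    "Claude Sonnet 4.6": ("Claude Code", 67, 100),
    "Kimi K2.5": ("OpenCode", 19, 52),
}


def docs_uplift(model: str) -> int:
    """Points gained when the agent is given AGENTS.md documentation."""
    _, baseline, with_docs = NEXTJS_EVALS[model]
    return with_docs - baseline


def baseline_gap(model_a: str, model_b: str) -> int:
    """Baseline score of model_a minus baseline score of model_b."""
    return NEXTJS_EVALS[model_a][1] - NEXTJS_EVALS[model_b][1]


if __name__ == "__main__":
    print("Composer 2 vs K2.5 baseline gap:", baseline_gap("Composer 2", "Kimi K2.5"))
    for model in NEXTJS_EVALS:
        print(f"{model}: +{docs_uplift(model)} pts from AGENTS.md")
```

On these numbers, documentation alone is worth between 9 and 33 points depending on the model, which is the same order of magnitude as most of the gaps between models.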

For developers: Composer 2 is the better default for scaffolding, CRUD endpoints, and routine refactors, where it delivers 95% of Claude's score on the documented Next.js evals at half Sonnet 4.6's price. Claude is still worth paying for on architecture decisions, concurrency bugs, and anything safety-critical, where the 5-point gap is the difference between shipping and shipping a regression. Pick per task, not per project.

From open-weight base to production agent

Cursor didn't just wrap Kimi K2.5. It put an estimated 3–4× the base model's compute into continued pre-training and RL.

Aman Sanger's tweet disclosed the key details: Cursor evaluated multiple base models on perplexity, chose K2.5, then applied "continued pre-training and high-compute RL (a 4× scale-up)."

K2.5 is a substantial release, not an incremental update: a full re-pretrain with expanded context (tech report).

Moonshot trained K2.5 for ~$8.8M. Cursor then applied its "4× scale-up," which lands at ~$26M of incremental spend if the 4× includes the base, or ~$35M if it comes on top of the base. Total Composer 2: $35–44M. FLOP-to-dollar derivation in methodology →

Stage	Compute	Est. cost	How derived	Who paid
K2.5 pre-training	3.84×10²⁴ FLOPs	~$8.8M	Operation counting from tech report: 8 × 32B × 15T	Moonshot AI
Cursor CPT + RL (applied to K2.5)	~1.2–1.5×10²⁵ FLOPs ≈ 3–4× the K2.5 base	~$26–35M	Sanger's "4×" claim applied to the $8.8M base. Reading B: 4× includes the base → Cursor adds ~$26M. Reading A: 4× on top of the base → Cursor adds ~$35M.	Cursor via Fireworks AI
Total Composer 2	~1.5–1.9×10²⁵ FLOPs ≈ 4–5× the K2.5 base	~$35–44M	$8.8M base + Cursor's scale-up. Reading B: $8.8M + $26M = $35M. Reading A: $8.8M + $35M = $44M.	Moonshot ~20–25% / Cursor ~75–80%

For context, Composer 2's $35–44M sits below Llama 3.1 405B (~$53M est.), an order of magnitude below GPT-4.5 (~$340M est.) and Grok-4 (~$388M est.), and above DeepSeek V3 ($5.6M reported). Final-run costs only; see breakdown →.

The base is substitutable in theory, but switching means re-running the $26–35M RL pipeline on a new foundation with no guarantee the recipe transfers. Interchangeability is real but expensive.

With confirmed benchmarks for all three stages (K2, K2.5, and Composer 2), we can measure what each step contributed:

Each step in the pipeline adds measurable performance

Confirmed scores only. K2 → K2.5 (Moonshot's multimodal retrain) → Composer 2 (Cursor's RL). Opus 4.6 for reference.

K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).

Does Composer 2 pay for itself?

At $2B+ ARR, Cursor's gross margin hinges on the model mix. Users pick (or Auto picks for them); the share running through Composer 2 is the margin lever.

Cursor reportedly surpassed $2B in ARR in early 2026. Users can pick any model inside Cursor (Claude, GPT, Composer 2, Gemini), but Cursor's cost per token varies sharply by choice: Composer 2 runs through Fireworks at a fraction of what Cursor pays Anthropic for Claude. If inference consumes 50% of revenue (a common ratio for AI-native products), that's a ~$1B/year line item whose breakdown by model decides margin.

At fast-tier pricing of $1.50/$7.50 (the default), Composer 2's per-token cost is half of Sonnet 4.6's. The mix scenarios below show how that gap compounds:

How model mix would affect margin (thought experiment)

Illustrative scenarios at $2B ARR. Sonnet 4.6 stands in for any premium third-party model Cursor passes through; Composer 2 fast for Cursor-served traffic. Dashed line = 70% SaaS benchmark.

Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.

Cursor's long-term gross margin is now a single curve: how much of its $2B in usage runs through Composer 2 instead of third-party models like Sonnet, Opus, or GPT-5.4. Every percentage point of share that shifts to Composer 2 fast cuts the per-token cost on that share by roughly half. 50% adoption gets margin to ~63%; 80% touches the 70% SaaS benchmark. The strategic bet isn't on the cost of one model. It's on user behavior bending toward the model Cursor controls.
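One way to reproduce those margin figures is the toy model below. It assumes that inference would consume 50% of revenue if all traffic ran on a Sonnet-priced model, and that any traffic shifted to Composer 2 fast costs half as much. Both parameters, and the function name, are illustrative assumptions rather than disclosed Cursor figures.

```python
def gross_margin(
    composer_share: float,
    arr: float = 2.0e9,
    premium_inference_ratio: float = 0.50,
    composer_cost_factor: float = 0.50,
) -> float:
    """Gross margin as a function of the share of usage served by Composer 2.

    premium_inference_ratio: assumed inference cost as a fraction of revenue
        when every request goes to a premium third-party model.
    composer_cost_factor: assumed cost of Composer 2 fast relative to that
        premium model (half, per the list prices quoted above).
    """
    if not 0.0 <= composer_share <= 1.0:
        raise ValueError("composer_share must be between 0 and 1")
    all_premium_cost = arr * premium_inference_ratio
    blended_cost = all_premium_cost * (
        (1.0 - composer_share) + composer_share * composer_cost_factor
    )
    return 1.0 - blended_cost / arr


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.50, 0.80, 1.0):
        print(f"{share:.0%} on Composer 2 -> {gross_margin(share):.1%} gross margin")
```

Under these assumptions the curve is linear: 0% adoption gives a 50% margin, 50% gives 62.5%, 80% gives 70%, and even full adoption tops out at 75%.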

We don't know Cursor's actual cost structure. These estimates assume 50% of $2B ARR goes to inference, a common ratio for AI-native products but unverified for Cursor. Fireworks' margin (estimated 30–50% gross) is also embedded in Composer 2's pricing.

Who captures the value?

Moonshot built the model. Cursor built the product. The value-capture gap runs the wrong direction.

Two companies, two roles in the same product:

	Moonshot AI (built Kimi K2.5)	Cursor (built Composer 2 on top)
Valuation	~$18B (target, Series D)	$29.3B
ARR / Revenue	~$500M (est.); 20 days post-K2.5 > all of 2025	$2B+; doubling every 2–3 months
K2.5 investment	~$8.8M; pre-trained the base model	~$26–35M; CPT + RL on top of K2.5

Sources: Moonshot: 36Kr, KR-Asia, TechCrunch · Cursor: TechCrunch, Stripe, DevGraphiq. Moonshot revenue is approximate. Cursor headcount estimated.

At first glance, this looks like Cursor is winning. But look at who moves the needle on capabilities:

Benchmark	Moonshot K2 → K2.5 ($8.8M)	Cursor K2.5 → Composer 2 ($26–35M)
Terminal-Bench 2.0	+23.0 pts	+10.9 pts
SWE-bench Multilingual	+25.7 pts	+0.7 pts
LiveCodeBench v6	+31.3 pts	— (no data)
Avg gain per benchmark	+26.7 pts	+5.8 pts
Cost per point gained	$330K / pt	$4.5–6.0M / pt

Moonshot is 14–18× more efficient at producing benchmark gains. Some of that is diminishing returns (47→73 is easier than 73→74), but it also reflects the fundamental asymmetry: base model training is the hard, underpaid work.
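The cost-per-point figures can be rechecked with a few lines of arithmetic. The sketch below uses only the gains and cost estimates from the table above; note that Cursor's average covers two benchmarks and Moonshot's covers three, because no LiveCodeBench v6 score is available for Composer 2.

```python
def average_gain(gains: list[float]) -> float:
    """Mean benchmark gain in points."""
    return sum(gains) / len(gains)


def cost_per_point(cost_usd: float, gains: list[float]) -> float:
    """Dollars of training spend per average benchmark point gained."""
    return cost_usd / average_gain(gains)


# Gains in points, from the table above.
MOONSHOT_GAINS = [23.0, 25.7, 31.3]  # K2 -> K2.5
CURSOR_GAINS = [10.9, 0.7]           # K2.5 -> Composer 2 (no LiveCodeBench data)

moonshot = cost_per_point(8.8e6, MOONSHOT_GAINS)
cursor_low = cost_per_point(26e6, CURSOR_GAINS)   # Reading B estimate
cursor_high = cost_per_point(35e6, CURSOR_GAINS)  # Reading A estimate

if __name__ == "__main__":
    print(f"Moonshot: ${moonshot / 1e3:.0f}K per point")
    print(f"Cursor:   ${cursor_low / 1e6:.1f}M to ${cursor_high / 1e6:.1f}M per point")
    print(f"Ratio:    {cursor_low / moonshot:.1f}x to {cursor_high / moonshot:.1f}x")
```

The ratio is sensitive to the small denominator on Cursor's side: a single additional benchmark with a large gain would move it substantially.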

Both companies are winning, on different units. Moonshot's open-weight release of K2.5 triggered a 20-day revenue surge that exceeded all of 2025. Open-weighting was the distribution strategy, not charity. Cursor turned $26–35M of RL into $2B+ of ARR; the wrapper captures the recurring rent. The 14–18× efficiency advantage and the 4× revenue advantage are both real; they sit at different layers of the stack.

The durable question is which layer defends its share. Nathan Lambert (Interconnects) frames the squeeze on Cursor's side: "post-training got more popular because there was more low-hanging fruit. A lot of that potential has been realized." If the wrapper is a thinning layer (and Cursor's +5.8-point average gain vs Moonshot's +26.7 suggests it is), the rent defense gets harder over time. The pressure from Moonshot's side is the inverse: if labs move up the stack and ship their own application surface (Moonshot already has the Kimi consumer app), the wrapper's distribution moat narrows. Whether Cursor remains the best at picking open-weight bases and bolting RL on top, or whether the labs reclaim the user, is the bet investors are pricing.

Who's exposed?

The model layer is becoming a commodity. The question is who captures the value that used to sit there.

Who	Implication	Signal
Cursor / AI-native apps	Margin expansion, reduced vendor lock-in, model optionality	Positive
Inference providers (Fireworks, Together)	Growing demand as apps shift from API to hosted open-weight	Positive
Anthropic / OpenAI API revenue	Revenue concentration risk if top customers can switch at will	Watch
Open-weight labs (Moonshot, DeepSeek)	Ecosystem adoption, but limited direct monetization	Mixed

The Cursor switch isn't an isolated event. It's the first high-profile instance of a pattern that will repeat: AI-native companies evaluating open-weight bases, applying proprietary fine-tuning, and serving through specialized inference providers, cutting the frontier lab's API out of the loop.

The question for Anthropic and OpenAI isn't whether their models are better. On most benchmarks, they still are. The question is whether "better" justifies 2–5× the price, and for how much longer.

What third parties confirm

Most numbers in this piece come from Cursor, Moonshot, or Fireworks. Here's the subset that independent sources back up.

Signal	Data	Independent source
Developer adoption	Cursor at 18% usage (vs Copilot ~42%)	Stack Overflow 2025, JetBrains 2025
Revenue trajectory	$100M → $500M → $1B → $2B+ ARR (Jan '25 → Feb '26)	Stripe case study, TechCrunch
Enterprise usage	Salesforce: 20K engineers, >90% usage rate	Pragmatic Engineer
Web traffic	cursor.com: #14 in AI tools, #3,004 globally (Oct 2025)	SimilarWeb

What we can't measure, but can infer: Cursor doesn't publish the model mix, but the pricing structure tilts it sharply toward Composer 2. Auto mode and Composer 2 are unmetered on paid plans; Claude and GPT draw from a separate API credit pool that runs down. Most users default to the unmetered path, and developer surveys consistently describe Claude as the escape hatch for hard tasks, not the everyday default. The exact ratio is private; the direction is determined by pricing, and it's the lever behind the margin math in "Does Composer 2 pay for itself?"

Sources & Methodology

All sources are linked inline where they're cited. Aggregated below for quick reference; the full training-cost derivation follows.

Cursor Changelog 2.0 · HN discussion · Sanger (4× scale-up tweet) · The Decoder · VentureBeat · K2.5 model card · K2.5 tech report · K2 tech report · Anthropic pricing · Cursor pricing · Fireworks Kimi page · DeepSeek pricing · OpenAI SWE-bench retirement · CursorBench · Terminal-Bench · Next.js Evals · TechCrunch ($2B ARR) · Stripe case study · Stack Overflow 2025 · Pragmatic Engineer · DeepSeek V3 cost breakdown · Epoch FLOP estimation

Training cost methodology: FLOP → GPU-hour → dollar

Step 1: Estimate FLOPs from the paper
We use operation counting: C = 8 × N_active × D
• N_active = 32B (paper: "32 billion activated parameters")
• D = 15T tokens (paper: "approximately 15 trillion mixed visual and text tokens")
• 8× multiplier (vs standard 6×) because the paper reports activation checkpointing
• Result: 3.84×10²⁴ FLOPs
• Confidence: Epoch-style "confident" (within 3×). Both N_active and D are explicitly stated with high confidence.

Step 2: Convert FLOPs to GPU-hours
GPU-hours = FLOPs / (peak_FLOPS × MFU × 3600)
• H800 SXM5 peak BF16: 989 TFLOP/s (spec sheet; our model uses 693 TFLOP/s = 70% of H100's 990, a conservative approximation)
• MFU assumption: 35%. Realistic range for large MoE training on H800: 30–45% BF16-equivalent. For reference: DeepSeek V3 achieved ~40% BF16-equivalent MFU in FP8 (analysis). ByteDance's MegaScale-MoE reports up to 47% on Hopper for dense models, lower for MoE (paper).
• K2.5's "90% multimodal training efficiency" is relative to their text-only throughput, not 90% absolute MFU.
• Result: ~4.4M GPU-hours

Step 3: Convert GPU-hours to dollars
Cost = GPU-hours × $/GPU-hour
• Rate: $2.00/GPU-hour, the same assumption as DeepSeek V3's official cost breakdown. This represents amortized cost of owned hardware, not cloud rental. For context: H800 purchase price in China was RMB 190–240K (~$26–33K); at $2/hr a GPU pays for itself in ~13–16K hours (roughly 1.5–2 years of continuous use). Cloud rental would be significantly higher.
• Note: Tom Goldstein (UMD) has questioned whether $2/hr is realistic: "I dunno about that." Epoch AI uses $1.90–2.20/hr (log-normal, 90% CI).
• Result: ~$8.8M

Sensitivity analysis:

MFU	GPU-hours	Cost @$2/hr
25%	6.2M	$12.3M
30%	5.1M	$10.3M
35% (our est.)	4.4M	$8.8M
40%	3.8M	$7.7M
45%	3.4M	$6.8M
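The three steps and the sensitivity table can be reproduced with the sketch below. It uses the values stated above (8 × N_active × D, 693 TFLOP/s effective peak, 35% MFU, $2.00 per GPU-hour); the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def training_flops(active_params: float, tokens: float, multiplier: float = 8.0) -> float:
    """Step 1: operation counting, C = multiplier * N_active * D."""
    return multiplier * active_params * tokens


def gpu_hours(flops: float, peak_flops_per_s: float = 693e12, mfu: float = 0.35) -> float:
    """Step 2: GPU-hours = FLOPs / (peak_FLOPS * MFU * 3600)."""
    return flops / (peak_flops_per_s * mfu * SECONDS_PER_HOUR)


def training_cost(hours: float, usd_per_gpu_hour: float = 2.0) -> float:
    """Step 3: cost = GPU-hours * $/GPU-hour."""
    return hours * usd_per_gpu_hour


def sensitivity(flops: float, mfus=(0.25, 0.30, 0.35, 0.40, 0.45)):
    """Return (MFU, GPU-hours, cost) rows for a range of MFU assumptions."""
    rows = []
    for mfu in mfus:
        hours = gpu_hours(flops, mfu=mfu)
        rows.append((mfu, hours, training_cost(hours)))
    return rows


if __name__ == "__main__":
    k25_flops = training_flops(active_params=32e9, tokens=15e12)
    print(f"K2.5 pre-training: {k25_flops:.2e} FLOPs")
    for mfu, hours, cost in sensitivity(k25_flops):
        print(f"MFU {mfu:.0%}: {hours / 1e6:.1f}M GPU-hours, ${cost / 1e6:.1f}M")
```

Because cost scales inversely with MFU and linearly with the hourly rate, the MFU assumption dominates the spread in the table.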

Cross-check: DeepSeek V3
DeepSeek reported 2,788K GPU-hours at $5.576M for 14.8T tokens on 2,048 H800s. Our methodology applied to DeepSeek V3 (using our K2 estimate of 2.98e24 FLOPs) yields ~$6.8M, within 22% of their self-reported figure. The gap comes from DeepSeek's FP8 precision (higher effective throughput) vs our BF16 assumption.
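The cross-check arithmetic can be sketched the same way. The block below takes the 2.98e24 FLOP figure and DeepSeek's reported GPU-hours exactly as quoted above, and applies the same 693 TFLOP/s, 35% MFU, and $2.00/GPU-hour assumptions used for K2.5.

```python
REPORTED_GPU_HOURS = 2_788_000   # DeepSeek V3, self-reported
USD_PER_GPU_HOUR = 2.0           # rate used in DeepSeek's own breakdown


def estimated_cost(
    flops: float,
    peak_flops_per_s: float = 693e12,
    mfu: float = 0.35,
    usd_per_gpu_hour: float = USD_PER_GPU_HOUR,
) -> float:
    """FLOPs -> GPU-hours -> dollars, using the article's default assumptions."""
    hours = flops / (peak_flops_per_s * mfu * 3600)
    return hours * usd_per_gpu_hour


reported_cost = REPORTED_GPU_HOURS * USD_PER_GPU_HOUR
estimate = estimated_cost(2.98e24)  # FLOP figure as quoted in the text
relative_gap = estimate / reported_cost - 1.0

if __name__ == "__main__":
    print(f"Reported:  ${reported_cost / 1e6:.3f}M")
    print(f"Estimated: ${estimate / 1e6:.1f}M")
    print(f"Gap:       {relative_gap:.0%}")
```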

What these costs exclude:
Following DeepSeek's methodology, these figures cover only the final training run. They exclude:
• Research experiments, architecture search, ablation studies
• Data curation and synthetic data generation
• Failed runs and restarts
• Engineering staff costs
Moonshot CEO has stated: "It is hard to quantify the training cost because a major part is research and experiments."

The "4× scale-up" interpretation:
From Aman Sanger's tweet: "continued pre-training and high-compute RL (a 4× scale-up)."
• Reading A ("4× on top"): Cursor's CPT+RL = 4× the base cost. Total = $8.8M + $35.2M = ~$44M. Base is ~20% of total.
• Reading B ("4× total"): Total compute = 4× the base. Total = 4 × $8.8M = ~$35M, of which Cursor adds ~$26M. Base is ~25% of total.
• We present both; press reports that the base accounts for "about a quarter" of the total favor Reading B.
• This is the final-run cost only. With Sanger noting they've "trained about 50 models," total R&D investment is likely multiples higher.
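The two readings differ only in whether the 4× multiplier counts the base run. A minimal sketch, assuming the $8.8M base estimate and treating cost as proportional to compute (the class and function names are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSplit:
    base: float    # Moonshot's pre-training cost, USD
    cursor: float  # Cursor's incremental CPT + RL cost, USD

    @property
    def total(self) -> float:
        return self.base + self.cursor

    @property
    def base_share(self) -> float:
        return self.base / self.total


def reading_a(base: float, scale: float = 4.0) -> CostSplit:
    """'4x on top': Cursor spends scale * base in addition to the base."""
    return CostSplit(base=base, cursor=scale * base)


def reading_b(base: float, scale: float = 4.0) -> CostSplit:
    """'4x total': base plus Cursor's spend together equal scale * base."""
    return CostSplit(base=base, cursor=(scale - 1.0) * base)


if __name__ == "__main__":
    for name, split in (("A", reading_a(8.8e6)), ("B", reading_b(8.8e6))):
        print(
            f"Reading {name}: Cursor ${split.cursor / 1e6:.1f}M, "
            f"total ${split.total / 1e6:.1f}M, base share {split.base_share:.0%}"
        )
```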

How CursorBench works (and why we still cite it)

CursorBench sources tasks from real developer sessions via "Cursor Blame," which traces committed code back to the agent request that produced it. Tasks average 352 lines across 8 files, substantially larger than SWE-bench. Grading uses agentic judges and is cross-validated against live traffic metrics. The methodology is best-in-class for measuring whether a model helps developers in an IDE. The problem is verifiability: only Cursor can run it.

CursorBench is not public. A Cursor team member confirmed: "Unfortunately not, as we used our own internal code for the benchmark." Different evaluation harnesses were used per model, so cross-model comparisons are not apples-to-apples. Academic research warns that proprietary benchmarks "shift epistemic authority to the curator."

Composer 2 in context of other frontier models

Model	Reported / Est. cost	Source
DeepSeek V3	$5.6M (reported)	arXiv 2412.19437
Composer 2 (total)	~$35–44M (est.)	Our estimate
Llama 3.1 405B	~$53M (est.)	Our estimate
GPT-4.5	~$340M (est.)	Our estimate
Grok-4	~$388M (est.)	Our estimate

Frontier estimates are order-of-magnitude approximations from publicly available architecture details and training compute. DeepSeek V3 is self-reported.

These are final-run costs only. They exclude research experiments, data curation, failed runs, and engineering staff. Moonshot CEO has stated: "It is hard to quantify the training cost because a major part is research and experiments." Sanger notes Cursor has trained about 50 models; total R&D spend is likely multiples higher.