Is OpenRouter weekly token volume more trustworthy than MMLU benchmarks?

Benchmarks measure ceiling ability on fixed datasets; weekly token volume reflects what developers actually pay to route in production. Use both, but bill data validates who is being called at scale.

Why does Anthropic have falling token share but high revenue share?

Claude Opus and Sonnet charge far more per million tokens than DeepSeek Flash. Enterprises still pay premiums for hard reasoning, while high-volume Agent batch jobs have shifted to low-cost models.

Why have Chinese models overtaken US models on weekly OpenRouter volume?

DeepSeek, Tencent Hy3, and MiniMax combine aggressive API pricing with open or permissive licenses that fit Agent and coding workloads. For May 18–24, 2026, China routed about 9.223T weekly tokens versus 4.93T for US models.

How should Mac developers track weekly rankings and ship changes?

Bookmark openrouter.ai/rankings, set primary and fallback models with budgets in OpenClaw or Claude Code, and run Gateway plus OAuth acceptance over VNC on a remote Mac—SSH alone cannot clear macOS permission dialogs.

OpenRouter Weekly Rankings | Bill Data Wins

01 Why bill data is more honest than another benchmark chart

MMLU, HumanEval, and SWE-bench answer a narrow question: on a fixed dataset, what is the model’s ceiling? OpenRouter, one of the largest neutral LLM API aggregators—300+ models, 60+ providers, 8M+ users, roughly 100 trillion tokens per month—publishes rankings built from real routed input plus output tokens. Money and compute do not flatter vendors. Developers vote with wallets for models that are fast enough, stable enough, and cheap enough to leave running overnight.

Agent workflows changed the mix. Programming-related traffic on OpenRouter climbed from about 11% in early 2025 to more than 50% by mid-2026—now the single largest use category. That shift reframes the chart: the models topping weekly volume are not always the ones winning a +1 SWE-bench headline. OpenRouter and a16z’s 2025 AI Usage Report (built on 100T tokens of anonymized metadata) noted an uncomfortable pattern—benchmark scores and market share often move in opposite directions. Expensive flagships do not automatically capture Agent batch traffic; extreme value SKUs do.

1. Benchmarks skew toward ceilings: One-shot runs on static prompts rarely price in dozens of tool rounds and long output chains.
2. Weekly tokens track sustained demand: Five straight weeks of growth, as seen in late May 2026, signal real demand expansion rather than a launch-week spike.
3. Read two axes: Token share shows who carries traffic; dollar revenue share shows who captures margin. The “king” on each axis is often not the same company.

02 Data source and the 7-day rolling method

All figures below come from the public board at openrouter.ai/rankings, using OpenRouter’s official weekly (7-day rolling token throughput) view. Core dimensions include total weekly tokens (input + output), per-model ranks, vendor market share, and the split between token share vs dollar revenue share—the last is where pricing differences reveal a second truth beneath the headline rank.
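To compare your own logs with a 7-day rolling board, it can help to roll daily totals the same way before looking at them. The following is a minimal sketch, assuming you already export one token total per day; the function name and the sample numbers are hypothetical and are not OpenRouter's implementation.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_7day_totals(daily_tokens: Iterable[float], window: int = 7) -> Iterator[float]:
    """Yield the trailing `window`-day token total once a full window is available."""
    recent = deque(maxlen=window)  # oldest day falls off automatically
    for day_total in daily_tokens:
        recent.append(day_total)
        if len(recent) == window:
            yield sum(recent)


# Hypothetical daily totals (input + output tokens, in billions) from your own logs.
# Day 4 is a one-off spike; the rolling sum absorbs it instead of treating it as a trend.
daily = [310, 295, 330, 900, 320, 305, 315, 300, 298]
weekly = list(rolling_7day_totals(daily))
print(weekly)
```

Each printed value covers seven consecutive days, so a single-day peak shifts every window that contains it by the same amount rather than creating a new "record week" on its own.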

Snapshot week: May 18–24, 2026 (the latest complete week shown when this article was drafted). If you read this later, pull live numbers; the workflow still applies.

Scale check: Roughly one year earlier, OpenRouter processed on the order of 2.4 trillion tokens per week. At 28.9 trillion for the May snapshot, that is about 12× year-over-year growth—AI usage has moved from experimentation to sustained, billable throughput.

03 Global weekly totals: 28.9T tokens, fifth straight gain

Metric	Value	Week-over-week
Global weekly volume	28.9T tokens	+7.4% (5th consecutive weekly rise)
China-origin models	9.223T tokens	+19.89%
US-origin models	4.93T tokens	+16.27%
Geopolitical note: Chinese models have led US models on weekly tokens for four straight weeks.
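The ratios implied by these totals are easy to re-derive. The sketch below uses only figures quoted in this article; the variable and function names are illustrative, and the prior-week values are back-calculated from the stated week-over-week changes rather than read from the board.

```python
# Figures quoted in this article (trillions of tokens per week).
GLOBAL_NOW = 28.9
GLOBAL_YEAR_AGO = 2.4
CHINA_NOW, CHINA_WOW = 9.223, 0.1989
US_NOW, US_WOW = 4.93, 0.1627


def prior_week(current: float, wow: float) -> float:
    """Back out last week's volume from this week's value and its WoW change."""
    return current / (1.0 + wow)


yoy_multiple = GLOBAL_NOW / GLOBAL_YEAR_AGO   # year-over-year growth multiple
china_to_us = CHINA_NOW / US_NOW              # China volume relative to US volume
china_prev = prior_week(CHINA_NOW, CHINA_WOW)
us_prev = prior_week(US_NOW, US_WOW)

print(f"YoY multiple: {yoy_multiple:.1f}x")
print(f"China / US:   {china_to_us:.2f}x")
print(f"Implied prior week: China {china_prev:.2f}T, US {us_prev:.2f}T")
```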

Pain points when reading the weekly board:

1. Confusing daily spikes with the weekly roll: The ranking is a 7-day window, so do not mix in single-day peaks from your own logs.
2. Ignoring “everything else”: Beyond China and the US, European open-weight stacks and Stealth models still matter; compare vendor pies on the site, not just this table.
3. Deciding on months-old data: Hy3 Preview and Owl Alpha can post double-digit weekly deltas; routing policies should refresh weekly, not quarterly.
4. Equating rank #1 with universal default: Top models are usually “ultra-low price × ultra-high throughput.” They are brilliant for Agent loops but not automatic choices for final legal review or multimodal precision work.

04 Model Top 10 for the week ending May 24, 2026

Rank	Model	Vendor	Weekly tokens	WoW	Role
1	DeepSeek-V4-Flash	DeepSeek (China)	3.43T	+66%	Default Agent brain; aggressive pricing
2	Tencent Hy3 Preview	Tencent (China)	3.07T	+16%	Still growing after free tier ended
3	Claude Sonnet 4.6	Anthropic (US)	1.35T	—	Enterprise coding default; 1M context (beta)
4	DeepSeek-V3.2	DeepSeek (China)	1.31T	—	Low-cost long tail; roleplay still active
5	Owl Alpha	OpenRouter (stealth)	1.15T	+29%	Free Agent specialist, ~1M context
6	Gemini 3 Flash Preview	Google (US)	1.06T	—	Multimodal; academic and clinical mixes
7	DeepSeek-V4-Pro	DeepSeek (China)	1.00T	—	Matrix flagship for hard reasoning
8	MiniMax M2.7	MiniMax (China)	806B	—	Long-context value play
9	Grok 4.1 Fast	xAI (US)	721B	—	2M context; strong on legal workloads
10	Step 3.5 Flash	StepFun (China)	673B	—	Batch-friendly speed tier

Source notes: Ranks 1–2 and 5 weekly tokens plus week-over-week changes come from National Business Daily reporting on OpenRouter data for May 18–24, 2026. Ranks 3–4, 6, and 8–10 volumes cross-check the same-week public OpenRouter leaderboard and industry deep-reads. DeepSeek-V4-Pro at 1.00T is derived from the 5.74T series total minus V4-Flash (3.43T) and V3.2 (1.31T). Kimi K2.6, sixth the prior week, dropped out of the top ten and is omitted here.
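The derived DeepSeek-V4-Pro figure can be checked with a few lines of arithmetic. This sketch uses only the numbers in the source notes above and `Decimal` to avoid floating-point noise in the subtraction; the variable names are illustrative.

```python
from decimal import Decimal

# Weekly tokens in trillions, as quoted in the source notes.
series_total = Decimal("5.74")
known = {
    "DeepSeek-V4-Flash": Decimal("3.43"),
    "DeepSeek-V3.2": Decimal("1.31"),
}

# Whatever the series total does not attribute to Flash and V3.2 is assigned to V4-Pro.
v4_pro = series_total - sum(known.values())
print(f"DeepSeek-V4-Pro (derived): {v4_pro}T")
```

Because the value is a residual, any rounding in the three inputs lands entirely on the V4-Pro estimate.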

The DeepSeek matrix, not a single hit

DeepSeek placed V4-Flash, V4-Pro, and V3.2 inside the top ten simultaneously, at ranks 1, 7, and 4. Combined series volume hit about 5.74 trillion weekly tokens, up 25.9% week over week, the second straight week the vendor beat Anthropic and Google on aggregate throughput. Flash carries volume, Pro carries hard jobs, V3.2 catches long-tail routes. That product matrix is eating the Agent wave; it is not a one-model lottery win.

05 Vendor landscape: token volume vs dollar revenue

How fast Chinese models climbed the board

Period	China models on OpenRouter (approx. share or milestone)
Early 2025	< 2%
February 2026	First week China exceeded US on tokens
May 2026	~45%+; fourth consecutive week ahead of US

Anthropic’s premium paradox

Anthropic’s token share sits near 12%, down from roughly 25% a year earlier, yet dollar revenue share remains near 46%. Enterprises still pay list price for Claude Opus-class reasoning on messy repos and compliance-sensitive workflows—but the token firehose of Agent batch jobs has largely moved to Flash-tier APIs. Traffic leadership has shifted to the value camp; margin pools still sit with premium closed models.
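One way to see how a 12% token share can coexist with a 46% revenue share is to compute the blended price gap the two numbers imply. This is a rough sketch under a simplifying assumption: revenue is treated as tokens times an average per-token price, with one average for the vendor and one for everyone else. The function name is hypothetical.

```python
def implied_price_ratio(token_share: float, revenue_share: float) -> float:
    """Blended price per token of one vendor relative to everyone else combined.

    revenue_share / token_share is the vendor's price relative to the platform
    average; the same quotient for the remainder describes the rest of the market.
    """
    if not (0 < token_share < 1 and 0 < revenue_share < 1):
        raise ValueError("shares must be strictly between 0 and 1")
    vendor = revenue_share / token_share
    rest = (1 - revenue_share) / (1 - token_share)
    return vendor / rest


# Shares quoted above: ~12% of tokens, ~46% of dollars.
ratio = implied_price_ratio(0.12, 0.46)
print(f"Implied blended price vs rest of market: {ratio:.1f}x")
```

With the shares quoted above, the implied gap is roughly 6x per token, which is the arithmetic behind "few tokens, high revenue."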

Market tiers (routing matrix)

Tier	Examples	Weekly board signature	Best fit
High value, low volume	Claude Opus	Few tokens, high revenue	Complex reasoning, regulated workflows
Mid value, steady volume	Gemini 3 Flash	Stable multimodal growth	Research, clinical, mixed media
Ultra-low cost, high volume	DeepSeek / Hy3 / MiniMax / StepFun	Top-of-chart dominance	Agents, coding, batch automation

06 Benchmarks and market share: the inversion

While every +1 on SWE-bench earns a blog post, production routers quietly steer bulk traffic toward stacks priced near $0.10 / $0.40 per million input/output tokens. The mechanism is straightforward:

1. Unit cost beats ceiling scores: In multi-turn Agents, output tokens dominate the invoice, so developers optimize for SLA and $/M, not bragging rights on static evals.
2. Stability beats one brilliant answer: Tool-call failure rate and p95 latency matter more than an occasional wow moment.
3. Code is the main battlefield: With programming past 50% of OpenRouter traffic, the chart leaders are models that write, edit, and run tests, not chat generalists.
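The unit-cost argument above can be made concrete with a per-task estimate. In this sketch the value-tier price is the $0.10 / $0.40 per million tokens quoted in this section; the premium price, the round count, and the per-round token counts are hypothetical placeholders, so substitute your own measurements before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def task_cost(price: Price, rounds: int, input_per_round: int, output_per_round: int) -> float:
    """Estimated USD cost of one Agent task made of `rounds` tool rounds."""
    input_cost = rounds * input_per_round / 1_000_000 * price.input_per_m
    output_cost = rounds * output_per_round / 1_000_000 * price.output_per_m
    return input_cost + output_cost


VALUE_TIER = Price(input_per_m=0.10, output_per_m=0.40)             # quoted in the article
HYPOTHETICAL_PREMIUM = Price(input_per_m=3.00, output_per_m=15.00)  # placeholder, not a list price

# Hypothetical workload: 50 tool rounds, 4,000 input and 2,000 output tokens per round.
value_cost = task_cost(VALUE_TIER, 50, 4_000, 2_000)
premium_cost = task_cost(HYPOTHETICAL_PREMIUM, 50, 4_000, 2_000)
print(f"value tier: ${value_cost:.2f} per task, premium: ${premium_cost:.2f} per task")
```

Multiplying either figure by the number of tasks in an overnight batch shows why routing policy, not benchmark rank, tends to decide which model carries bulk traffic.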

Citable fact: DeepSeek-V4-Flash posted +66% weekly growth in a week without a fresh “new SOTA benchmark” marketing push. The bill moved first; the press release followed later. That is the honest signal weekly rankings provide.

07 Why a weekly public leaderboard matters now

Investors use aggregator throughput to gauge AI commercialization (platform valuations often trade on usage multiples). Developers use it as a vendor-neutral routing reference. Researchers track geopolitical and architectural shifts—MoE, million-token context, Stealth free tiers. Media narratives about “who is winning AI” increasingly cite token volume, not parameter counts on a slide.

Weekly token data has graduated from a niche metric to a commercial weather report: updated every seven days, free to read, yet rarely wired into individual Mac developer workflows. If you run Claude Code or OpenClaw, treating the board like a stock watchlist—check Monday, adjust routes Tuesday—is cheaper than discovering a model shift only when finance forwards the OpenRouter PDF.

08 Mac developers: five steps to track weekly rankings and ship routing

1. Watch bills, not keynotes: Bookmark Rankings; each Monday log the Top 3 models’ week-over-week deltas and compare them to your own OpenRouter usage. Divergence is an early warning.
2. Route by scenario: Agent and batch loops → DeepSeek-V4-Flash; enterprise hard reasoning → Claude Opus; multimodal mixes → Gemini 3 Flash. Keep Sonnet 4.6 as a balanced production fallback.
3. Track fast climbers: Hy3 Preview and Owl Alpha’s double-digit weekly gains often preview next quarter’s default “spare brain” before your team formally evaluates them.
4. Set budgets and downgrade paths: In OpenClaw or Claude Code, configure primary, fallback, and escalation models plus per-task token caps so Opus never accidentally eats a batch job.
5. Accept on macOS with a GUI: After changing routes, re-run Gateway health checks, OAuth, and Keychain prompts. SSH alone cannot click system authorization dialogs. Budget 20 minutes on a VNC remote Mac for acceptance (see our OpenClaw rental guide).
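Steps 2 and 4 can be sketched as a small routing table with per-task caps. This is an illustration only: the `Route` class, `pick_model`, the scenario keys, and the cap values are hypothetical, and this is not the configuration format of OpenClaw or Claude Code. The model names are the ones used in this article, not verified API identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class BudgetExceeded(RuntimeError):
    """Raised when a task has used up its per-task token cap."""


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str
    escalation: Optional[str]
    max_tokens_per_task: int  # hypothetical cap; tune to your own budget


ROUTES: Dict[str, Route] = {
    "agent_batch": Route("DeepSeek-V4-Flash", "Claude Sonnet 4.6", "DeepSeek-V4-Pro", 2_000_000),
    "hard_reasoning": Route("Claude Opus", "Claude Sonnet 4.6", None, 500_000),
    "multimodal": Route("Gemini 3 Flash", "Claude Sonnet 4.6", None, 1_000_000),
}


def pick_model(scenario: str, tokens_used: int,
               primary_healthy: bool = True, escalate: bool = False) -> str:
    """Choose a model for the next call, enforcing the per-task token cap first."""
    route = ROUTES[scenario]
    if tokens_used >= route.max_tokens_per_task:
        raise BudgetExceeded(f"{scenario}: cap of {route.max_tokens_per_task} tokens reached")
    if escalate and route.escalation is not None:
        return route.escalation
    return route.primary if primary_healthy else route.fallback


print(pick_model("agent_batch", tokens_used=120_000))
print(pick_model("agent_batch", tokens_used=120_000, primary_healthy=False))
```

Escalation is an explicit flag rather than an automatic step, so a batch job cannot drift onto a more expensive model without someone asking for it.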

Weekly ops checklist: (1) Rankings URL bookmarked; (2) primary / fallback / escalation model names documented; (3) last week’s total tokens and estimated USD; (4) Agent task failure rate; (5) VNC screenshot of Gateway HTTP 200 self-check. All five mean you turned chart awareness into shipped config—not Slack commentary.
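The Monday comparison from step 1 can be reduced to a small divergence check. In this sketch the board deltas are the week-over-week figures from the Top 10 table; the usage numbers, the 25-point threshold, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def wow_delta(current: float, previous: float) -> float:
    """Week-over-week change as a fraction (0.10 means +10%)."""
    if previous <= 0:
        raise ValueError("previous week's volume must be positive")
    return (current - previous) / previous


def divergence_warnings(board_wow: Dict[str, float],
                        own_usage: Dict[str, Tuple[int, int]],
                        threshold: float = 0.25) -> List[Tuple[str, float]]:
    """Flag models whose board growth and your own usage growth differ by >= threshold."""
    warnings = []
    for model, board_delta in board_wow.items():
        if model not in own_usage:
            continue  # you do not route to this model yet
        previous, current = own_usage[model]
        gap = board_delta - wow_delta(current, previous)
        if abs(gap) >= threshold:
            warnings.append((model, round(gap, 4)))
    return warnings


# Board deltas from the Top 10 table; own usage is (last week, this week) in tokens.
board = {"DeepSeek-V4-Flash": 0.66, "Tencent Hy3 Preview": 0.16}
mine = {"DeepSeek-V4-Flash": (100_000_000, 110_000_000),
        "Tencent Hy3 Preview": (50_000_000, 56_000_000)}
warnings = divergence_warnings(board, mine)
print(warnings)
```

A flagged model is a prompt to look at routing, not a verdict: the board may be moving because of workloads that do not resemble yours.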

In 2026, the market votes with 28.9 trillion tokens per week, not press-release adjectives. The developers who win are not always on the highest benchmark; they are on the model that survives fifty tool rounds without blowing the sprint budget.

Frequently asked questions

Is OpenRouter weekly token volume more trustworthy than MMLU benchmarks?

Benchmarks measure ceiling ability on fixed datasets. Weekly token volume shows what developers pay to route at scale. Use both, but let bill data confirm who is actually being called in production.

Why does Anthropic have falling token share but high revenue share?

Claude Opus and Sonnet list prices dwarf Flash-tier APIs. Enterprises still pay for hard reasoning, while Agent batch traffic has migrated to low-cost models. This is the premium paradox described in section 05.

Why have Chinese models overtaken US models on weekly OpenRouter volume?

DeepSeek, Tencent Hy3, and MiniMax pair aggressive API pricing with licenses that fit Agent and coding workloads. For May 18–24, 2026, China routed about 9.223T weekly tokens versus 4.93T for US models.

How should Mac developers track weekly rankings and ship changes?

Visit Rankings weekly; set primary and fallback models with budgets in OpenClaw or Claude Code; complete Gateway and OAuth acceptance over VNC on a remote Mac. See section 08 for the five-step checklist.

Closing thoughts

The May 18–24 snapshot shows the market voting with money: Chinese open-weight and value-tier APIs are reshaping global routing faster than benchmark seasons update. Who gets called at scale matters more than who scores highest on a static eval—and weekly tokens grew roughly 12× in a year, so treating the board as a routine ops input is rational, not obsessive.

For Mac developers, the hidden tax is often not the API rate—it is a closed laptop killing your gateway, Keychain blocking headless SSH, and OAuth flows that need a real screen. You can pick the right model from the weekly chart and still lose a day if OpenClaw never passes acceptance on your local machine.

Before you capitalize hardware or lock in a single-vendor route, validate primary and fallback pairs on a host that stays awake with GUI access when macOS demands it. VNCMac rents physical Mac mini nodes by the month; see the remote Mac pricing page, or compare plans on the homepage first.
