Market Data 2026-06-06

OpenRouter Weekly Token Rankings:
Billing Data Does Not Lie

If you still pick default models from MMLU leaderboards while your finance team watches OpenRouter invoices climb, you are optimizing for the wrong scoreboard. OpenRouter publishes a 7-day rolling token window that reflects what production agents actually consume—not what vendors claim in launch decks. This article reads that weekly ledger: 28.9 trillion tokens of global traffic, Chinese open weights crossing 45% share, programming workloads jumping from roughly 11% to above 50% of category mix, and the Anthropic paradox where roughly 12% of tokens still produce about 46% of dollar revenue. You will get three numbered pain points, comparison tables, market tiers, hard data citations, and a six-step weekly routing validation loop you can run on a rented Mac without contaminating your laptop.


Who

Platform leads, indie agent builders, and Cursor/OpenClaw operators who route through OpenRouter and need a weekly pulse check that survives CFO scrutiny—not a quarterly benchmark blog post.

Problem

All-time cumulative rankings lag reality. Models spike and fade within days; your gateway default may be three release cycles behind what the market already bills for.

Benefit

Translate weekly token share into routing tiers, budget caps, and fallback chains—then prove choices on disposable hardware before you touch production keys.

Structure

Weekly methodology, three pain points, global snapshot tables, Anthropic revenue paradox, stratified tiers, a16z/OpenRouter inversion, and a rented-Mac validation HowTo.

01 · Why the 7-day rolling window beats cumulative hype

OpenRouter aggregates traffic from thousands of applications—IDE plugins, agent gateways, batch pipelines, and experimental chat UIs—then ranks models by tokens processed in the last seven days. That rolling window is the closest public proxy we have to a live commodity exchange for inference. Unlike vendor press releases or static benchmark tables, weekly rankings punish models that look good on paper but fail under sustained agent loops: tool timeouts, context truncation, rate-limit storms, or price shocks that send teams elsewhere overnight.

The distinction matters because model lifecycles accelerated in 2026. DeepSeek V4 Flash did not climb to the top over years; it absorbed share in weeks. Hy3 Preview and Xiaomi MiMo entered the weekly top tier almost as fast. A cumulative all-time chart would still overweight retired GPT-4 era traffic and underweight the current MoE wave. For anyone wiring Cursor Agent Skills or an OpenClaw gateway on a rented Mac, the weekly board is the signal; everything else is narrative.

OpenRouter also segments traffic by use case. The programming category is the clearest example of how fast production mix can flip: share moved from roughly 11% of weekly categorized traffic in early 2025 to more than 50% by June 2026. That is not a gradual trend line—it is agents eating the platform. When more than half of labeled invocations are code-oriented, models that only excel at short Q&A lose rank even if their marketing still leads with general knowledge scores.

Hard data (citable): OpenRouter processed approximately 28.9 trillion tokens globally in the 7-day window ending early June 2026. Chinese-origin models (DeepSeek, Tencent Hy, Xiaomi MiMo, Moonshot Kimi, and related open weights) account for more than 45% of weekly token volume on the aggregator—far above their share on Western-centric benchmark leaderboards.

02 · Three routing pain points (numbered)

1. Benchmark myopia. SWE-bench Verified and Terminal-Bench scores remain useful sanity checks, but they sample curated repos and controlled sandboxes. Weekly OpenRouter volume captures messy reality: partial files, malformed tool JSON, retry loops, and 800K-token context dumps. A model that gains two points on a leaderboard but loses rank on the weekly board is telling you where production traffic already moved. The a16z and OpenRouter joint analysis on benchmark versus market inversion documents the gap explicitly—frontier closed models often dominate eval charts while open MoE stacks capture token share through price and context length.

2. Token share is not dollar share. Anthropic illustrates the paradox cleanly in June 2026 weekly data: roughly 12% of total tokens on OpenRouter still map to about 46% of platform dollar revenue, because Claude Sonnet and Opus tiers price output at roughly $15–25 per million tokens versus about $0.40 for DeepSeek V4 Flash (some 40 to 60 times higher), while free routes like Owl Alpha bill nothing. Finance teams care about the revenue-weighted curve; engineering teams stare at token leaders. Without both lenses, you either overspend on premium models for bulk traffic or under-provision quality on tasks that need Opus-grade reasoning.

3. Local experimentation pollutes production state. Rotating five OpenRouter model IDs on the same MacBook that holds your Apple developer certificates, production AWS keys, and daily-driver OpenClaw config is how sandbox prompts leak into real channels. Weekly validation should be repeatable and isolated—same harness, clean environment, archived CSV—before you promote a routing change. That is the same discipline we apply in ds4 local DeepSeek V4 Flash tests: rent, measure, release.

03 · Weekly global snapshot

The table below summarizes platform-level metrics from the early-June 2026 rolling window. Figures are rounded from OpenRouter public stats and third-party summaries; treat them as directional for planning, not audit-grade financials.

| Metric | 7-day value | Interpretation |
|---|---|---|
| Global token volume | ~28.9T | Single-week throughput across all models and routes |
| China-origin model share | 45%+ | DeepSeek, Hy3, MiMo, Kimi, and allied open weights |
| Programming category share | 50%+ | Up from ~11%; agents dominate labeled traffic |
| Anthropic token share | ~12% | Lower than mindshare; concentrated on premium tiers |
| Anthropic revenue share (est.) | ~46% | High output pricing on Opus/Sonnet workloads |
| Free-tier model traffic | Significant minority | Owl Alpha, Nemotron free: prototype gravity wells |

Three implications follow immediately. First, any routing policy that ignores Chinese open MoE defaults is fighting the largest bloc of weekly traffic (45%+ of tokens). Second, coding agents are the default workload: models weak on tool calling or long-context code bleed rank fast. Third, premium Western APIs still dominate by revenue even when they lose the token popularity contest, so budget caps must be explicit, not inferred from leaderboard position alone.

04 · June 2026 weekly model leaders

Weekly leaders differ from all-time cumulative heroes. The shortlist below reflects 7-day token volume as of early June 2026, not lifetime totals. Volumes are approximate trillions (T) per week.

| Rank | Model | Weekly tokens | Vendor | Weekly role |
|---|---|---|---|---|
| 1 | DeepSeek V4 Flash | ~3.14T | DeepSeek | Default MoE workhorse; 1M context; agent-friendly pricing |
| 2 | Hy3 Preview | ~2.75T | Tencent | Open MoE; efficiency-focused STEM and coding agents |
| 3 | Xiaomi MiMo | ~2.1T (est.) | Xiaomi | Rising open stack; strong weekly momentum in APAC routes |
| 4 | Claude Sonnet 4.6 | ~1.8T (est.) | Anthropic | Premium daily driver; free tier still pulls volume |
| 5 | DeepSeek V4 Pro | ~1.5T (est.) | DeepSeek | Higher reasoning tier; complex agent subtasks |
| 6 | Gemini 3 Flash Preview | ~1.2T (est.) | Google | Multimodal coding agents; Google toolchain affinity |
| 7 | Claude Opus 4.7 | ~1.0T (est.) | Anthropic | Long-horizon agents; high cost per million output tokens |
| 8 | Owl Alpha | ~0.9T (est.) | OpenRouter | Free stealth route; prototype and education traffic |

DeepSeek V4 Flash at roughly 3.14T tokens per week is not a rounding error: it is about 11% of the ~28.9T platform total, the largest single-model share on the board. Hy3 at ~2.75T proves Tencent's open MoE line is not a regional side story. MiMo's presence in the weekly top tier signals that handset and consumer-electronics vendors now ship inference-grade open weights that developers route immediately through OpenRouter rather than waiting for Western API equivalents.
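The share figures can be reproduced by dividing each leader's weekly tokens by the platform total. The snippet below is a minimal sketch: `share_pct` is a hypothetical helper name, and the inputs are the rounded estimates from the leaders table, so the results are directional only.

```shell
# share_pct MODEL_TOKENS_T TOTAL_TOKENS_T -> percentage share, one decimal place.
# Hypothetical helper; inputs are the article's rounded estimates (trillions of tokens).
share_pct() {
  awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

TOTAL=28.9   # approximate 7-day platform volume, in trillions
while read -r name tokens; do
  printf '%-20s %5s%%\n' "$name" "$(share_pct "$tokens" "$TOTAL")"
done <<'EOF'
deepseek-v4-flash 3.14
hy3-preview 2.75
xiaomi-mimo 2.1
claude-sonnet-4.6 1.8
EOF
```

On these inputs the top two models together account for roughly a fifth of weekly volume, which is why a default-route decision for either one moves a visible slice of platform traffic.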

Claude Sonnet 4.6 remains entrenched because it balances quality, latency, and a usable free layer for light tasks—but its weekly token count is a fraction of V4 Flash while its per-token revenue contribution is multiples higher. That split is exactly what platform finance models must capture when setting per-team budgets.

Input/output price comparison (weekly planning)

| Model | Input ($/M tokens) | Output ($/M tokens) | Context | Weekly fit |
|---|---|---|---|---|
| DeepSeek V4 Flash | ~0.10 | ~0.40 | 1M | High-frequency agent loops, bulk coding |
| Hy3 Preview | ~0.15 (API est.) | ~0.60 (API est.) | 256K | Open MoE; private deploy mirror |
| Xiaomi MiMo | Low / self-host | Low / self-host | 256K+ | Cost labs; APAC latency experiments |
| Claude Sonnet 4.6 | ~3.00 | ~15.00 | 200K–1M | Quality gate; customer-facing drafts |
| Claude Opus 4.7 | ~5.00 | ~25.00 | 1M beta | Long autonomous tasks; vision-heavy |
| Owl Alpha | 0 | 0 | 1.05M | Non-sensitive prototypes only |
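To turn the price table into a per-call estimate, multiply prompt and completion tokens by the respective $/M rates. This is a sketch only: `task_cost` is a hypothetical helper, the prices are the approximate list figures above, and the 50,000 / 2,000 token workload is an assumed example, not a measured one.

```shell
# task_cost IN_TOKENS OUT_TOKENS IN_PRICE OUT_PRICE -> USD (prices in $ per million tokens).
# Hypothetical helper; prices are the approximate figures from the table above.
task_cost() {
  awk -v it="$1" -v ot="$2" -v ip="$3" -v op="$4" \
    'BEGIN { printf "%.4f\n", (it * ip + ot * op) / 1e6 }'
}

# Assumed workload: one agent step with 50,000 prompt tokens and 2,000 completion tokens.
echo "v4-flash:   \$$(task_cost 50000 2000 0.10 0.40)"
echo "sonnet-4.6: \$$(task_cost 50000 2000 3.00 15.00)"
echo "opus-4.7:   \$$(task_cost 50000 2000 5.00 25.00)"
```

For this assumed step the estimate is about $0.0058 on V4 Flash against $0.18 on Sonnet and $0.30 on Opus; multiplied across a ten-call inner loop, that gap is what makes tiered defaults worth configuring.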

05 · Token share versus dollar share: the Anthropic paradox

Weekly rankings sort by tokens. Invoices sort by dollars. The two diverge sharply when paid output pricing spans nearly two orders of magnitude (about $0.40 to $25 per million tokens in the price comparison above). Anthropic's combined Claude family accounted for roughly 12% of weekly tokens on OpenRouter in early June 2026 while contributing an estimated 46% of gross platform revenue, a gap that confuses teams who only read the leaderboard.

| Vendor cluster | Weekly token share (est.) | Weekly revenue share (est.) | Driver |
|---|---|---|---|
| Chinese open MoE (DeepSeek, Hy3, MiMo, Kimi) | 45%+ | 15–20% | Ultra-low $/M; massive context ingestion |
| Anthropic (Opus + Sonnet) | ~12% | ~46% | Premium output pricing; long agent sessions |
| Google Gemini family | ~10% | ~12% | Multimodal coding; mid-tier pricing |
| Free / stealth routes (Owl, Nemotron free) | ~8% | ~0% | Prototype traffic; subsidized experimentation |
| Other Western APIs | Remainder | Remainder | Specialty models, legacy routes |
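One way to read the table is as a revenue-intensity ratio: revenue share divided by token share, where 1.0 means a cluster earns exactly its token weight. The sketch below uses the estimated shares above; `intensity` is a hypothetical helper name and the ratio is this article's framing, not an OpenRouter metric.

```shell
# intensity REVENUE_SHARE_PCT TOKEN_SHARE_PCT -> revenue per unit of token share, two decimals.
# Hypothetical helper; inputs are the estimated shares from the table above.
intensity() {
  awk -v r="$1" -v t="$2" 'BEGIN { printf "%.2f\n", r / t }'
}

echo "anthropic:   $(intensity 46 12)"
echo "google:      $(intensity 12 10)"
echo "chinese-moe: $(intensity 15 45) to $(intensity 20 45)"
```

On these estimates Anthropic earns close to four times its token weight while the Chinese open MoE cluster earns well under half of its own, so a token of Claude traffic carries about an order of magnitude more revenue than a token of open MoE traffic.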

Operationally, this means a naive "route everything to the weekly #1" policy minimizes token spend but may sacrifice quality on customer-visible outputs. Conversely, routing everything to Opus because it "feels safer" burns budget on bulk tasks that V4 Flash already handles at weekly scale. The disciplined approach is tiered routing: cheap MoE defaults for inner agent loops, Sonnet for merge-ready code, Opus only when weekly error logs prove the cheaper tiers fail.
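The "Opus only when error logs prove it" rule can be written as an explicit check on harness results. This is a minimal sketch: `should_escalate` is a hypothetical helper, and the 10% failure cap is an assumed threshold you would tune against your own logs.

```shell
# should_escalate FAILURES RUNS MAX_FAIL_PCT
# Exits 0 (true) when the cheaper tier's failure rate exceeds the cap, 1 otherwise.
# Hypothetical helper; the cap is an assumption, not a figure from OpenRouter data.
should_escalate() {
  awk -v f="$1" -v n="$2" -v cap="$3" \
    'BEGIN { exit !(n > 0 && 100 * f / n > cap) }'
}

# Example: 4 tool failures in 20 runs against a 10% cap.
if should_escalate 4 20 10; then
  echo "escalate: cheaper tier exceeds failure cap"
else
  echo "stay: cheaper tier is within failure cap"
fi
```

Keeping the threshold in one place makes the escalation decision auditable: the weekly CSV supplies the counts, and the cap documents how much failure the team agreed to tolerate before paying premium rates.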

06 · Benchmark versus market inversion (a16z × OpenRouter)

The joint a16z and OpenRouter report on inference markets formalized what weekly data already showed: benchmark leadership and market share inverted in 2026. Closed frontier models still top many eval charts—especially on narrow reasoning suites—while open MoE stacks capture token share through context length, tool-call reliability at scale, and aggressive per-million pricing.

Programming's rise from 11% to above 50% of categorized weekly traffic is the mechanism behind the inversion. Coding agents stress different dimensions than chatbots: repository-scale context, repeated tool invocation, diff application, and terminal interaction. A model tuned for exam-style reasoning may score well on MMLU yet hemorrhage weekly rank when developers pipe entire modules through Cursor or OpenClaw every hour.

For Mac and iOS teams, the inversion has a practical consequence. Your Xcode and Swift workflow is now statistically mainstream on OpenRouter, not niche. Models that lack stable function calling or choke beyond 200K tokens will slide in the weekly board even if Twitter still celebrates their launch-day benchmark deltas. Trust the billing window; use benchmarks as secondary filters, not primary routing inputs. For a broader trend narrative (six structural shifts and scenario matrices), see our companion piece on 2026 LLM trends from OpenRouter rankings—this article stays on the weekly ledger and dollar physics.

07 · Market stratification tiers

Weekly volume clusters into four tiers. Use them as routing buckets rather than chasing every new model ID daily.

| Tier | Weekly token band | Representative models | When to route here |
|---|---|---|---|
| T1: Volume kings | >2T / week | DeepSeek V4 Flash, Hy3 Preview | Default agent loops, RAG ingestion, CI bots |
| T2: Momentum challengers | 1–2T / week | MiMo, Sonnet 4.6, V4 Pro | Regional latency tests; quality step-ups |
| T3: Premium specialists | 0.5–1T / week | Opus 4.7, Gemini 3 Flash | Long-horizon tasks, multimodal analysis |
| T4: Sandbox / free | High tokens, zero revenue | Owl Alpha, Nemotron 3 Super (free) | Teaching, spikes, non-sensitive prototypes |

Tier boundaries are fluid and the bands are approximate: MiMo's estimated ~2.1T already brushes the T1 threshold, and Gemini 3 Flash (~1.2T est.) is grouped in T3 by role rather than by volume. The stratification still keeps gateway configs maintainable. Document which tier each microservice uses, cap T3 spend separately, and never point production customer data at T4 free routes without reviewing stealth logging policies.
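A strict numeric version of the bands can be scripted for the Monday snapshot. The sketch below is an assumption-laden illustration: `tier_for` is a hypothetical helper, the boundary handling (exactly 1T counts as T2) is a choice made here, and a purely volume-based cut will place borderline estimates differently from the role-based table (for example, ~2.1T lands in T1).

```shell
# tier_for WEEKLY_TOKENS_T [IS_FREE] -> T1..T4 (or "untiered" below 0.5T).
# Hypothetical helper; pass "yes" as the second argument for free routes.
tier_for() {
  if [ "${2:-no}" = "yes" ]; then
    echo "T4"
    return 0
  fi
  awk -v t="$1" 'BEGIN {
    if (t > 2)         { print "T1" }
    else if (t >= 1)   { print "T2" }
    else if (t >= 0.5) { print "T3" }
    else               { print "untiered" }
  }'
}

tier_for 3.14       # DeepSeek V4 Flash
tier_for 1.8        # Claude Sonnet 4.6
tier_for 0.9 yes    # Owl Alpha (free route)
```

Running the classifier on each weekly snapshot flags models that crossed a band, which is the cue to review a gateway config rather than reacting to every new model ID.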

Scenario routing matrix (weekly-aware)

| Workload | Primary weekly pick | Fallback | Why billing agrees |
|---|---|---|---|
| Inner agent tool loop (10+ calls) | DeepSeek V4 Flash | Hy3 Preview | Highest weekly tokens; lowest $/M at scale |
| PR-ready Swift diff | Claude Sonnet 4.6 | V4 Pro | Quality tier with moderate weekly volume |
| 12-hour autonomous refactor | Claude Opus 4.7 | Kimi K2.6 (self-host) | Premium $/M justified by error cost |
| Multimodal UI capture | Gemini 3 Flash | Opus 4.7 | Weekly multimodal coding share growing |
| Zero-budget hackathon | Owl Alpha | Nemotron 3 Super (free) | Token volume without revenue: sandbox only |
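The matrix translates directly into a lookup that a gateway wrapper could call. This is a sketch under stated assumptions: `route` and the workload keys (`inner-loop`, `swift-pr`, and so on) are hypothetical names, and the output uses the display names from the matrix rather than OpenRouter model IDs, which you would substitute from your own snapshot.

```shell
# route WORKLOAD -> "primary|fallback", using display names from the matrix above.
# Hypothetical helper and workload keys; replace names with real model IDs in a gateway.
route() {
  case "$1" in
    inner-loop)    echo "DeepSeek V4 Flash|Hy3 Preview" ;;
    swift-pr)      echo "Claude Sonnet 4.6|V4 Pro" ;;
    long-refactor) echo "Claude Opus 4.7|Kimi K2.6 (self-host)" ;;
    multimodal-ui) echo "Gemini 3 Flash|Opus 4.7" ;;
    hackathon)     echo "Owl Alpha|Nemotron 3 Super (free)" ;;
    *)             echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

pair=$(route swift-pr)
echo "primary:  ${pair%%|*}"
echo "fallback: ${pair#*|}"
```

Failing loudly on an unknown workload is deliberate: a silent default would route unclassified traffic to whichever tier happened to be listed first.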

08 · Six-step weekly routing validation on a rented Mac

Weekly data is perishable. Your validation loop should be too: snapshot, test, integrate, archive—on hardware you can wipe. The steps below extend the five-step pattern with an explicit weekly export so you can diff rank changes every Monday.

  1. Snapshot the weekly leaderboard. Before changing routes, save OpenRouter's 7-day rankings (model ID, weekly tokens, $/M). Store alongside your internal spend CSV so you can correlate platform shift with your own invoice.
  2. Rent an isolated macOS node. Book a Mac mini M4 (see bare-metal macOS pricing) and connect over SSH as described in the daily Mac rental FAQ. Create a local user with no production Apple ID.
  3. Configure sandbox routing keys. Place OPENROUTER_API_KEY in a project-scoped .env. Optionally mirror DeepSeek locally with ds4 per our ds4 inference guide to compare cloud weekly #1 against on-device latency.
  4. Run a fixed weekly benchmark harness. Execute the same agent task—read module, edit test, invoke tool—across your tier shortlist. Log prompt tokens, completion tokens, wall time, USD cost, and tool failures. Aim for at least three runs per model to smooth variance.
  5. Integrate production gateways. Point Cursor, OpenClaw, or your custom gateway at the winning IDs. Stress-test 500K–1M context payloads if your repo warrants it; weekly leaders earn rank partly on surviving those loads.
  6. Archive and release. Commit weekly-routing-YYYYMMDD.csv to your internal docs repo (not public), revoke the test key, and wipe the rental per MacDate's return checklist. Schedule the next snapshot in seven days.
# Weekly OpenRouter probe: run on the rented Mac sandbox (requires curl and jq)
export OPENROUTER_API_KEY="sk-or-..."   # sandbox key only; revoke after the run
DATE=$(date +%Y%m%d)
OUT="weekly-bench-$DATE.json"
MODELS=(
  "deepseek/deepseek-v4-flash"
  "tencent/hy3-preview"
  "anthropic/claude-sonnet-4.6"
)
for M in "${MODELS[@]}"; do
  # One chat completion per model; each response is appended as a JSON object
  curl -s https://openrouter.ai/api/v1/chat/completions \
    -H "Authorization: Bearer $OPENROUTER_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$M\",\"messages\":[{\"role\":\"user\",\"content\":\"Refactor the auth module tests.\"}]}" \
    | tee -a "$OUT"
done
# Convert usage fields to real CSV rows for weekly cost tracking
{
  echo "model,prompt_tokens,completion_tokens"
  jq -r '[.model, .usage.prompt_tokens, .usage.completion_tokens] | @csv' "$OUT"
} > "weekly-routing-$DATE.csv"
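Step 4 asks for USD cost per run, which the probe itself does not compute. The sketch below adds an estimated cost column, assuming rows shaped as `model,prompt_tokens,completion_tokens` and the approximate list prices from the pricing table (the Hy3 figures are API estimates). `add_cost` is a hypothetical helper; models missing from its price map are marked `unknown` rather than guessed.

```shell
# add_cost: read "model,prompt_tokens,completion_tokens" rows on stdin and
# append an estimated usd_cost column. Hypothetical helper; prices are assumptions
# taken from the article's approximate $/M table, not from the API response.
add_cost() {
  awk -F, -v OFS=, '
    BEGIN {
      inp["deepseek/deepseek-v4-flash"]  = 0.10; outp["deepseek/deepseek-v4-flash"]  = 0.40
      inp["tencent/hy3-preview"]         = 0.15; outp["tencent/hy3-preview"]         = 0.60
      inp["anthropic/claude-sonnet-4.6"] = 3.00; outp["anthropic/claude-sonnet-4.6"] = 15.00
    }
    { gsub(/"/, "", $1) }    # tolerate quoted model IDs
    NR == 1 && $1 == "model" { print $0, "usd_cost"; next }
    {
      cost = ($1 in inp) ? sprintf("%.6f", ($2 * inp[$1] + $3 * outp[$1]) / 1e6) : "unknown"
      print $1, $2, $3, cost
    }'
}

# Sample rows with assumed token counts, for illustration only.
add_cost <<'EOF'
model,prompt_tokens,completion_tokens
deepseek/deepseek-v4-flash,1200,800
anthropic/claude-sonnet-4.6,1200,800
EOF
```

Note that the model string a provider returns may differ from the ID you requested, so check the `unknown` rows before trusting a weekly total.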

Running this loop on a rented Mac keeps your daily-driver Keychain, npm global CLIs, and OpenClaw production config untouched. It also mirrors how many teams already separate AI workstation TCO experiments from CapEx hardware purchases: convert validation into OpEx, produce evidence, then decide whether to buy a Mac Studio or stay on API routing.

Although you could run the same scripts on a personal MacBook, mixing weekly API experiments with production signing identities is how teams accidentally burn Anthropic quotas on a Tuesday and discover it on invoice day. A disposable macOS node gives you a forensic clean room: if a stealth free model logs prompts, the blast radius stops at the rental. If a new MiMo or Hy3 build drops mid-week, you re-run the harness without uninstalling half your homebrew stack. That operational hygiene is what "billing data does not lie" means in practice—not worshipping the leaderboard, but measuring your own spend against the same weekly window everyone else uses.

When your benchmark CSV shows V4 Flash matching Sonnet on tool success rate at a small fraction of the output cost (list output prices differ by roughly 37x), you have a finance-ready reason to change defaults. When Opus still wins on the twelve-step refactor task, you have a finance-ready reason to keep a T3 tier. Either way, the weekly OpenRouter board gave you the prior; your rented-Mac harness supplied the posterior. That is the decision chain platform leads need in June 2026.

Further Reading