
TL;DR
No single model wins every task anymore, and the companies that never trained one - Factory, Devin, Perplexity, Cursor, OpenCode - are turning that into a moat. This is how model routing works, why open weights and neoclouds make it cheap, and the honest counter-argument.
| Source | What it covers |
|---|---|
| Factory: Factory Router | Documented automatic model-selection architecture |
| OpenRouter: How model routing works | Provider routing, price ceilings, fallback chains |
| a16z: The State of Generative Media 2026 | "No one model to rule them all," orchestration thesis |
| Simon Willison: GLM-5.2 | Concrete open-weights price spread across providers |
| SemiAnalysis: AI value capture | The counter-argument: value shifting back to labs |
For two years the assumed winners of the AI build-out were the labs with the best single model. That assumption is quietly breaking. By mid-2026 there is a top coding model, a top reasoning model, a top agentic-terminal model, a top open-weights model, and a top value model, and no two of them are the same system. When capability fragments like that, the company that picks the right model per task can beat the company that owns any one model. Optionality becomes the product.
This post is about the layer that makes optionality real: the model router. It covers how routing actually works under the hood, why open weights and a new tier of GPU clouds make it cheap, who is building on it (Factory, Cognition's Devin, Perplexity, Cursor, Windsurf, OpenCode), and the honest case that this thesis is overstated.
Last verified: June 20, 2026.
The clearest signal is the spread of per-task leaders. As of mid-June 2026, the independent and vendor-reported benchmark picture shows a different model leading each category, and the point is that diversity, not any one number.
Cross-check any single figure against Artificial Analysis before quoting it, because vendor-reported scores drift. But the shape is not in dispute. a16z's State of Generative Media 2026 puts it bluntly: there is "no one model to rule them all," and enterprise production deployments already use a median of 14 different models. Their framing of the consequence is the thesis of this whole post: "The unit of work isn't one model, it's a workflow," and the orchestration layer "matters as much as the models themselves."
When releases land at the pace they did this June - several significant open-weights models in the first two weeks alone - betting your product on one model is a standing liability. Routing is how you stop betting.
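"Router" gets used loosely, so it helps to separate two distinct jobs, because most real systems do both: first, choosing which model should handle a given task, and second, choosing which provider should serve that model.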

OpenRouter is the cleanest public example of the second job. Its June 2026 write-up describes routing every request across 70-plus providers while the caller controls "the provider order, the price ceiling, and the fallback chain." The motivating failure mode is stated plainly: "If Anthropic rate-limits you mid-traffic, your app shouldn't return a 500." Modes like :nitro (optimize for speed) and :floor (optimize for price), together with an automatic fallback chain, turn provider choice into a dial rather than a hard dependency.
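To make the provider-routing job concrete, here is a minimal sketch of a price ceiling plus a fallback chain. It is not OpenRouter's API or implementation: the `Provider` class, `ProviderError`, the `route` function, and the host names and prices are all hypothetical, chosen only to illustrate the idea of trying hosts in order and never surfacing a single host's failure to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error for rate limits, timeouts, or outages."""


@dataclass
class Provider:
    name: str
    price_per_m_output: float   # USD per million output tokens (illustrative)
    call: Callable[[str], str]  # hypothetical: sends a prompt, returns text


def route(prompt: str, providers: List[Provider],
          price_ceiling: float) -> Tuple[str, str]:
    """Try providers in the caller's order, skipping any above the ceiling."""
    eligible = [p for p in providers if p.price_per_m_output <= price_ceiling]
    errors = []
    for provider in eligible:
        try:
            return provider.name, provider.call(prompt)
        except ProviderError as exc:
            # Record the failure and fall through to the next host.
            errors.append(f"{provider.name}: {exc}")
    raise ProviderError("no eligible provider succeeded: " + "; ".join(errors))


# Stand-in hosts: the first is rate-limited, the second is healthy,
# the third is healthy but priced above the ceiling.
def rate_limited(prompt: str) -> str:
    raise ProviderError("429 rate limited")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


providers = [
    Provider("host-a", 4.40, rate_limited),
    Provider("host-b", 4.40, healthy),
    Provider("host-c", 30.00, healthy),
]
served_by, text = route("hello", providers, price_ceiling=5.00)
```

In this toy run the request lands on `host-b`: `host-a` fails and is skipped, and `host-c` is never tried because it sits above the ceiling. A speed- or price-optimized mode would amount to sorting `eligible` by a different key before the loop.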
Factory is a clear example of the first job, and its documentation is unusually specific about the mechanism. Factory Router runs a fast classifier (around two seconds) over the first user message, recent tool calls, and repository signals, then emits a quality-probability score per candidate model. The signals (the message itself, recent tools, repo size, language mix, and difficulty) are weighted. Candidate models are then sorted from cheapest to most expensive, and the cheapest model that clears a quality threshold wins. Factory also documents auto-failover across providers. Its own benchmarks claim the router holds frontier pass rates while cutting cost by around 25 percent. Treat that figure as vendor-reported: it is measured on Factory's internal suites relative to Claude Opus, not independently verified.
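The selection rule described above, cheapest candidate that clears a quality threshold, can be sketched in a few lines. This is an illustration under stated assumptions, not Factory's code: the classifier is out of scope, so the quality probabilities are taken as given inputs, and the `Candidate` class, the model names, the numbers, and the fallback to the highest-scoring model when nothing clears the bar are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    cost_per_task: float  # illustrative relative cost
    quality_prob: float   # classifier's estimated chance of an acceptable result


def pick_model(candidates: List[Candidate], threshold: float) -> Candidate:
    """Return the cheapest candidate whose score clears the threshold."""
    for candidate in sorted(candidates, key=lambda c: c.cost_per_task):
        if candidate.quality_prob >= threshold:
            return candidate
    # Assumption: if nothing clears the bar, use the highest-scoring model.
    return max(candidates, key=lambda c: c.quality_prob)


candidates = [
    Candidate("frontier-model", 5.0, 0.97),
    Candidate("small-model", 0.2, 0.62),
    Candidate("mid-model", 1.0, 0.88),
]
easy_choice = pick_model(candidates, threshold=0.85)
hard_choice = pick_model(candidates, threshold=0.99)
```

With a threshold of 0.85 the sketch picks `mid-model`, since `small-model` is cheaper but scores too low. With a threshold of 0.99 nothing qualifies, so it falls back to `frontier-model`. Raising or lowering the threshold is the cost-versus-quality dial.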
The common pattern across both: spend a cheap model on easy tasks, escalate to an expensive one only when the work demands it, and never let a single provider outage take you down.
The companies leaning hardest into this share a trait: none of them trained a frontier model, and they treat that as a feature.
The through-line: when you are not married to a lab, every new model release is an upgrade you get for free instead of a competitor you have to answer.
Routing across labs would be a thin advantage if every model cost the same. Open weights break that, because anyone can host an open model and compete on price. The cleanest evidence comes from Simon Willison's June 17 write-up: GLM-5.2 was available "via OpenRouter, which has it from 9 different providers, almost all charging $1.40/M input and $4.40/M output. For comparison, GPT-5.5 is $5/$30 and Claude Opus 4.5 to 4.8 is $5/$25." Nine independent hosts converging on one low price for one open model is optionality and price competition in a single screenshot. The full provider and pricing breakdown is in the GLM-5.2 access guide.
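A quick worked example shows what that price spread means for a bill. The per-million-token prices below are the ones quoted above. The workload (10 million input tokens and 2 million output tokens) is an assumption picked for illustration, and the `workload_cost` helper is hypothetical.

```python
# (input, output) prices in USD per million tokens, as quoted above.
PRICES = {
    "GLM-5.2": (1.40, 4.40),
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus": (5.00, 25.00),
}


def workload_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float) -> float:
    """Cost in USD for a workload at per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in \
        + (output_tokens / 1_000_000) * price_out


# Assumed workload: 10M input tokens, 2M output tokens.
costs = {
    name: round(workload_cost(10_000_000, 2_000_000, p_in, p_out), 2)
    for name, (p_in, p_out) in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

For this assumed workload the open-weights route comes to about $22.80, against $110.00 and $100.00 for the two closed models, a gap of more than four times. The exact ratio shifts with the input-to-output mix, because the output price gap is wider than the input price gap.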
The supply side of that competition is the "neocloud" tier - GPU clouds like CoreWeave, Lambda, Crusoe, Nebius, and inference specialists like Fireworks, Together, and DeepInfra that rent compute and serve open models against the hyperscalers. This is now a real market, not a fringe: CoreWeave alone reported quarterly revenue around $2.1 billion, roughly doubling year over year. More hosts racing to serve the same open weights is exactly what pushes per-token prices toward the floor. a16z's data backs the incentive: 58 percent of organizations name cost optimization as their primary criterion when choosing model infrastructure, ahead of raw availability or speed.
This thesis can be oversold, and a careful builder should hold the other side too.

The balanced read: optionality is a strong structural position in a fragmented market, not a guaranteed win. It pays off most when no single model dominates - which describes mid-2026 well, and may or may not describe 2027.
The frontier labs will keep mattering. But the quiet winners of this phase may be the companies that never trained a model and instead got very good at choosing one.
A model router is a layer that decides which model should handle a given request and, often, which provider should serve it. Some routers select by task (cheap model for easy work, frontier model for hard work), and some select by provider (cheapest or fastest host, with automatic failover). Tools like Factory Router select by task; OpenRouter primarily routes across providers.
Because no single model is best at everything in 2026, a model-agnostic tool can route each task to the strongest option and adopt new models the day they ship, without re-engineering. Platforms like Devin, Perplexity, Cursor, and OpenCode are built this way, which is how they offered GLM-5.2 access almost immediately after release.
When a model's weights are open, any provider can host it and compete on price. GLM-5.2, for example, was served by nine providers on OpenRouter at roughly $1.40 input and $4.40 output per million tokens, versus $5 / $25 to $30 for comparable closed models. Competition among hosts pushes the price toward the cost of compute.
Neoclouds are GPU-focused cloud providers - CoreWeave, Lambda, Crusoe, Nebius, and inference specialists like Fireworks, Together, and DeepInfra - that rent compute and serve open-weights models in competition with the hyperscalers. They are a big reason open-model inference keeps getting cheaper.
Optionality is not a guaranteed win. The counter-case is real: value may shift back toward whoever trains the best models, total token spend can rise even as per-token prices fall, and routing adds its own failure modes. Optionality is strongest while capability stays fragmented, which describes mid-2026 but is not guaranteed to hold.