
TL;DR
A $500M accidental Claude bill and an open-weights model beating GPT-5.5 at one-sixth the cost point to the same conclusion: the margin is moving to the layer that decides when to use which model for what. Here is how routing and orchestration differ, and how to cut your model spend.
| Topic | Source |
|---|---|
| $500M one-month enterprise Claude bill | cybernews.com |
| Anthropic run-rate and enterprise spend | Inc. / Fast Company |
| GLM-5.2 benchmarks and cost | VentureBeat, InfoWorld |
| Factory Router | factory.ai/news/factory-router |
| Perplexity / Aravind Srinivas on orchestration | 20VC, Fortune |
A short version for people who are budgeting right now: most of your tasks do not need a frontier model. The work that decides which model handles which task, and reserves the expensive model for the hard minority, is where the cost savings and the defensibility now live. This post is about that layer, why it is suddenly worth building on top of, and how to start cutting your own spend.
In late May 2026, reports surfaced that an enterprise client ran up roughly a $500 million Claude bill in a single month because it never set per-employee usage caps. Engineers pointed agents at frontier models, the agents looped, and nobody put a ceiling on any of it (cybernews). That is an extreme case, but it is not an isolated one. Uber reportedly burned through its entire 2026 AI budget by April. Microsoft scaled back Claude Code licenses. Anthropic is at roughly a $30 billion annualized run-rate, up from about $9 billion at the end of 2025, with more than 1,000 companies spending over $1M a year (Inc.).
Read those numbers together and a pattern jumps out. The labs are not the ones with a cost problem. They are the ones collecting the bill. The cost problem belongs to everyone building on top of them, and it is getting worse precisely because frontier models are good enough that teams reach for them by default, for everything, without asking whether the task actually needs that horsepower.
That default is the expensive habit. And the thing that fixes it is not a cheaper frontier model. It is a layer that decides, per task, whether you need the frontier model at all.
For a while the argument for always using the best model was simple: the open and cheap models were not good enough for real work, so the price difference did not matter. As of June 2026 that argument is dead.
On June 16, Z.ai released GLM-5.2, a 753-billion-parameter open-weights model under an MIT license. On SWE-bench Pro, a long-horizon autonomous coding benchmark, it scored 62.1, ahead of GPT-5.5 at 58.6. It does this at roughly one-sixth the per-token cost (VentureBeat, InfoWorld).
Sit with that. An open-weights model you can run yourself, or rent for a fraction of the price, beats a frontier proprietary model on a hard agentic coding benchmark. The quality argument for routing everything to the most expensive option no longer holds. When the cheap model is sometimes the better model, paying frontier prices for every token is not caution, it is waste.
This is the structural shift. The performance curves of open and frontier models have converged enough that the interesting question is no longer "which single model is best." It is "which model is best for this specific task, at this moment, given what it costs." That is a routing question, and routing questions need a routing layer.
People use "routing" and "orchestration" interchangeably, and they should not. The difference is the whole point of this post.
Routing picks a model per request. Same task, one model, chosen well. It is a substitution problem: given this prompt and these constraints, which model gives me the best result per dollar? Do the substitution invisibly and the user never knows or cares which model answered.
The cleanest current example is Factory Router. It routes each Droid coding session across models and providers - Claude, DeepSeek, and others - choosing the optimal model per task from a pool of frontier and efficient options. Factory reports roughly 20-25% token savings while holding frontier-level performance, and 99.9%+ request reliability through failover: if a model struggles, the session escalates to a more capable one, and if a provider path goes down, the session keeps running through a healthy path. The droids are model-agnostic and the system is self-learning. That is routing done well - quietly swapping the cheaper model in wherever it is good enough, escalating only when it is not.
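The failover half of that behavior can be sketched in a few lines. This is a minimal illustration, not Factory's implementation: the function name, the provider names, and the stub calls are all hypothetical, and a real client would catch narrower error types than `Exception`.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (provider name, call function)


def call_with_failover(task: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider path in order and return (name, result) for the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(task)
        except Exception as exc:  # assumption: any exception means "this path is unhealthy"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all provider paths failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def flaky_primary(task: str) -> str:
    raise ConnectionError("provider path down")


def healthy_backup(task: str) -> str:
    return f"backup handled: {task}"


used, answer = call_with_failover(
    "fix the failing test",
    [("primary", flaky_primary), ("backup", healthy_backup)],
)
print(used, "->", answer)
```

The ordering of the provider list is the policy here: put the preferred path first and the session keeps running on the next healthy one.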
Orchestration is a larger claim. It does not just pick a model, it decides how the work itself is decomposed: which model, how many agents, how they collaborate, what runs locally versus in the cloud, when to call a tool versus call a model. Routing is a subroutine inside orchestration.
Perplexity's Aravind Srinivas has been the loudest voice here. On Harry Stebbings' 20VC he put it bluntly: "The orchestration is the product. The model is a tool." Perplexity's "Computer" agent does not just route to the cheapest acceptable model. It orchestrates which model handles which sub-task, how multiple agents coordinate, and whether work runs on-device or in the cloud - what Srinivas calls an "omni agent" rather than a router (Fortune, 20VC). His preferred metric tells you everything about where his head is: not tokens, not latency, but "token value per watt per user" - useful output, normalized by energy and by person. That is an orchestration metric, not a routing one.
The short way to hold the two apart: routing optimizes the choice within a fixed shape of work. Orchestration optimizes the shape of the work itself. Factory Router is best-in-class at the first. Perplexity's Computer is aiming at the second.
Here is the argument. The labs are going to keep winning the model race, and they are going to keep capturing enormous revenue doing it. You are not going to out-train Anthropic or OpenAI, and you should not try. But the labs have a structural blind spot, and it is the same one that produced the $500M bill: they make money when you use more of their most expensive model, so they are not the party with the strongest incentive to help you use less of it.
That misaligned incentive is the opening. The orchestration layer wins by doing the thing the labs are not motivated to do: route the cheap, open, good-enough model for the roughly 80% of tasks that do not need frontier reasoning, and reserve the frontier model for the hard 20% where it actually changes the outcome. The value created is the delta between the all-frontier bill and the orchestrated bill, and as the cost gap widens - GLM-5.2 at one-sixth the price is just the latest data point - that delta gets bigger every quarter.
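To put rough numbers on that delta, here is a back-of-the-envelope sketch. The 80/20 split and the one-sixth price ratio come from the figures above. The escalation rate (cheap attempts that fail a quality check and get re-run on the frontier model) is an assumed parameter, the function name is hypothetical, and the model assumes every task uses about the same number of tokens.

```python
def blended_cost(frontier_share: float, cheap_price_ratio: float,
                 escalation_rate: float = 0.0) -> float:
    """Cost of a routed workload relative to an all-frontier bill of 1.0.

    frontier_share:    fraction of tasks sent straight to the frontier model
    cheap_price_ratio: cheap model price divided by frontier price (e.g. 1/6)
    escalation_rate:   fraction of cheap attempts that are re-run on frontier
    """
    cheap_share = 1.0 - frontier_share
    cheap_cost = cheap_share * cheap_price_ratio    # every cheap attempt is paid for
    escalated_cost = cheap_share * escalation_rate  # failed attempts also pay frontier price
    return frontier_share + cheap_cost + escalated_cost


no_escalation = blended_cost(0.20, 1 / 6)
with_escalation = blended_cost(0.20, 1 / 6, escalation_rate=0.10)
print(f"routed bill, no escalations:  {no_escalation:.1%} of all-frontier")
print(f"routed bill, 10% escalations: {with_escalation:.1%} of all-frontier")
```

Under these assumptions the routed bill comes to about a third of the all-frontier bill, and about 41% if one in ten cheap attempts has to escalate. Real workloads will differ, which is why measuring your own split matters.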
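This is a defensible place to build for a few reasons. The layer is model-agnostic, so it gets more valuable as more models are released rather than being threatened by them. It compounds with data, because every routed task teaches it which model fits which work. And its incentives are aligned with the customer, because it makes money by saving the customer money.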
The labs sell horsepower. The orchestration layer sells judgment about when you need it. In a world where horsepower is abundant and cheap horsepower is suddenly competitive, judgment is the scarce thing.
The decision is less mysterious than it sounds. At its core it is a classifier in front of your model call. Here is the shape of it in pseudo-code, with the escalation pattern Factory uses baked in:
def route(task):
    # 1. Cheap, fast triage - estimate task difficulty
    difficulty = classify_difficulty(task)  # a small/cheap model or heuristic
    # 2. Route the easy majority to the cheap, good-enough model
    if difficulty < THRESHOLD:
        result = call_model("glm-5.2", task)  # ~1/6 the cost
        if quality_ok(result, task):
            return result
        # 3. Escalate only on failure - the hard minority
        return call_model("frontier-model", task)  # reserved for the 20%
    # 4. High-difficulty tasks go straight to frontier
    return call_model("frontier-model", task)
Three ideas do most of the work here. First, triage cheaply - the classifier deciding difficulty should itself be small or heuristic, or you have just moved the cost, not removed it. Second, default to cheap and escalate on failure rather than defaulting to frontier and hoping. Third, measure quality, because the whole scheme depends on knowing when the cheap model was not good enough. Without a quality signal you are flying blind, and silent quality regressions are how routing projects lose trust.
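For readers who want something they can run, here is one way the pseudo-code could be fleshed out. It is a sketch under stated assumptions: the `Router` class, the keyword-and-length difficulty heuristic, the 0.5 threshold, and the stub models are all hypothetical stand-ins, and a real quality check would be far stricter than "the answer is non-empty".

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # prompt in, completion out (stand-in for an API call)


@dataclass
class Router:
    cheap: Model
    frontier: Model
    quality_ok: Callable[[str, str], bool]  # (result, task) -> accept the cheap answer?
    threshold: float = 0.5                  # assumed cut-off, tune against real data

    def classify_difficulty(self, task: str) -> float:
        """Toy heuristic in [0, 1]: long prompts and certain keywords look harder."""
        score = min(len(task) / 2000, 1.0)
        if any(word in task.lower() for word in ("refactor", "architecture", "prove")):
            score = max(score, 0.8)
        return score

    def route(self, task: str) -> tuple[str, str]:
        """Return (path_taken, result) so every decision can be logged and measured."""
        if self.classify_difficulty(task) < self.threshold:
            result = self.cheap(task)
            if self.quality_ok(result, task):
                return "cheap", result
            # Cheap model fell short: escalate to the frontier model.
            return "frontier-escalated", self.frontier(task)
        return "frontier", self.frontier(task)


# Stub models: the cheap one "fails" (returns nothing) on prompts containing "tricky".
def cheap_model(task: str) -> str:
    return "" if "tricky" in task else f"cheap answer to: {task}"


def frontier_model(task: str) -> str:
    return f"frontier answer to: {task}"


router = Router(cheap_model, frontier_model,
                quality_ok=lambda result, task: bool(result.strip()))

for prompt in ("rename this variable", "a tricky edge case", "refactor the billing module"):
    print(router.route(prompt)[0], "<-", prompt)
```

Returning the path taken alongside the result is a deliberate choice: the share of "frontier-escalated" outcomes is the quality signal you would watch to tell whether the threshold is set sensibly.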
Orchestration extends this. Instead of one task and one model choice, you decompose a job into sub-tasks, route each one, run some in parallel, run some locally to save on tokens and latency, and have one agent check another's output before you accept it. The routing decision above is the atom. Orchestration is the molecule.
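A toy version of that molecule might look like the following. Everything here is a hypothetical stand-in: `decompose` splits on lines where a real planner would likely use a model, `pick_model` is a one-line substitute for a router, and `run` and `check` fake the model call and the reviewing agent. It also assumes the sub-tasks are independent, which is what makes running them in parallel safe.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(job: str) -> list[str]:
    """Hypothetical planner: one sub-task per non-empty line."""
    return [line.strip() for line in job.splitlines() if line.strip()]


def pick_model(subtask: str) -> str:
    """Stand-in for the per-task routing decision."""
    return "frontier" if "design" in subtask.lower() else "cheap"


def run(model: str, subtask: str) -> str:
    """Stand-in for a model call."""
    return f"[{model}] done: {subtask}"


def check(subtask: str, output: str) -> bool:
    """Stand-in for a second agent reviewing the first agent's output."""
    return subtask in output


def orchestrate(job: str) -> list[str]:
    def solve(subtask: str) -> str:
        output = run(pick_model(subtask), subtask)
        if not check(subtask, output):
            output = run("frontier", subtask)  # reviewer rejected it: retry on frontier
        return output

    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in the same order as the sub-tasks.
        return list(pool.map(solve, decompose(job)))


results = orchestrate("Design the schema\nWrite the migration\nUpdate the README")
for line in results:
    print(line)
```

Note that the routing decision appears as a single call inside `solve`; the decomposition, the parallelism, and the review step are what orchestration adds around it.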
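You do not need to build Perplexity's Computer to benefit from this. In rough order of effort and impact: set hard per-user and per-project spend caps, measure what fraction of your calls actually need a frontier model, adopt a router that sends easy tasks to cheaper models automatically, default to open-weights models for the majority of work, and build a quality gate so you escalate to the frontier model only when the cheap one falls short.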
The labs built the engines. The interesting work now is in the layer that decides which engine to start, and when to leave it off. As the bills climb and the cheap models get good, that layer stops being a nice-to-have optimization and starts being the product.
AI model routing is a layer that picks which model handles each request based on the task's difficulty and cost. Instead of sending every prompt to the most expensive frontier model, a router sends the easy majority to a cheaper or open-weights model and reserves the frontier model for the hard minority. Factory Router, for example, reports 20-25% token savings on coding sessions while holding frontier-level performance. The savings scale with the cost gap between models, which is widening as open-weights models like GLM-5.2 reach frontier-level quality at a fraction of the price.
Routing picks the best model for a single request - it optimizes the choice within a fixed shape of work. Orchestration is larger: it decides how the work itself is decomposed, how many agents run, how they collaborate, what runs locally versus in the cloud, and when to call a tool versus a model. Routing is a subroutine inside orchestration. Factory Router is a strong example of routing; Perplexity's "Computer" agent, which its CEO describes as an "omni agent," is aiming at full orchestration.
Start with hard per-user and per-project spend caps in your provider dashboard - the reported $500M one-month Claude bill happened because none were set. Then measure what fraction of your calls actually need a frontier model; most teams overestimate it. Adopt a router to automatically send easy tasks to cheaper models, default to open-weights models for the majority of work, and build a quality gate so you escalate to the frontier model only when the cheap one falls short.
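Provider-side caps are the first line of defense. As a complement, an application-side guard can refuse a call before it is made. The sketch below is illustrative only: `SpendGuard` and `BudgetExceeded` are hypothetical names, the $100 cap is arbitrary, and it assumes you can estimate a call's cost up front (for example from a token estimate and your price sheet).

```python
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a call would push a user past their cap."""


class SpendGuard:
    def __init__(self, per_user_cap_usd: float):
        self.cap = per_user_cap_usd
        self.spent = defaultdict(float)  # user -> dollars spent so far

    def charge(self, user: str, cost_usd: float) -> float:
        """Record a call's cost and return the remaining budget.

        Raises BudgetExceeded, without recording anything, if the call
        would take the user over the cap.
        """
        if self.spent[user] + cost_usd > self.cap:
            raise BudgetExceeded(f"{user} would exceed the ${self.cap:.2f} cap")
        self.spent[user] += cost_usd
        return self.cap - self.spent[user]


guard = SpendGuard(per_user_cap_usd=100.0)
print("remaining:", guard.charge("alice", 60.0))
try:
    guard.charge("alice", 50.0)  # 60 + 50 > 100, so this call is refused
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Checking before the call rather than after is the point: a looping agent hits the ceiling on its next request instead of on next month's invoice.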
Open-weights models can now stand in for frontier models on a large and growing class of tasks. As of June 2026, Z.ai's open-weights GLM-5.2 scored 62.1 on SWE-bench Pro, ahead of GPT-5.5 at 58.6, at roughly one-sixth the per-token cost. The right approach is not all-or-nothing: route the easy majority of tasks to the cheaper model, measure quality, and escalate to a frontier model only for the hard minority where it changes the outcome.
The labs profit when you use more of their most expensive model, so they have little incentive to help you use less of it. That misalignment is the opening. An orchestration layer is model-agnostic, so it gets more valuable as more models are released rather than being threatened by them. It compounds with data - every routed task teaches it which model fits which work - and its incentives are aligned with the customer because it makes money by saving the customer money. In a year of escalating AI bills, that is an easy sale.