
TL;DR
A code-heavy field guide to model routing. Real, runnable-style configs for tiering tasks by complexity, routing simple work to open-weights, reserving frontier models for hard reasoning, building failover chains, and keeping prompt caches warm with OpenRouter, LiteLLM, and Factory Router.
| Source | What it covers |
|---|---|
| OpenRouter: Model Fallbacks | The models array and automatic model fallback behaviour |
| OpenRouter: Provider Routing | provider block fields: sort, order, only, max_price, allow_fallbacks |
| OpenRouter: Prompt Caching | Cache-aware sticky routing and cache discount behaviour |
| LiteLLM: Fallbacks | router_settings.fallbacks YAML syntax and fallback types |
| LiteLLM: Auto Routing | Complexity-based tier routing on the proxy |
| Factory: Factory Router | Managed automatic routing for coding agents |
If you have read our model routing and the orchestration layer piece, this is the hands-on companion. That post argues why routing is the control plane of an AI-native stack. This one is the recipe book: concrete configs you can paste, adapt, and ship.
Last verified: June 17, 2026.
Most requests do not need your most expensive model, so route by the cheapest model that can do the job, and only escalate when it cannot.
That sounds obvious. The reason teams overspend anyway is that the default path in almost every SDK is "send everything to the one model I hardcoded." Routing is the discipline of replacing that hardcoded model with a small decision: classify the task, pick a tier, and keep a fallback in your pocket. The savings are not marginal. Sending a one-line commit-message generation request to a frontier model instead of a cheap open-weights model can cost 20 to 50 times more for output that no human can tell apart.
The patterns below build up from the simplest useful thing to a full tiered gateway.
The lowest-effort win is a fallback chain. You name a primary model and a backup. If the primary errors out (rate limit, downtime, a moderation refusal), the request automatically retries on the next model in the list. This is reliability first, but it also lets you put a cheaper model as the primary and a frontier model only as the safety net.
OpenRouter exposes this as a models array on the request body. It walks the list in order and returns the first success.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [
      "deepseek/deepseek-v4",
      "z-ai/glm-5.2",
      "anthropic/claude-sonnet-latest"
    ],
    "messages": [{"role": "user", "content": "Summarize this changelog."}]
  }'
Order matters: the list is your priority order, so put the reliable, capable floor model last. OpenRouter's fallbacks parameter (the Anthropic-SDK-compatible field) caps at 3 entries; the models array is the more flexible native form. (OpenRouter docs)
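To make "walk the list in order and return the first success" concrete, here is a minimal client-side sketch of the same logic. It is not OpenRouter's implementation: `first_success`, `AllModelsFailed`, and the `call` parameter are hypothetical names, and `call` stands in for whatever function sends one request to one model in your stack.

```python
from typing import Callable, Sequence, Tuple


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain errored."""


def first_success(models: Sequence[str], call: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in priority order; return (model, answer) for the first success."""
    errors = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # rate limit, downtime, refusal, ...
            errors.append(f"{model}: {exc}")
    # Nothing succeeded: surface every failure so the caller can see the whole chain
    raise AllModelsFailed("; ".join(errors))
```

Returning the model name alongside the answer matters in practice: with a cheap primary and a frontier safety net, you want to log how often the safety net is actually the one answering.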
The LiteLLM equivalent lives in the proxy config, where fallbacks are a map from a primary model name to an ordered list of replacements:
# litellm-config.yaml
model_list:
  - model_name: deepseek-v4
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"deepseek-v4": ["claude-sonnet"]}]
LiteLLM also ships specialized fallback types using the same syntax: context_window_fallbacks (escalate when the input overflows the cheap model's window) and content_policy_fallbacks (escalate on a moderation refusal). Those two are quietly the most useful, because they catch the exact cases where a cheap model legitimately cannot finish. (LiteLLM docs)
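The idea behind the specialized fallback types can be sketched in a few lines: look at why the call failed, and pick the replacement list for that kind of failure before falling back to the general one. This is an illustration of the concept, not LiteLLM's internals. The exception classes, the `wide-context-model` alias, and `fallbacks_for` are all hypothetical names.

```python
class ContextWindowExceeded(Exception):
    """Hypothetical error: the input did not fit the model's context window."""


class ContentPolicyRefusal(Exception):
    """Hypothetical error: the provider refused on moderation grounds."""


# General fallbacks, used for any failure without a more specific rule
GENERAL = {"deepseek-v4": ["claude-sonnet"]}

# Failure-specific fallbacks, checked first
BY_FAILURE = {
    ContextWindowExceeded: {"deepseek-v4": ["wide-context-model"]},
    ContentPolicyRefusal: {"deepseek-v4": ["claude-sonnet"]},
}


def fallbacks_for(model: str, exc: Exception) -> list:
    """Return the ordered replacement list for this model and failure."""
    for exc_type, table in BY_FAILURE.items():
        if isinstance(exc, exc_type) and model in table:
            return table[model]
    return GENERAL.get(model, [])
```

The design point is that an overflow should jump straight to a model with a wider window, rather than retrying on a peer that would overflow too.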
A fallback chain reacts to failure. Tiering is proactive: you decide up front which class of model a request deserves, so the cheap path is the default and the expensive path is a deliberate choice.
The cleanest mental model is three or four named tiers, each mapped to a model:
| Tier | Use for | Example model |
|---|---|---|
| simple | Classification, extraction, short summaries, commit messages | an open-weights small model |
| medium | Standard codegen, refactors, structured drafting | DeepSeek V4 / GLM-5.2 |
| complex | Multi-step reasoning, ambiguous specs, architecture | a frontier model |
| reasoning | Hard math, long-horizon planning, tricky debugging | a frontier reasoning model |
LiteLLM's proxy supports complexity-based auto routing where you declare tiers and let the proxy score the request and pick one:
# litellm-config.yaml (tiered auto routing)
router_settings:
  complexity_router_config:
    tiers:
      simple: glm-5.2-air
      medium: deepseek-v4
      complex: claude-sonnet
      reasoning: claude-opus
If you want full control rather than the proxy's built-in scorer, do the classification yourself with the cheapest model in your fleet, then dispatch. This is the pattern I reach for most because the routing logic is auditable and lives in your code:
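import os

from openai import OpenAI

# Assumption: an OpenAI-compatible client pointed at OpenRouter.
# Swap base_url and api_key for whichever gateway you actually use.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

TIERS = {
    "simple": "z-ai/glm-5.2-air",
    "medium": "deepseek/deepseek-v4",
    "complex": "anthropic/claude-sonnet-latest",
    "reasoning": "anthropic/claude-opus-latest",
}


def classify(task: str) -> str:
    """Use the cheapest model to bucket the task. One word of output."""
    rubric = (
        "Reply with exactly one word: simple, medium, complex, or reasoning. "
        "simple = extraction/classification/short summary. "
        "medium = standard codegen or refactor. "
        "complex = ambiguous multi-step work. "
        "reasoning = hard math, planning, or subtle debugging.\n\n"
        f"TASK:\n{task}"
    )
    resp = client.chat.completions.create(
        model=TIERS["simple"],
        messages=[{"role": "user", "content": rubric}],
        max_tokens=5,  # one word, but a label like "reasoning" can span several tokens
        temperature=0,  # keep the label as repeatable as the model allows
    )
    # content can be None; also tolerate a trailing period on the label
    tier = (resp.choices[0].message.content or "").strip().lower().strip(".")
    return tier if tier in TIERS else "medium"  # safe default


def route(task: str) -> str:
    tier = classify(task)
    resp = client.chat.completions.create(
        model=TIERS[tier],
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content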
Two things make this pay off. First, the classifier call is nearly free: a single word of output from your cheapest model. Second, the safe default is medium, not complex. When in doubt, you spend like a workhorse, not like a flagship. A miss costs you a slightly worse answer on a cheap model, not a 30x bill, and the escalation pattern (Pattern 3) catches the genuine misroutes anyway.
A cheaper variant skips the LLM classifier entirely and uses heuristics: input token count, presence of code fences, keywords like "prove", "design", or "why". Heuristics are free and surprisingly good for high-volume pipelines where even a one-word classifier call adds up.
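A heuristic classifier along those lines might look like the sketch below. The keyword lists and token thresholds are assumptions chosen for illustration, not tuned values, and `classify_heuristic` is a hypothetical name. The token estimate is a rough characters-divided-by-four rule of thumb, not a real tokenizer.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

REASONING_WORDS = re.compile(r"\b(prove|derive)\b", re.IGNORECASE)
COMPLEX_WORDS = re.compile(r"\b(design|architecture|why)\b", re.IGNORECASE)


def classify_heuristic(task: str) -> str:
    """Bucket a task into a tier using only cheap string checks."""
    approx_tokens = len(task) // 4  # crude estimate, good enough for bucketing
    if REASONING_WORDS.search(task):
        return "reasoning"
    if COMPLEX_WORDS.search(task) or approx_tokens > 4000:
        return "complex"
    if FENCE in task or approx_tokens > 500:
        return "medium"
    return "simple"
```

Keyword rules like these will misfire on some inputs (a commit message that happens to contain "why", for instance), so it is worth replaying a sample of real traffic through the function before trusting it.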
Tiering picks a starting point. Escalation handles the case where the cheap model starts but cannot finish well. Combine the two: route to the cheap tier, validate the output, and climb a tier on failure rather than just retrying the same level.
LADDER = ["medium", "complex", "reasoning"]

# TIERS is the tier-to-model map from the classifier example above.
# call_model() and passes_local_checks() are helpers you supply for your own stack.


def is_good_enough(task: str, answer: str) -> bool:
    """Cheap validator: schema check, test run, or a tiny LLM judge."""
    if not answer or len(answer) < 10:
        return False
    # e.g. for codegen: run the generated tests; for JSON: validate the schema
    return passes_local_checks(answer)


def route_with_escalation(task: str, start: str = "medium") -> str:
    # A start tier that is not on the ladder (including "simple") begins at the bottom rung
    start_idx = LADDER.index(start) if start in LADDER else 0
    for tier in LADDER[start_idx:]:
        answer = call_model(TIERS[tier], task)
        if is_good_enough(task, answer):
            return answer
    return answer  # exhausted the ladder, return best effort
This is essentially what managed routers do under the hood. Factory Router describes exactly this: it picks an efficient model for each Droid session and "moves the session to a more capable model" if the first one struggles, which Factory says cuts token spend 20 to 25 percent while holding frontier-level quality. If you do not want to build and tune the ladder yourself, a managed router buys you that escalation logic. If you do build it, the lever that matters most is your is_good_enough check, because a weak validator either escalates too often (no savings) or too rarely (bad output ships).
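Since the validator is the lever, here is one concrete shape an `is_good_enough`-style check could take for tasks that must return JSON. It is a sketch under stated assumptions: the function name and the `required` mapping are hypothetical, and a real pipeline might use a full schema library instead of this key-and-type check.

```python
import json


def is_valid_json_answer(answer: str, required: dict) -> bool:
    """Accept the answer only if it parses as a JSON object with the expected fields.

    `required` maps field names to the Python type each value must have.
    """
    try:
        data = json.loads(answer)
    except (TypeError, ValueError):  # not a string, or not valid JSON
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

A check like this is deterministic and costs nothing per call, which makes it easy to see whether the ladder is escalating too often or too rarely.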
Once a model is open-weights, many providers serve it, and prices vary widely. OpenRouter lets you pin a price ceiling and a provider preference per request through a provider block, so you ride the cheapest qualifying host without giving up a quality floor:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-ai/glm-5.2",
    "provider": {
      "sort": "price",
      "max_price": { "prompt": 1, "completion": 3 },
      "allow_fallbacks": true
    },
    "messages": [{"role": "user", "content": "Refactor this function."}]
  }'
sort: "price" orders providers cheapest-first, max_price (in dollars per million tokens) refuses anyone over your ceiling, and allow_fallbacks: true keeps the request alive if your top pick is down. (OpenRouter provider routing) This is the routing dimension people forget: provider failover and model fallback are two independent decisions. Provider failover steers around an outage on the same model; model fallback swaps to a different model. You usually want both.
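The interaction of those three fields is easier to see as code. The sketch below models the described behaviour only: filter by the price ceiling, order cheapest-first, then decide what happens when the top pick is down. It is not OpenRouter's selection algorithm, and the `Provider` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    prompt_price: float      # USD per million prompt tokens
    completion_price: float  # USD per million completion tokens
    up: bool = True


def rank_providers(providers: List[Provider], max_prompt: float, max_completion: float) -> List[Provider]:
    """Drop anyone over the ceiling, then order cheapest-first."""
    within = [
        p for p in providers
        if p.prompt_price <= max_prompt and p.completion_price <= max_completion
    ]
    return sorted(within, key=lambda p: (p.prompt_price, p.completion_price))


def pick_provider(providers: List[Provider], max_prompt: float, max_completion: float,
                  allow_fallbacks: bool = True) -> Optional[Provider]:
    """Return the provider to use, or None if no qualifying provider is available."""
    ranked = rank_providers(providers, max_prompt, max_completion)
    if not ranked:
        return None
    if not allow_fallbacks:
        # Without fallbacks, only the top pick is eligible
        return ranked[0] if ranked[0].up else None
    return next((p for p in ranked if p.up), None)
```

Note that the price ceiling applies before any failover: a fallback provider still has to be under your `max_price`, so a tight ceiling plus an outage can leave you with no host at all.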
The biggest single line item in most agentic workloads is the static prefix you resend on every turn: the system prompt plus injected workspace files. Prompt caching lets the provider keep that prefix warm so cache hits are billed at a steep discount, and on Anthropic models served through OpenRouter that can cut input cost by roughly 90 percent on a hit. (OpenRouter prompt caching)
The routing implication is subtle but important: caching only pays off if subsequent requests land on the same provider that holds the warm cache. OpenRouter handles this with sticky routing: after a cached request, it remembers which provider served you and routes follow-ups for that model back to it. The takeaway for your config is to avoid fighting that stickiness. If you aggressively re-sort providers by price on every single turn of a long agent loop, you can route away from your own warm cache and pay full price on what should have been a cache hit. For long-lived sessions, set the cache breakpoints on your stable prefix and let the gateway keep you on one provider.
# Anthropic-style cache_control on the stable prefix
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_SYSTEM_PROMPT + injected_repo_context,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
    },
    {"role": "user", "content": turn_specific_question},  # this part varies
]
Pair this with our DeepSeek cache-first agent notes if you are building loops where the same context is read many times.
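A quick worked calculation shows why losing the warm cache hurts. The numbers are illustrative assumptions: a 50,000-token static prefix, a 500-token turn, an input price of $3 per million tokens, and the roughly 90 percent hit discount mentioned above. The sketch ignores any surcharge for writing to the cache, and `input_cost_usd` is a hypothetical helper.

```python
def input_cost_usd(prefix_tokens: int, turn_tokens: int, price_per_million: float,
                   cache_hit: bool, hit_discount: float = 0.90) -> float:
    """Input cost of one turn: the prefix is discounted on a hit, the new turn never is."""
    prefix_rate = price_per_million * (1 - hit_discount) if cache_hit else price_per_million
    return (prefix_tokens * prefix_rate + turn_tokens * price_per_million) / 1_000_000


miss = input_cost_usd(50_000, 500, 3.0, cache_hit=False)  # about $0.1515
hit = input_cost_usd(50_000, 500, 3.0, cache_hit=True)    # about $0.0165
print(f"miss=${miss:.4f} hit=${hit:.4f}")
```

Under these assumptions a single turn routed away from the warm cache costs about nine times as much input as one that stays put, which is usually more than a slightly cheaper provider would have saved.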
Here is a single LiteLLM proxy config that combines tiered model groups, fallbacks, and a cost-conscious default. Point your app at the proxy's OpenAI-compatible endpoint and call the tier names like models.
# litellm-config.yaml - unified routing gateway
model_list:
  - model_name: tier-simple
    litellm_params:
      model: openrouter/z-ai/glm-5.2-air
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-medium
    litellm_params:
      model: openrouter/deepseek/deepseek-v4
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-complex
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: tier-reasoning
    litellm_params:
      model: anthropic/claude-opus-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  # if a tier fails, climb to the next one up
  fallbacks:
    - {"tier-simple": ["tier-medium"]}
    - {"tier-medium": ["tier-complex"]}
    - {"tier-complex": ["tier-reasoning"]}
  # overflowing the cheap window? jump to a wide-context model
  context_window_fallbacks:
    - {"tier-simple": ["tier-complex"]}
  num_retries: 2
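Your application code stays trivial. It asks for a tier; the gateway owns retries, fallbacks, and provider selection. Keep in mind that proxy fallbacks fire on errors, not on weak answers, so quality-based escalation still needs a validator in your own code: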
from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy
client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm")

answer = client.chat.completions.create(
    model="tier-simple",  # start cheap; the proxy climbs a tier if the call fails
    messages=[{"role": "user", "content": task}],
).choices[0].message.content
| Approach | Best when | Watch out for |
|---|---|---|
| Hardcoded fallback models array (OpenRouter) | You want reliability today with one config line | Does not pick cheaper models proactively |
| Self-classified tiers in app code | You want auditable, custom routing logic | You own the classifier and validator quality |
| LiteLLM proxy with tier groups | You run many apps and want one control plane | One more service to operate and monitor |
| Managed router (e.g. Factory Router) | You want escalation tuned for you | Less control; trust the vendor's quality bar |
There is no single right answer. High-volume, well-understood pipelines reward custom tiering because you can tune heuristics to your exact traffic. Agentic coding tools, where task difficulty is wildly variable per session, are exactly where managed escalation earns its keep.
Routing only matters because the price spread between tiers is enormous. Our GLM-5.2 cost math and DeepSeek V4 economics breakdowns show open-weights coding models landing at roughly one-sixth the per-token price of frontier models for comparable coding quality on real benchmarks. If 70 percent of your traffic is genuinely "simple" or "medium" work, routing that share off the frontier is the difference between a sustainable bill and a budget blowout.
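The arithmetic behind that claim is short enough to write down. This sketch assumes the cheap tier costs one-sixth of the frontier price per token and that both tiers use the same number of tokens per request, which real traffic will not match exactly. `blended_cost_ratio` is a hypothetical helper.

```python
def blended_cost_ratio(cheap_share: float, cheap_price_ratio: float) -> float:
    """Total spend after routing, as a fraction of the all-frontier bill."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price_ratio + (1.0 - cheap_share)


ratio = blended_cost_ratio(cheap_share=0.70, cheap_price_ratio=1 / 6)
print(f"bill is {ratio:.0%} of all-frontier, a saving of {1 - ratio:.0%}")
```

With 70 percent of traffic moved, the bill lands at roughly 42 percent of the all-frontier figure, a saving of about 58 percent. The remaining 30 percent of frontier traffic dominates the total, so past that point the bigger lever is shrinking the frontier share, not finding an even cheaper cheap tier.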
Routing is a spend lever, but it is not a spend cap. A misrouted loop or a runaway agent can still burn money fast even on cheap models. Pair every routing config with hard ceilings, per-key budgets, and alerts as described in our spend guardrails playbook. Routing decides which model; guardrails decide when to stop.
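As a minimal illustration of "guardrails decide when to stop", a hard ceiling can be as small as the class below. It is a sketch, not a replacement for gateway-level budgets: `BudgetGuard` and `BudgetExceeded` are hypothetical names, and the cost you pass to `charge` has to come from your own usage accounting.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the ceiling."""


class BudgetGuard:
    """Track spend against a hard ceiling and refuse calls that would exceed it."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> float:
        """Record a cost and return the remaining budget, or raise if over the limit."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${cost_usd:.4f} would exceed the ${self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost_usd
        return self.limit_usd - self.spent_usd
```

One guard per API key or per agent session gives you the per-key ceiling described above; the important property is that the check runs before the spend is recorded, so a runaway loop stops at the limit rather than one call past it.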
Model routing is the practice of choosing which AI model handles each request at runtime instead of hardcoding one model everywhere. The goal is to send cheap, simple work to cheap models and reserve expensive frontier models for the requests that genuinely need them.
Routing does not have to cost you output quality, provided you tier carefully and validate. The point of complexity tiering is that simple tasks like extraction or short summaries get results from a cheap model that are indistinguishable in practice. Escalation patterns (Pattern 3) catch the genuine misroutes by climbing to a stronger model when a cheap one cannot pass your quality check.
Use OpenRouter's models array for the quickest reliability win, a self-hosted LiteLLM proxy when you want one auditable control plane across many apps, and a managed router like Factory Router when you want escalation logic tuned for you without building it. They are not mutually exclusive; many teams run LiteLLM in front of OpenRouter.
How much routing saves depends entirely on your traffic mix and the price spread between tiers. Factory reports its router cuts token spend 20 to 25 percent while holding frontier quality. Teams that route a majority of simple traffic to open-weights models, which run at roughly one-sixth of frontier per-token prices, see larger swings. The savings scale with the share of work that does not need a frontier model.
Cache-aware (sticky) routing is routing that keeps you on the same provider that holds your warm prompt cache. Caching the static prefix (system prompt plus injected files) can cut input costs dramatically, but only on a cache hit. If your router re-shuffles providers every turn, you route away from your own warm cache and lose the discount. Sticky routing avoids that.