Model Routing Recipes: Practical Config Patterns to Cut AI Spend

Q: Should I use OpenRouter, LiteLLM, or a managed router?

Use OpenRouter's `models` array for the quickest reliability win, a self-hosted LiteLLM proxy when you want one auditable control plane across many apps, and a managed router like Factory Router when you want escalation logic tuned for you without building it. They are not mutually exclusive; many teams run LiteLLM in front of OpenRouter.

Last updated: June 24, 2026

Official Sources#

Source	What it covers
OpenRouter: Model Fallbacks	The `models` array and automatic model fallback behaviour
OpenRouter: Provider Routing	`provider` block fields: `sort`, `order`, `only`, `max_price`, `allow_fallbacks`
OpenRouter: Prompt Caching	Cache-aware sticky routing and cache discount behaviour
LiteLLM: Fallbacks	`router_settings.fallbacks` YAML syntax and fallback types
LiteLLM: Auto Routing	Complexity-based tier routing on the proxy
LiteLLM: Routing	Router configuration and model group behavior
OpenRouter: Provider Routing	Current provider routing concepts and provider selection controls
Vercel AI Gateway	Managed gateway pattern for model access, observability, and provider routing
Factory: Factory Router	Managed automatic routing for coding agents

If you have read our model routing and the orchestration layer piece, this is the hands-on companion. That post argues why routing is the control plane of an AI-native stack. This one is the recipe book: concrete configs you can paste, adapt, and ship.

For the tool-selection layer, pair this with LLM routers compared. For the infrastructure version, read Models.dev model routing and Envoy AI Gateway.

The Core Idea in One Sentence#

Most requests do not need your most expensive model, so route by the cheapest model that can do the job, and only escalate when it cannot.

That sounds obvious. The reason teams overspend anyway is that the default path in almost every SDK is "send everything to the one model I hardcoded." Routing is the discipline of replacing that hardcoded model with a small decision: classify the task, pick a tier, and keep a fallback in your pocket. The savings are not marginal. Sending a one-line commit-message generation request to a frontier model instead of a cheap open-weights model can cost 20 to 50 times more for output that no human can tell apart.

The patterns below build up from the simplest useful thing to a full tiered gateway.

Pattern 1: Static Fallback Chain (the floor)#

The lowest-effort win is a fallback chain. You name a primary model and a backup. If the primary errors out (rate limit, downtime, a moderation refusal), the request automatically retries on the next model in the list. This is reliability first, but it also lets you put a cheaper model as the primary and a frontier model only as the safety net.

OpenRouter exposes this as a models array on the request body. It walks the list in order and returns the first success.

Terminal

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [
      "deepseek/deepseek-v4",
      "z-ai/glm-5.2",
      "anthropic/claude-sonnet-latest"
    ],
    "messages": [{"role": "user", "content": "Summarize this changelog."}]
  }'

Order matters: the list is your priority order, so put the reliable, capable floor model last. OpenRouter's fallbacks parameter (the Anthropic-SDK-compatible field) caps at 3 entries; the models array is the more flexible native form. (OpenRouter docs)

The LiteLLM equivalent lives in the proxy config, where fallbacks are a map from a primary model name to an ordered list of replacements:

YAML

# litellm-config.yaml
model_list:
  - model_name: deepseek-v4
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"deepseek-v4": ["claude-sonnet"]}]

LiteLLM also ships specialized fallback types using the same syntax: context_window_fallbacks (escalate when the input overflows the cheap model's window) and content_policy_fallbacks (escalate on a moderation refusal). Those two are quietly the most useful, because they catch the exact cases where a cheap model legitimately cannot finish. (LiteLLM docs)

Pattern 2: Tier by Task Complexity#

A fallback chain reacts to failure. Tiering is proactive: you decide up front which class of model a request deserves, so the cheap path is the default and the expensive path is a deliberate choice.

The cleanest mental model is three or four named tiers, each mapped to a model:

Tier	Use for	Example model
`simple`	Classification, extraction, short summaries, commit messages	an open-weights small model
`medium`	Standard codegen, refactors, structured drafting	DeepSeek V4 / GLM-5.2
`complex`	Multi-step reasoning, ambiguous specs, architecture	a frontier model
`reasoning`	Hard math, long-horizon planning, tricky debugging	a frontier reasoning model

LiteLLM's proxy supports complexity-based auto routing where you declare tiers and let the proxy score the request and pick one:

YAML

# litellm-config.yaml (tiered auto routing)
router_settings:
  complexity_router_config:
    tiers:
      simple:    glm-5.2-air
      medium:    deepseek-v4
      complex:   claude-sonnet
      reasoning: claude-opus

(LiteLLM auto routing)

If you want full control rather than the proxy's built-in scorer, do the classification yourself with the cheapest model in your fleet, then dispatch. This is the pattern I reach for most because the routing logic is auditable and lives in your code:

Python

TIERS = {
    "simple":    "glm-5.2-air",
    "medium":    "deepseek/deepseek-v4",
    "complex":   "anthropic/claude-sonnet-latest",
    "reasoning": "anthropic/claude-opus-latest",
}

def classify(task: str) -> str:
    """Use the cheapest model to bucket the task. One token of output."""
    rubric = (
        "Reply with exactly one word: simple, medium, complex, or reasoning. "
        "simple = extraction/classification/short summary. "
        "medium = standard codegen or refactor. "
        "complex = ambiguous multi-step work. "
        "reasoning = hard math, planning, or subtle debugging.\n\n"
        f"TASK:\n{task}"
    )
    resp = client.chat.completions.create(
        model=TIERS["simple"],
        messages=[{"role": "user", "content": rubric}],
        max_tokens=1,
    )
    tier = resp.choices[0].message.content.strip().lower()
    return tier if tier in TIERS else "medium"  # safe default

def route(task: str) -> str:
    tier = classify(task)
    return client.chat.completions.create(
        model=TIERS[tier],
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

Two things make this pay off. First, the classifier call is nearly free: one token of output from your cheapest model. Second, the safe default is medium, not complex. When in doubt, you spend like a workhorse, not like a flagship. A miss costs you a slightly worse answer on a cheap model, not a 30x bill, and Pattern 3 catches the genuine misroutes anyway.

A cheaper variant skips the LLM classifier entirely and uses heuristics: input token count, presence of code fences, keywords like "prove", "design", or "why". Heuristics are free and surprisingly good for high-volume pipelines where even a one-token classifier call adds up.

From the archive

Omnigent: Databricks' Meta-Harness for Orchestrating Claude Code, Codex, and Custom Agents

Jun 17, 2026 • 8 min read

Codex Gets Computer Use in the EU - and a Clean Claude Code Import

Jun 17, 2026 • 7 min read

'The Orchestration Is the Product': What Perplexity's Aravind Srinivas Sees That the Model Labs Don't

Jun 17, 2026 • 11 min read

RFC 10008: The New HTTP QUERY Method Explained

Jun 17, 2026 • 6 min read

Pattern 3: Tier With Escalation (failover that climbs)#

Tiering picks a starting point. Escalation handles the case where the cheap model starts but cannot finish well. Combine the two: route to the cheap tier, validate the output, and climb a tier on failure rather than just retrying the same level.

Python

LADDER = ["medium", "complex", "reasoning"]

def is_good_enough(task: str, answer: str) -> bool:
    """Cheap validator: schema check, test run, or a tiny LLM judge."""
    if not answer or len(answer) < 10:
        return False
    # e.g. for codegen: run the generated tests; for JSON: validate the schema
    return passes_local_checks(answer)

def route_with_escalation(task: str, start: str = "medium") -> str:
    start_idx = LADDER.index(start) if start in LADDER else 0
    for tier in LADDER[start_idx:]:
        answer = call_model(TIERS[tier], task)
        if is_good_enough(task, answer):
            return answer
    return answer  # exhausted the ladder, return best effort

This is essentially what managed routers do under the hood. Factory Router describes exactly this: it picks an efficient model for each Droid session and "moves the session to a more capable model" if the first one struggles, which Factory says cuts token spend 20 to 25 percent while holding frontier-level quality. If you do not want to build and tune the ladder yourself, a managed router buys you that escalation logic. If you do build it, the lever that matters most is your is_good_enough check, because a weak validator either escalates too often (no savings) or too rarely (bad output ships).

Pattern 4: Cost-Ceiling and Provider Routing#

Once a model is open-weights, many providers serve it, and prices vary widely. OpenRouter lets you pin a price ceiling and a provider preference per request through a provider block, so you ride the cheapest qualifying host without giving up a quality floor:

Terminal

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-ai/glm-5.2",
    "provider": {
      "sort": "price",
      "max_price": { "prompt": 1, "completion": 3 },
      "allow_fallbacks": true
    },
    "messages": [{"role": "user", "content": "Refactor this function."}]
  }'

sort: "price" orders providers cheapest-first, max_price (in dollars per million tokens) refuses anyone over your ceiling, and allow_fallbacks: true keeps the request alive if your top pick is down. (OpenRouter provider routing) This is the routing dimension people forget: provider failover and model fallback are two independent decisions. Provider failover steers around an outage on the same model; model fallback swaps to a different model. You usually want both.

Pattern 5: Cache-Aware Routing (the cheapest token is the one you skip)#

The biggest single line item in most agentic workloads is the static prefix you resend on every turn: the system prompt plus injected workspace files. Prompt caching lets the provider keep that prefix warm so cache hits are billed at a steep discount, and on Anthropic models served through OpenRouter that can cut input cost by roughly 90 percent on a hit. (OpenRouter prompt caching)

The routing implication is subtle but important: caching only pays off if subsequent requests land on the same provider that holds the warm cache. OpenRouter handles this with sticky routing: after a cached request, it remembers which provider served you and routes follow-ups for that model back to it. The takeaway for your config is to avoid fighting that stickiness. If you aggressively re-sort providers by price on every single turn of a long agent loop, you can route away from your own warm cache and pay full price on what should have been a cache hit. For long-lived sessions, set the cache breakpoints on your stable prefix and let the gateway keep you on one provider.

Python

# Anthropic-style cache_control on the stable prefix
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_SYSTEM_PROMPT + injected_repo_context,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
    },
    {"role": "user", "content": turn_specific_question},  # this part varies
]

Pair this with our DeepSeek cache-first agent notes if you are building loops where the same context is read many times.

Putting It Together: a Reference Gateway Config#

Here is a single LiteLLM proxy config that combines tiered model groups, fallbacks, and a cost-conscious default. Point your app at the proxy's OpenAI-compatible endpoint and call the tier names like models.

YAML

# litellm-config.yaml  -  unified routing gateway
model_list:
  - model_name: tier-simple
    litellm_params:
      model: openrouter/z-ai/glm-5.2-air
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-medium
    litellm_params:
      model: openrouter/deepseek/deepseek-v4
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-complex
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: tier-reasoning
    litellm_params:
      model: anthropic/claude-opus-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  # if a tier fails, climb to the next one up
  fallbacks:
    - {"tier-simple":  ["tier-medium"]}
    - {"tier-medium":  ["tier-complex"]}
    - {"tier-complex": ["tier-reasoning"]}
  # overflowing the cheap window? jump to a wide-context model
  context_window_fallbacks:
    - {"tier-simple": ["tier-complex"]}
  num_retries: 2

Your application code stays trivial. It asks for a tier; the gateway owns reliability, escalation, and provider selection:

Python

client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm")

answer = client.chat.completions.create(
    model="tier-simple",          # start cheap; the proxy climbs if it must
    messages=[{"role": "user", "content": task}],
).choices[0].message.content

When to Build vs. Buy#

Approach	Best when	Watch out for
Hardcoded fallback `models` array (OpenRouter)	You want reliability today with one config line	Does not pick cheaper models proactively
Self-classified tiers in app code	You want auditable, custom routing logic	You own the classifier and validator quality
LiteLLM proxy with tier groups	You run many apps and want one control plane	One more service to operate and monitor
Managed router (e.g. Factory Router)	You want escalation tuned for you	Less control; trust the vendor's quality bar

There is no single right answer. High-volume, well-understood pipelines reward custom tiering because you can tune heuristics to your exact traffic. Agentic coding tools, where task difficulty is wildly variable per session, are exactly where managed escalation earns its keep.

The Cost Math, Briefly#

Routing only matters because the price spread between tiers is enormous. Our GLM-5.2 cost math and DeepSeek V4 economics breakdowns show open-weights coding models landing at roughly one-sixth the per-token price of frontier models for comparable coding quality on real benchmarks. If 70 percent of your traffic is genuinely "simple" or "medium" work, routing that share off the frontier is the difference between a sustainable bill and a budget blowout.

Routing is a spend lever, but it is not a spend cap. A misrouted loop or a runaway agent can still burn money fast even on cheap models. Pair every routing config with hard ceilings, per-key budgets, and alerts as described in our spend guardrails playbook. Routing decides which model; guardrails decide when to stop.

The most practical next step is not to route everything. Pick one high-volume workflow, classify its requests into simple, medium, and complex, and measure how often each tier passes local checks. Then decide whether that belongs in app code, OpenRouter, LiteLLM, or a managed router.

FAQ#

What is model routing in plain terms?#

Model routing is the practice of choosing which AI model handles each request at runtime instead of hardcoding one model everywhere. The goal is to send cheap, simple work to cheap models and reserve expensive frontier models for the requests that genuinely need them.

Does routing hurt output quality?#

Not if you tier carefully and validate. The point of complexity tiering is that simple tasks like extraction or short summaries get identical results from a cheap model. Escalation patterns (Pattern 3) catch the genuine misroutes by climbing to a stronger model when a cheap one cannot pass your quality check.

Should I use OpenRouter, LiteLLM, or a managed router?#

Use OpenRouter's models array for the quickest reliability win, a self-hosted LiteLLM proxy when you want one auditable control plane across many apps, and a managed router like Factory Router when you want escalation logic tuned for you without building it. They are not mutually exclusive; many teams run LiteLLM in front of OpenRouter.

How much can routing actually save?#

It depends entirely on your traffic mix and the price spread between tiers. Factory reports its router cuts token spend 20 to 25 percent while holding frontier quality. Teams that route a majority of simple traffic to open-weights models, which run at roughly one-sixth of frontier per-token prices, see larger swings. The savings scale with the share of work that does not need a frontier model.

What is cache-aware routing?#

It is routing that keeps you on the same provider that holds your warm prompt cache. Caching the static prefix (system prompt plus injected files) can cut input costs dramatically, but only on a cache hit. If your router re-shuffles providers every turn, you route away from your own warm cache and lose the discount. Sticky routing avoids that.

Sources#

Last updated: June 24, 2026

Official Sources#

Source	What it covers
OpenRouter: Model Fallbacks	The `models` array and automatic model fallback behaviour
OpenRouter: Provider Routing	`provider` block fields: `sort`, `order`, `only`, `max_price`, `allow_fallbacks`
OpenRouter: Prompt Caching	Cache-aware sticky routing and cache discount behaviour
LiteLLM: Fallbacks	`router_settings.fallbacks` YAML syntax and fallback types
LiteLLM: Auto Routing	Complexity-based tier routing on the proxy
LiteLLM: Routing	Router configuration and model group behavior
OpenRouter: Provider Routing	Current provider routing concepts and provider selection controls
Vercel AI Gateway	Managed gateway pattern for model access, observability, and provider routing
Factory: Factory Router	Managed automatic routing for coding agents

For the tool-selection layer, pair this with LLM routers compared. For the infrastructure version, read Models.dev model routing and Envoy AI Gateway.

The Core Idea in One Sentence#

Most requests do not need your most expensive model, so route by the cheapest model that can do the job, and only escalate when it cannot.

The patterns below build up from the simplest useful thing to a full tiered gateway.

Pattern 1: Static Fallback Chain (the floor)#

OpenRouter exposes this as a models array on the request body. It walks the list in order and returns the first success.

Terminal

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [
      "deepseek/deepseek-v4",
      "z-ai/glm-5.2",
      "anthropic/claude-sonnet-latest"
    ],
    "messages": [{"role": "user", "content": "Summarize this changelog."}]
  }'

The LiteLLM equivalent lives in the proxy config, where fallbacks are a map from a primary model name to an ordered list of replacements:

YAML

# litellm-config.yaml
model_list:
  - model_name: deepseek-v4
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"deepseek-v4": ["claude-sonnet"]}]

Pattern 2: Tier by Task Complexity#

The cleanest mental model is three or four named tiers, each mapped to a model:

Tier	Use for	Example model
`simple`	Classification, extraction, short summaries, commit messages	an open-weights small model
`medium`	Standard codegen, refactors, structured drafting	DeepSeek V4 / GLM-5.2
`complex`	Multi-step reasoning, ambiguous specs, architecture	a frontier model
`reasoning`	Hard math, long-horizon planning, tricky debugging	a frontier reasoning model

LiteLLM's proxy supports complexity-based auto routing where you declare tiers and let the proxy score the request and pick one:

YAML

# litellm-config.yaml (tiered auto routing)
router_settings:
  complexity_router_config:
    tiers:
      simple:    glm-5.2-air
      medium:    deepseek-v4
      complex:   claude-sonnet
      reasoning: claude-opus

(LiteLLM auto routing)

Python

TIERS = {
    "simple":    "glm-5.2-air",
    "medium":    "deepseek/deepseek-v4",
    "complex":   "anthropic/claude-sonnet-latest",
    "reasoning": "anthropic/claude-opus-latest",
}

def classify(task: str) -> str:
    """Use the cheapest model to bucket the task. One token of output."""
    rubric = (
        "Reply with exactly one word: simple, medium, complex, or reasoning. "
        "simple = extraction/classification/short summary. "
        "medium = standard codegen or refactor. "
        "complex = ambiguous multi-step work. "
        "reasoning = hard math, planning, or subtle debugging.\n\n"
        f"TASK:\n{task}"
    )
    resp = client.chat.completions.create(
        model=TIERS["simple"],
        messages=[{"role": "user", "content": rubric}],
        max_tokens=1,
    )
    tier = resp.choices[0].message.content.strip().lower()
    return tier if tier in TIERS else "medium"  # safe default

def route(task: str) -> str:
    tier = classify(task)
    return client.chat.completions.create(
        model=TIERS[tier],
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

From the archive

Pattern 3: Tier With Escalation (failover that climbs)#

Python

LADDER = ["medium", "complex", "reasoning"]

def is_good_enough(task: str, answer: str) -> bool:
    """Cheap validator: schema check, test run, or a tiny LLM judge."""
    if not answer or len(answer) < 10:
        return False
    # e.g. for codegen: run the generated tests; for JSON: validate the schema
    return passes_local_checks(answer)

def route_with_escalation(task: str, start: str = "medium") -> str:
    start_idx = LADDER.index(start) if start in LADDER else 0
    for tier in LADDER[start_idx:]:
        answer = call_model(TIERS[tier], task)
        if is_good_enough(task, answer):
            return answer
    return answer  # exhausted the ladder, return best effort

Pattern 4: Cost-Ceiling and Provider Routing#

Terminal

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "z-ai/glm-5.2",
    "provider": {
      "sort": "price",
      "max_price": { "prompt": 1, "completion": 3 },
      "allow_fallbacks": true
    },
    "messages": [{"role": "user", "content": "Refactor this function."}]
  }'

Pattern 5: Cache-Aware Routing (the cheapest token is the one you skip)#

Python

# Anthropic-style cache_control on the stable prefix
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": LARGE_STATIC_SYSTEM_PROMPT + injected_repo_context,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
    },
    {"role": "user", "content": turn_specific_question},  # this part varies
]

Pair this with our DeepSeek cache-first agent notes if you are building loops where the same context is read many times.

Putting It Together: a Reference Gateway Config#

YAML

# litellm-config.yaml  -  unified routing gateway
model_list:
  - model_name: tier-simple
    litellm_params:
      model: openrouter/z-ai/glm-5.2-air
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-medium
    litellm_params:
      model: openrouter/deepseek/deepseek-v4
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: tier-complex
    litellm_params:
      model: anthropic/claude-sonnet-latest
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: tier-reasoning
    litellm_params:
      model: anthropic/claude-opus-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  # if a tier fails, climb to the next one up
  fallbacks:
    - {"tier-simple":  ["tier-medium"]}
    - {"tier-medium":  ["tier-complex"]}
    - {"tier-complex": ["tier-reasoning"]}
  # overflowing the cheap window? jump to a wide-context model
  context_window_fallbacks:
    - {"tier-simple": ["tier-complex"]}
  num_retries: 2

Your application code stays trivial. It asks for a tier; the gateway owns reliability, escalation, and provider selection:

Python

client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm")

answer = client.chat.completions.create(
    model="tier-simple",          # start cheap; the proxy climbs if it must
    messages=[{"role": "user", "content": task}],
).choices[0].message.content

When to Build vs. Buy#

Approach	Best when	Watch out for
Hardcoded fallback `models` array (OpenRouter)	You want reliability today with one config line	Does not pick cheaper models proactively
Self-classified tiers in app code	You want auditable, custom routing logic	You own the classifier and validator quality
LiteLLM proxy with tier groups	You run many apps and want one control plane	One more service to operate and monitor
Managed router (e.g. Factory Router)	You want escalation tuned for you	Less control; trust the vendor's quality bar

Official Sources#

The Core Idea in One Sentence#

Pattern 1: Static Fallback Chain (the floor)#

Pattern 2: Tier by Task Complexity#

Omnigent: Databricks' Meta-Harness for Orchestrating Claude Code, Codex, and Custom Agents

Codex Gets Computer Use in the EU - and a Clean Claude Code Import

'The Orchestration Is the Product': What Perplexity's Aravind Srinivas Sees That the Model Labs Don't

RFC 10008: The New HTTP QUERY Method Explained

Pattern 3: Tier With Escalation (failover that climbs)#

Pattern 4: Cost-Ceiling and Provider Routing#

Pattern 5: Cache-Aware Routing (the cheapest token is the one you skip)#

Putting It Together: a Reference Gateway Config#

When to Build vs. Buy#

The Cost Math, Briefly#

FAQ#

What is model routing in plain terms?#

Does routing hurt output quality?#

Should I use OpenRouter, LiteLLM, or a managed router?#

How much can routing actually save?#

What is cache-aware routing?#

Sources#

LLM Routers Compared: LiteLLM vs Portkey vs OpenRouter in 2026

Models.dev Makes Model Routing Feel Like Infrastructure

AI Model Routing: Why the Orchestration Layer Is the Next Big Play Next to the Labs

Related Tools

OpenRouter

MCP Hub

Smithery

Glama

Apps from Developers Digest

modelpick

AI Model Router

Related Guides

Skill Frontmatter - Claude Code

Subagent Frontmatter - Claude Code

Claude Code Setup Guide

Related Videos

Not Diamond: AI Model Routing in 11 Minutes

Firecrawl for Internet-Enabled LLM Responses with Model Routing

Related Posts

LLM Routers Compared: LiteLLM vs Portkey vs OpenRouter in 2026

Models.dev Makes Model Routing Feel Like Infrastructure

AI Model Routing: Why the Orchestration Layer Is the Next Big Play Next to the Labs

Envoy AI Gateway 1.0 Makes LLM Routing an Infrastructure Decision

OpenRouter in 2026: Review, Setup, and When Model Routing Pays

GLM-5.2 Cost Math: When Open-Weights Coding Models Actually Save You Money

Build with the member tools

Get Smarter About AI Dev

Official Sources#

The Core Idea in One Sentence#

Pattern 1: Static Fallback Chain (the floor)#

Pattern 2: Tier by Task Complexity#

Omnigent: Databricks' Meta-Harness for Orchestrating Claude Code, Codex, and Custom Agents

Codex Gets Computer Use in the EU - and a Clean Claude Code Import

'The Orchestration Is the Product': What Perplexity's Aravind Srinivas Sees That the Model Labs Don't

RFC 10008: The New HTTP QUERY Method Explained

Pattern 3: Tier With Escalation (failover that climbs)#

Pattern 4: Cost-Ceiling and Provider Routing#

Pattern 5: Cache-Aware Routing (the cheapest token is the one you skip)#

Putting It Together: a Reference Gateway Config#

When to Build vs. Buy#

The Cost Math, Briefly#

FAQ#

What is model routing in plain terms?#

Does routing hurt output quality?#

Should I use OpenRouter, LiteLLM, or a managed router?#

How much can routing actually save?#

What is cache-aware routing?#

Sources#

LLM Routers Compared: LiteLLM vs Portkey vs OpenRouter in 2026

Models.dev Makes Model Routing Feel Like Infrastructure

AI Model Routing: Why the Orchestration Layer Is the Next Big Play Next to the Labs

Related Tools

OpenRouter

MCP Hub

Smithery

Glama

Apps from Developers Digest

modelpick

AI Model Router