
TL;DR
Claude Code does not have to call Anthropic's API. Here are five working patterns for running it through your own gateway, on your own models, in your own VPC, with full audit logs and cost control.
| Official Sources | |
|---|---|
| Claude Code Overview | Official Anthropic documentation for Claude Code |
| Claude Code Bedrock/Vertex Configuration | Official docs for AWS Bedrock and Google Vertex AI integration |
| LiteLLM GitHub Repository | Open-source proxy for routing to 100+ LLM providers |
| Claude Code Router GitHub | Community project for routing Claude Code to alternative providers |
| AWS Bedrock Claude Documentation | AWS documentation for Claude models on Bedrock |
| Google Cloud Vertex AI Claude | Google Cloud documentation for Claude on Vertex AI |
The default story for Claude Code is simple: install the CLI, log in with an Anthropic account, and your prompts go straight to api.anthropic.com. That works for individual developers. It does not work for regulated teams, enterprises with strict data residency rules, or anyone who wants to mix Claude with cheaper open-source models without paying retail rates on every token.
Good news: Claude Code is much more open than people realize. The CLI talks to whatever endpoint you point it at, as long as the wire protocol matches Anthropic's Messages API. That single fact unlocks a surprising amount of architectural flexibility. This post walks through five concrete patterns for self-hosting Claude Code on your own infrastructure, from a five-minute LiteLLM proxy on a laptop to a full enterprise gateway with audit logs and SSO.
If you have been running Claude Code at scale, you have probably already hit the usage limits playbook wall. These patterns are the next step.
Claude Code reads two environment variables that change the entire request path:
# Override the API endpoint
export ANTHROPIC_BASE_URL="https://your-gateway.example.com"
# Override the auth token
export ANTHROPIC_AUTH_TOKEN="your-internal-token"
If your gateway speaks the Anthropic Messages API on the wire, Claude Code will not know the difference. This is the foundation of every pattern below.
There is also ANTHROPIC_MODEL for forcing a specific model name and a set of network variables (HTTPS_PROXY, NODE_EXTRA_CA_CERTS) for corporate proxies and custom certificate authorities. The Anthropic documentation calls this enterprise network configuration but it works anywhere.
The simplest pattern. You run LiteLLM as a local proxy on port 4000, point Claude Code at it, and route requests to whatever provider you want behind the scenes. It takes about five minutes to set up.
# litellm-config.yaml
model_list:
- model_name: claude-sonnet-4-7
litellm_params:
model: anthropic/claude-sonnet-4-7
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: claude-haiku-4-7
litellm_params:
model: bedrock/anthropic.claude-haiku-4-7
aws_region_name: us-east-1
- model_name: gpt-5-3
litellm_params:
model: openai/gpt-5.3
api_key: os.environ/OPENAI_API_KEY
router_settings:
routing_strategy: simple-shuffle
general_settings:
master_key: sk-internal-team-key
Run it:
litellm --config litellm-config.yaml --port 4000
Point Claude Code at it:
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-internal-team-key"
claude
You now have a proxy that logs every request, enforces budget limits per virtual key, and can fall back across providers when one is rate-limited. Same Claude Code experience, full visibility into what your team is sending.
This pattern is great for individual developers and small teams. It does not give you SSO or audit logs that auditors will accept, but it solves the cost-tracking problem for under an hour of setup.
If you cannot send code to Anthropic directly because of compliance, you have two options that already speak Claude: AWS Bedrock and Google Vertex AI. Both host the same Claude models and route everything through your existing cloud account.
For Bedrock:
export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION="us-east-1"
export ANTHROPIC_MODEL="us.anthropic.claude-sonnet-4-7-v1:0"
export ANTHROPIC_SMALL_FAST_MODEL="us.anthropic.claude-haiku-4-7-v1:0"
claude
For Vertex:
export CLAUDE_CODE_USE_VERTEX=1
export CLOUD_ML_REGION="us-east5"
export ANTHROPIC_VERTEX_PROJECT_ID="your-gcp-project"
export ANTHROPIC_MODEL="claude-sonnet-4-7@20260301"
claude
Claude Code knows about these flags natively. Authentication uses your existing AWS or GCP credentials, all logs flow into CloudTrail or Cloud Audit Logs, and the data never leaves your cloud account boundary. For most enterprise compliance requirements this is the cleanest answer.
The tradeoff: Bedrock and Vertex sometimes lag behind direct Anthropic on new model releases by a few weeks, and prompt caching support has historically been spottier. Test before committing.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Apr 29, 2026 • 12 min read
Apr 29, 2026 • 12 min read
Apr 29, 2026 • 12 min read
Apr 29, 2026 • 10 min read
For organizations that need centralized identity, audit logs, and per-developer attribution, the right pattern is a self-hosted gateway behind Identity-Aware Proxy. The high-level architecture:
[Developer machine]
-> Local proxy (Claude Code calls this)
-> [Identity-Aware Proxy] (Google Workspace SSO)
-> [FastAPI gateway on Cloud Run]
-> Anthropic API or Bedrock
The local proxy is a tiny piece of software running on the developer's laptop that intercepts Claude Code's API calls, fetches a fresh OIDC token from gcloud, and forwards the request to the company gateway with Authorization: Bearer <id-token>. IAP validates the token, confirms the user is in the right Google Workspace group, and forwards to your FastAPI service. Your service logs the request, attaches the user identity, and proxies to Anthropic.
The skeleton of the gateway service:
# gateway.py
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
import httpx
import os
app = FastAPI()
ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]
@app.post("/v1/messages")
async def messages(request: Request):
user = request.headers.get("X-Goog-Authenticated-User-Email")
if not user:
raise HTTPException(401, "missing identity")
body = await request.body()
# Log who, when, what model, token estimate
log_request(user=user, body=body)
# Forward to Anthropic, streaming back to the client
headers = {
"x-api-key": ANTHROPIC_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
async def upstream():
async with httpx.AsyncClient(timeout=None) as client:
async with client.stream(
"POST",
"https://api.anthropic.com/v1/messages",
content=body,
headers=headers,
) as r:
async for chunk in r.aiter_raw():
yield chunk
return StreamingResponse(upstream(), media_type="text/event-stream")
Every developer sets ANTHROPIC_BASE_URL to the gateway and authenticates via SSO. You get a single audit log of every prompt anyone in the company sent, attributable to a specific identity. When someone leaves the company, removing them from the Workspace group revokes their access immediately. No scattered API keys to rotate.
This is the pattern that makes Claude Code viable in regulated industries. Build it once, every developer benefits.
You do not have to use Anthropic models with Claude Code. The open-source Claude Code Router project translates between Claude's wire format and any other provider, including local Ollama models, OpenRouter, Groq, DeepSeek, and Together.
Install and configure:
npm install -g @musistudio/claude-code-router
# ~/.claude-code-router/config.json
{
"Providers": [
{
"name": "ollama",
"api_base_url": "http://localhost:11434/v1/chat/completions",
"models": ["qwen3.5-coder:35b", "deepseek-coder:33b"]
},
{
"name": "openrouter",
"api_base_url": "https://openrouter.ai/api/v1/chat/completions",
"api_key": "$OPENROUTER_API_KEY",
"models": ["anthropic/claude-sonnet-4-7", "google/gemini-2.5-pro"]
}
],
"Router": {
"default": "ollama,qwen3.5-coder:35b",
"background": "ollama,qwen3.5-coder:35b",
"think": "openrouter,anthropic/claude-sonnet-4-7",
"longContext": "openrouter,anthropic/claude-sonnet-4-7"
}
}Run Claude Code through the router:
ccr code
The router routes "thinking" tasks to Claude Sonnet on OpenRouter and routine tasks to a local Qwen model on Ollama. You pay nothing for the bulk of your tokens, get frontier-quality reasoning when you need it, and your code never leaves your laptop for the local-only routes.
This is the budget-conscious pattern. We documented the full setup in our comparison of every AI coding tool's economics, and it pairs well with cheap GPU rentals if your laptop is not powerful enough to run a 35B model locally.
The most extreme version. You run an open-weight coding model on your own GPUs, expose an Anthropic-compatible endpoint, and Claude Code never touches the public internet. This is what defense, healthcare, and certain financial customers require.
Stack:
Minimal docker compose:
services:
vllm:
image: vllm/vllm-openai:latest
command:
- --model=Qwen/Qwen3.5-Coder-32B-Instruct
- --max-model-len=131072
- --tensor-parallel-size=2
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
ports:
- "8000:8000"
litellm:
image: ghcr.io/berriai/litellm:main-latest
environment:
LITELLM_MASTER_KEY: sk-internal
volumes:
- ./litellm-config.yaml:/app/config.yaml
command: ["--config", "/app/config.yaml", "--port", "4000"]
ports:
- "4000:4000"
Developers connect like this:
export ANTHROPIC_BASE_URL="https://internal-claude.corp.example.com"
export ANTHROPIC_AUTH_TOKEN="$INTERNAL_TOKEN"
export ANTHROPIC_MODEL="qwen3.5-coder-32b"
claude
You give up some quality. Qwen3.5 and DeepSeek are excellent but not Sonnet 4.7. For most refactors, test writing, and routine feature work they are good enough. For the hard 10 percent of problems, route to the gateway pattern above when policy allows.
This pattern also pairs well with building multi-agent workflows in Claude Code, because cheap local inference makes fan-out architectures economical that would be cost-prohibitive against the public API.
For a walkthrough of the LiteLLM and Claude Code Router patterns running side by side on a single laptop, with cost dashboards and live token streaming:
Subscribe to Developers Digest for the rest of the self-hosting series.
A simple decision matrix:
| Need | Pattern |
|---|---|
| Just want cost tracking and team budgets | LiteLLM proxy (Pattern 1) |
| Compliance, no Anthropic API direct, AWS or GCP shop | Bedrock or Vertex (Pattern 2) |
| Centralized identity, audit logs, SSO for the whole org | Enterprise gateway with IAP (Pattern 3) |
| Want to slash costs by routing easy tasks to local models | Claude Code Router (Pattern 4) |
| Air-gapped, cannot send code anywhere external | Self-hosted GPUs with vLLM (Pattern 5) |
Most teams should start with Pattern 1. It is reversible, ships in an afternoon, and tells you whether your usage justifies the more invasive patterns. The teams that need Pattern 5 already know they need it; the rest are doing premature optimization.
The reason these patterns exist is that Anthropic made a deliberate decision to keep Claude Code's wire protocol portable. The CLI is opinionated about how it works on your machine - the sub-agent system, the hooks, the worktree integration - but completely agnostic about which backend serves the model. That separation is rare among AI coding tools.
It also means the cost ceiling on Claude Code is a lot lower than it appears. The retail price assumes everything goes to the public API. With the patterns above, real-world team costs come down by 40 to 90 percent depending on how aggressive you are about routing, with no change to the developer experience.
If you are evaluating AI coding tools for an organization, Claude Code's self-hosting story is not a sidebar. It is one of the strongest arguments for picking it over the alternatives. Pair it with our full 2026 comparison matrix when you make the case to your platform team.
Yes. Claude Code reads the ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN environment variables to determine where to send requests. Any backend that speaks the Anthropic Messages API wire protocol will work, including LiteLLM proxies, AWS Bedrock, Google Vertex AI, or your own gateway service.
LiteLLM proxy running locally on port 4000. It takes about five minutes to set up, logs every request, enforces budget limits per virtual key, and can fall back across providers when one is rate-limited. Point Claude Code at it with export ANTHROPIC_BASE_URL="http://localhost:4000" and you have full visibility into team usage.
Set three environment variables: CLAUDE_CODE_USE_BEDROCK=1, AWS_REGION="us-east-1", and ANTHROPIC_MODEL="us.anthropic.claude-sonnet-4-7-v1:0". Claude Code will authenticate using your existing AWS credentials and route all traffic through Bedrock. All logs flow into CloudTrail and data never leaves your AWS account boundary.
Yes, using the Claude Code Router project. It translates between Claude's wire format and any other provider, including local Ollama models, OpenRouter, Groq, and DeepSeek. You can configure it to route routine tasks to cheap local models and only use frontier models for complex reasoning tasks, cutting costs by 40 to 90 percent.
A stack of open-weight coding models like Qwen3.5-Coder-32B served via vLLM, plus LiteLLM in proxy mode to translate the wire format. All traffic stays inside your VPC with no egress to api.anthropic.com required. You give up some quality compared to Sonnet 4.7, but for most routine coding tasks the open models are good enough.
Deploy a gateway service behind Identity-Aware Proxy (or your organization's equivalent). Developers run a local proxy that attaches OIDC tokens to requests, IAP validates identity against your SSO provider, and your gateway logs every prompt with the authenticated user identity. This gives you centralized audit logs and instant access revocation when someone leaves the company.
It depends on the backend. Direct Anthropic API has full prompt caching support. AWS Bedrock and Google Vertex AI have historically been spottier on prompt caching, sometimes lagging behind new features by a few weeks. Test your specific use case before committing to a backend if prompt caching is critical to your cost model.
Most teams should start with LiteLLM proxy (Pattern 1). It ships in an afternoon, is fully reversible, and tells you whether your usage justifies the more complex patterns. Teams that need air-gapped deployments already know they need them. Everyone else is likely doing premature optimization by jumping straight to enterprise gateway architectures.
Read next
A practical operational guide to Claude Code usage limits in 2026: plan behavior, API key pitfalls, routing choices, and team controls using hooks and subagents.
9 min readHow to use Claude Code's Task tool, custom sub-agents, and worktrees to run parallel development workflows. Real prompt examples, agent configurations, and workflow patterns from daily use.
11 min read12 AI coding tools across 4 architecture types, compared on pricing, strengths, weaknesses, and best use cases. The definitive comparison matrix for 2026.
15 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Anthropic's agentic coding CLI. Runs in your terminal, edits files autonomously, spawns sub-agents, and maintains memory...
View ToolInteractive TUI dashboard that shows exactly where your Claude Code and Cursor tokens are going, in real time.
View ToolAnthropic's flagship reasoning model. Best-in-class for coding, long-context analysis, and agentic workflows. 1M token c...
View ToolMac app for running parallel Claude Code, Codex, and Cursor agents in isolated workspaces. Watch every agent work at onc...
View ToolUnlock pro skills and share private collections with your team.
View AppCatch broken SKILL.md files in CI before they hit your team.
View AppEvery coding agent in one window. Stop alt-tabbing between Claude, Codex, and Cursor.
View AppInstall Claude Code, configure your first project, and start shipping code with AI in under 5 minutes.
Getting StartedA practical walk-through of how to design, write, and ship a Claude Code skill - from choosing when to trigger, through allowed-tools, to the steps the agent will actually follow.
Getting StartedA concrete step-by-step guide to moving your development workflow from Cursor to Claude Code - settings, rules, keybindings, and the habits that transfer.
Getting Started
Open Design: Open-Source n8n App That Turns Any Website into a Brand Kit, Design System, HTML + Images The video introduces Open Design, an MIT-licensed full-stack template that combines AI and n8n a...

Nimbalyst Demo: A Visual Workspace for Codex + Claude Code with Kanban, Plans, and AI Commits Try it: https://nimbalyst.com/ Star Repo Here: https://github.com/Nimbalyst/nimbalyst This video demos N...

Composio: Connect AI Agents to 1,000+ Apps via CLI (Gmail, Google Docs/Sheets, Hacker News Workflows) Check out Composio here: http://dashboard.composio.dev/?utm_source=Youtube&utm_channel=0426&utm_...

A practical operational guide to Claude Code usage limits in 2026: plan behavior, API key pitfalls, routing choices, and...

How to use Claude Code's Task tool, custom sub-agents, and worktrees to run parallel development workflows. Real prompt...

12 AI coding tools across 4 architecture types, compared on pricing, strengths, weaknesses, and best use cases. The defi...

31 deployed apps. 7 down. Favicons missing on 20 of 24 reachable hosts. Sentry on zero. Here is how a single audit turne...

How we ported 38 apps off Replit and onto Coolify in a single day, using parallel Claude Code subagents, gh, and neonctl...

One dev, one CLI, 24 subdomains, and a lot of parallel agents. The playbook for shipping an AI app portfolio.

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.