Self-Hosting AI Agents: 5 Ways to Run Claude Code on Your Own Infra

The default story for Claude Code is simple: install the CLI, log in with an Anthropic account, and your prompts go straight to api.anthropic.com. That works for individual developers. It does not work for regulated teams, enterprises with strict data residency rules, or anyone who wants to mix Claude with cheaper open-source models without paying retail rates on every token.

Good news: Claude Code is much more open than people realize. The CLI talks to whatever endpoint you point it at, as long as the wire protocol matches Anthropic's Messages API. That single fact unlocks a surprising amount of architectural flexibility. This post walks through five concrete patterns for self-hosting Claude Code on your own infrastructure, from a five-minute LiteLLM proxy on a laptop to a full enterprise gateway with audit logs and SSO.

If you have been running Claude Code at scale, you have probably already hit the usage limits playbook wall. These patterns are the next step.

How Claude Code Talks to a Backend

Claude Code reads two environment variables that change the entire request path:

# Override the API endpoint
export ANTHROPIC_BASE_URL="https://your-gateway.example.com"

# Override the auth token
export ANTHROPIC_AUTH_TOKEN="your-internal-token"

If your gateway speaks the Anthropic Messages API on the wire, Claude Code will not know the difference. This is the foundation of every pattern below.

There is also ANTHROPIC_MODEL for forcing a specific model name and a set of network variables (HTTPS_PROXY, NODE_EXTRA_CA_CERTS) for corporate proxies and custom certificate authorities. The Anthropic documentation calls this enterprise network configuration but it works anywhere.

Pattern 1: LiteLLM Proxy for Local Routing

The simplest pattern. You run LiteLLM as a local proxy on port 4000, point Claude Code at it, and route requests to whatever provider you want behind the scenes. It takes about five minutes to set up.

# litellm-config.yaml
model_list:
  - model_name: claude-sonnet-4-7
    litellm_params:
      model: anthropic/claude-sonnet-4-7
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-haiku-4-7
    litellm_params:
      model: bedrock/anthropic.claude-haiku-4-7
      aws_region_name: us-east-1

  - model_name: gpt-5-3
    litellm_params:
      model: openai/gpt-5.3
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: simple-shuffle

general_settings:
  master_key: sk-internal-team-key

Run it:

litellm --config litellm-config.yaml --port 4000

Point Claude Code at it:

export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-internal-team-key"
claude

You now have a proxy that logs every request, enforces budget limits per virtual key, and can fall back across providers when one is rate-limited. Same Claude Code experience, full visibility into what your team is sending.

This pattern is great for individual developers and small teams. It does not give you SSO or audit logs that auditors will accept, but it solves the cost-tracking problem for under an hour of setup.

Pattern 2: Bedrock and Vertex for Compliance

If you cannot send code to Anthropic directly because of compliance, you have two options that already speak Claude: AWS Bedrock and Google Vertex AI. Both host the same Claude models and route everything through your existing cloud account.

For Bedrock:

export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION="us-east-1"
export ANTHROPIC_MODEL="us.anthropic.claude-sonnet-4-7-v1:0"
export ANTHROPIC_SMALL_FAST_MODEL="us.anthropic.claude-haiku-4-7-v1:0"
claude

For Vertex:

export CLAUDE_CODE_USE_VERTEX=1
export CLOUD_ML_REGION="us-east5"
export ANTHROPIC_VERTEX_PROJECT_ID="your-gcp-project"
export ANTHROPIC_MODEL="claude-sonnet-4-7@20260301"
claude

Claude Code knows about these flags natively. Authentication uses your existing AWS or GCP credentials, all logs flow into CloudTrail or Cloud Audit Logs, and the data never leaves your cloud account boundary. For most enterprise compliance requirements this is the cleanest answer.

The tradeoff: Bedrock and Vertex sometimes lag behind direct Anthropic on new model releases by a few weeks, and prompt caching support has historically been spottier. Test before committing.

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.

From the archive

Shipping OpenAI Symphony in Prod: A Real-World Guide

Apr 29, 2026 • 12 min read

Tool Use in the Claude API: Production Patterns for Reliable Agents

Apr 29, 2026 • 12 min read

Vercel's Agentic Infrastructure Stack Explained

Apr 29, 2026 • 12 min read

Vercel's New Durable Execution Programming Model: A Developer's Guide

Apr 29, 2026 • 10 min read

Pattern 3: Enterprise Gateway with IAP

For organizations that need centralized identity, audit logs, and per-developer attribution, the right pattern is a self-hosted gateway behind Identity-Aware Proxy. The high-level architecture:

[Developer machine]
  -> Local proxy (Claude Code calls this)
  -> [Identity-Aware Proxy] (Google Workspace SSO)
  -> [FastAPI gateway on Cloud Run]
  -> Anthropic API or Bedrock

The local proxy is a tiny piece of software running on the developer's laptop that intercepts Claude Code's API calls, fetches a fresh OIDC token from gcloud, and forwards the request to the company gateway with Authorization: Bearer <id-token>. IAP validates the token, confirms the user is in the right Google Workspace group, and forwards to your FastAPI service. Your service logs the request, attaches the user identity, and proxies to Anthropic.

The skeleton of the gateway service:

# gateway.py
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
import httpx
import os

app = FastAPI()
ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]

@app.post("/v1/messages")
async def messages(request: Request):
    user = request.headers.get("X-Goog-Authenticated-User-Email")
    if not user:
        raise HTTPException(401, "missing identity")

    body = await request.body()

    # Log who, when, what model, token estimate
    log_request(user=user, body=body)

    # Forward to Anthropic, streaming back to the client
    headers = {
        "x-api-key": ANTHROPIC_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }

    async def upstream():
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream(
                "POST",
                "https://api.anthropic.com/v1/messages",
                content=body,
                headers=headers,
            ) as r:
                async for chunk in r.aiter_raw():
                    yield chunk

    return StreamingResponse(upstream(), media_type="text/event-stream")

Every developer sets ANTHROPIC_BASE_URL to the gateway and authenticates via SSO. You get a single audit log of every prompt anyone in the company sent, attributable to a specific identity. When someone leaves the company, removing them from the Workspace group revokes their access immediately. No scattered API keys to rotate.

This is the pattern that makes Claude Code viable in regulated industries. Build it once, every developer benefits.

Pattern 4: Open-Source Models via Claude Code Router

You do not have to use Anthropic models with Claude Code. The open-source Claude Code Router project translates between Claude's wire format and any other provider, including local Ollama models, OpenRouter, Groq, DeepSeek, and Together.

Install and configure:

npm install -g @musistudio/claude-code-router

# ~/.claude-code-router/config.json
{
  "Providers": [
    {
      "name": "ollama",
      "api_base_url": "http://localhost:11434/v1/chat/completions",
      "models": ["qwen3.5-coder:35b", "deepseek-coder:33b"]
    },
    {
      "name": "openrouter",
      "api_base_url": "https://openrouter.ai/api/v1/chat/completions",
      "api_key": "$OPENROUTER_API_KEY",
      "models": ["anthropic/claude-sonnet-4-7", "google/gemini-2.5-pro"]
    }
  ],
  "Router": {
    "default": "ollama,qwen3.5-coder:35b",
    "background": "ollama,qwen3.5-coder:35b",
    "think": "openrouter,anthropic/claude-sonnet-4-7",
    "longContext": "openrouter,anthropic/claude-sonnet-4-7"
  }
}

Run Claude Code through the router:

ccr code

The router routes "thinking" tasks to Claude Sonnet on OpenRouter and routine tasks to a local Qwen model on Ollama. You pay nothing for the bulk of your tokens, get frontier-quality reasoning when you need it, and your code never leaves your laptop for the local-only routes.

This is the budget-conscious pattern. We documented the full setup in our comparison of every AI coding tool's economics, and it pairs well with cheap GPU rentals if your laptop is not powerful enough to run a 35B model locally.

Pattern 5: Air-Gapped on Your Own GPUs

The most extreme version. You run an open-weight coding model on your own GPUs, expose an Anthropic-compatible endpoint, and Claude Code never touches the public internet. This is what defense, healthcare, and certain financial customers require.

Stack:

Model: Qwen3.5-Coder-32B or DeepSeek-Coder-33B, served via vLLM
Adapter: LiteLLM in proxy mode, configured to translate Anthropic format to OpenAI format
Network: All traffic stays inside the VPC, no egress rules to api.anthropic.com required

Minimal docker compose:

services:
  vllm:
    image: vllm/vllm-openai:latest
    command:
      - --model=Qwen/Qwen3.5-Coder-32B-Instruct
      - --max-model-len=131072
      - --tensor-parallel-size=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "8000:8000"

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    environment:
      LITELLM_MASTER_KEY: sk-internal
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    ports:
      - "4000:4000"

Developers connect like this:

export ANTHROPIC_BASE_URL="https://internal-claude.corp.example.com"
export ANTHROPIC_AUTH_TOKEN="$INTERNAL_TOKEN"
export ANTHROPIC_MODEL="qwen3.5-coder-32b"
claude

You give up some quality. Qwen3.5 and DeepSeek are excellent but not Sonnet 4.7. For most refactors, test writing, and routine feature work they are good enough. For the hard 10 percent of problems, route to the gateway pattern above when policy allows.

This pattern also pairs well with building multi-agent workflows in Claude Code, because cheap local inference makes fan-out architectures economical that would be cost-prohibitive against the public API.

Watch It Built End to End

For a walkthrough of the LiteLLM and Claude Code Router patterns running side by side on a single laptop, with cost dashboards and live token streaming:

Subscribe to Developers Digest for the rest of the self-hosting series.

What to Pick

A simple decision matrix:

Need	Pattern
Just want cost tracking and team budgets	LiteLLM proxy (Pattern 1)
Compliance, no Anthropic API direct, AWS or GCP shop	Bedrock or Vertex (Pattern 2)
Centralized identity, audit logs, SSO for the whole org	Enterprise gateway with IAP (Pattern 3)
Want to slash costs by routing easy tasks to local models	Claude Code Router (Pattern 4)
Air-gapped, cannot send code anywhere external	Self-hosted GPUs with vLLM (Pattern 5)

Most teams should start with Pattern 1. It is reversible, ships in an afternoon, and tells you whether your usage justifies the more invasive patterns. The teams that need Pattern 5 already know they need it; the rest are doing premature optimization.

The Bigger Picture

The reason these patterns exist is that Anthropic made a deliberate decision to keep Claude Code's wire protocol portable. The CLI is opinionated about how it works on your machine - the sub-agent system, the hooks, the worktree integration - but completely agnostic about which backend serves the model. That separation is rare among AI coding tools.

It also means the cost ceiling on Claude Code is a lot lower than it appears. The retail price assumes everything goes to the public API. With the patterns above, real-world team costs come down by 40 to 90 percent depending on how aggressive you are about routing, with no change to the developer experience.

If you are evaluating AI coding tools for an organization, Claude Code's self-hosting story is not a sidebar. It is one of the strongest arguments for picking it over the alternatives. Pair it with our full 2026 comparison matrix when you make the case to your platform team.

Claude Code Usage Limits in 2026: The Practical Playbook for Pro and Max Teams

Building Multi-Agent Workflows in Claude Code: A Practical Tutorial

Every AI Coding Tool Compared: The 2026 Matrix

How Claude Code Talks to a Backend

Pattern 1: LiteLLM Proxy for Local Routing

Pattern 2: Bedrock and Vertex for Compliance

Shipping OpenAI Symphony in Prod: A Real-World Guide

Tool Use in the Claude API: Production Patterns for Reliable Agents

Vercel's Agentic Infrastructure Stack Explained

Vercel's New Durable Execution Programming Model: A Developer's Guide

Pattern 3: Enterprise Gateway with IAP

Pattern 4: Open-Source Models via Claude Code Router

Pattern 5: Air-Gapped on Your Own GPUs

Watch It Built End to End

What to Pick

The Bigger Picture

Comments

Related Tools

Claude Code

Codeburn

Claude Opus 4.7

Cursor

Apps from Developers Digest

Hooks Directory

SkillForge CI

Agent Hub

Related Guides

Getting Started with Claude Code

Writing Your First Claude Code Skill

Migrating from Cursor to Claude Code

Related Videos

Nimbalyst: The Open-Source Visual Workspace for Building with Codex and Claude Code

Composio: Connect OpenClaw & Claude Code to 1,000+ Apps via CLI

Claude Code Channels in 8 Minutes

Related Posts

Claude Code Usage Limits in 2026: The Practical Playbook for Pro and Max Teams

Building Multi-Agent Workflows in Claude Code: A Practical Tutorial

Every AI Coding Tool Compared: The 2026 Matrix

How I'm Building 24 AI-Powered Apps in Parallel

Karpathy CLAUDE.md Skills: Use the Viral Rules as a Menu, Not a Template

The 98% Context Reduction Pattern

Get Smarter About AI Dev

Claude Code Usage Limits in 2026: The Practical Playbook for Pro and Max Teams

Building Multi-Agent Workflows in Claude Code: A Practical Tutorial

Every AI Coding Tool Compared: The 2026 Matrix

How Claude Code Talks to a Backend

Pattern 1: LiteLLM Proxy for Local Routing

Pattern 2: Bedrock and Vertex for Compliance

Shipping OpenAI Symphony in Prod: A Real-World Guide

Tool Use in the Claude API: Production Patterns for Reliable Agents

Vercel's Agentic Infrastructure Stack Explained

Vercel's New Durable Execution Programming Model: A Developer's Guide

Pattern 3: Enterprise Gateway with IAP

Pattern 4: Open-Source Models via Claude Code Router

Pattern 5: Air-Gapped on Your Own GPUs

Watch It Built End to End

What to Pick

The Bigger Picture

Comments

Related Tools

Claude Code

Codeburn

Claude Opus 4.7

Cursor

Apps from Developers Digest

Hooks Directory

SkillForge CI

Agent Hub

Related Guides

Getting Started with Claude Code

Writing Your First Claude Code Skill

Migrating from Cursor to Claude Code

Related Videos

Nimbalyst: The Open-Source Visual Workspace for Building with Codex and Claude Code

Composio: Connect OpenClaw & Claude Code to 1,000+ Apps via CLI

Claude Code Channels in 8 Minutes

Related Posts

Claude Code Usage Limits in 2026: The Practical Playbook for Pro and Max Teams

Building Multi-Agent Workflows in Claude Code: A Practical Tutorial

Every AI Coding Tool Compared: The 2026 Matrix