
TL;DR
The defensive patterns that keep Claude integrations alive in production. Retry shapes, backoff with jitter, circuit breakers, fallback chains, and the observability you need to debug at 3am.
Most production incidents I have seen with Claude integrations are not about the model. They are about the network between you and the model.
For broader context, pair this with What Is Claude Code? The Complete Guide for 2026 and 60 Claude Code Tips and Tricks for Power Users; those companion pieces show where this fits in the wider AI developer workflow.
Rate limits. Timeouts. Transient 5xx. The occasional auth blip when an API key gets rotated and not propagated. None of these are interesting bugs. All of them will take your app down if you don't have the defensive layer in place.
The good news: the defensive layer is small. Maybe 200 lines of code. Once it is in, your reliability story flips from "depends on whether Anthropic had a good day" to "depends on whether you wrote your retry loop correctly," which is a much better problem to have.
This post is the playbook. Error categorization, retry shape, rate limit handling, fallback strategy, and the monitoring you need to know any of it is working.
Anthropic returns standard HTTP status codes plus a structured error body. The five buckets:
400 - bad request. Invalid JSON, malformed message structure, schema violation on tool input. Permanent. Do not retry. Log the request body and fix the call site. The error message in the response body usually points directly at the problem.
401 - authentication. API key is missing, wrong, or revoked. Permanent for this request. Do not retry. Alert immediately - this means your secrets pipeline is broken.
429 - rate limit. Transient. Retry with exponential backoff. Anthropic returns a retry-after header on rate limit responses. Respect it. Don't guess.
500/502/503/504 - server errors. Transient. Retry with backoff. The 504 specifically means a timeout on Anthropic's side, often during a slow generation. If it keeps happening on the same request, the prompt is too long or max_tokens is set too high.
Network errors. ECONNRESET, ETIMEDOUT, DNS failures, TLS handshake failures. Transient. Retry with backoff. These are not from Anthropic, they are from the path between you and them.
The Anthropic SDK exposes these as typed errors, which makes the dispatch clean.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

function isRetryable(err: unknown): boolean {
  // Connection errors (including timeouts) subclass APIError in the SDK,
  // so check them before the generic status-code branch.
  if (err instanceof Anthropic.APIConnectionError) return true;
  if (err instanceof Anthropic.APIError) {
    if (err.status === 429) return true;
    if (err.status && err.status >= 500) return true;
    return false;
  }
  return false;
}
Two errors not in the SDK type hierarchy that you still want to handle: bare fetch failures from a flaky network and JSON parse errors from a truncated response. Both are transient. Both want retries.
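A sketch of folding those in, assuming you make bare fetch calls or parse streamed JSON yourself somewhere in the path; the TypeError heuristic here is illustrative, not an SDK guarantee.
function isRetryableRaw(err: unknown): boolean {
  if (isRetryable(err)) return true;
  // Bare fetch reports network failures as TypeError in Node and browsers.
  if (err instanceof TypeError && /fetch|network/i.test(err.message)) return true;
  // JSON.parse on a truncated response body throws SyntaxError.
  if (err instanceof SyntaxError) return true;
  return false;
}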
The naive retry is "wait one second, try again, wait two, try again." This works for one client. With a hundred concurrent clients all retrying on the same wall-clock schedule, you get a thundering herd that takes the upstream down a second time the moment it recovers.
The fix is jitter. Add random noise to the backoff so retries spread out instead of stacking. Full jitter (random between zero and the cap) is the safest variant.
async function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

// Full jitter: a random delay between zero and the exponential cap.
function backoffMs(attempt: number, base = 500, cap = 30_000): number {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  options: { maxAttempts?: number; onRetry?: (attempt: number, err: unknown) => void } = {}
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 5;
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
      let delay = backoffMs(attempt);
      if (err instanceof Anthropic.APIError && err.status === 429) {
        // headers is a plain object in older SDK versions and a Headers
        // instance in newer ones; handle both.
        const retryAfter =
          err.headers instanceof Headers
            ? err.headers.get("retry-after")
            : (err.headers as any)?.["retry-after"];
        const seconds = Number(retryAfter);
        if (Number.isFinite(seconds)) delay = Math.max(delay, seconds * 1000);
      }
      options.onRetry?.(attempt, err);
      await sleep(delay);
    }
  }
  throw lastErr;
}
const response = await withRetry(() =>
  client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  })
);
A few things worth noting. The retry-after header takes precedence over the calculated backoff because Anthropic knows better than you do when to come back. The onRetry callback is where you push metrics - don't skip it. The max-attempts cap is at five because beyond that you are throwing good money after bad, and the human at the other end of your API has been waiting too long anyway.
Note that the official Anthropic SDKs already do retry under the hood with sensible defaults. You can configure this with maxRetries on the client. The wrapper above is for the cases where you want explicit control - logging every retry, custom backoff shape, integration with a circuit breaker.
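For reference, that knob is a client constructor option, and it can also be overridden per request:
const sdkClient = new Anthropic({ maxRetries: 3 }); // client-wide default
const msg = await sdkClient.messages.create(
  { model: "claude-sonnet-4-5", max_tokens: 256, messages: [{ role: "user", content: "ping" }] },
  { maxRetries: 0 } // per-request override; here, disable built-in retry entirely
);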
When Anthropic is having a real outage (rare, but it happens), retrying is counterproductive. You burn budget on doomed requests and prevent your app from falling back to a degraded mode that might actually serve users.
Circuit breakers solve this. Track recent failure rate. When it crosses a threshold, open the breaker - skip the upstream entirely and short-circuit to a fallback. Periodically try one request to see if upstream is back. Close the breaker when it recovers.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private resetMs = 30_000
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      // After the cooldown, let one probe request through.
      if (Date.now() - this.openedAt > this.resetMs) this.state = "half-open";
      else throw new Error("circuit open");
    }
    try {
      const result = await fn();
      // Any success closes the breaker and resets the consecutive-failure
      // count, so stale failures from hours ago cannot trip it.
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
const breaker = new CircuitBreaker();

async function callClaude(input: string) {
  return breaker.run(() =>
    withRetry(() =>
      client.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        messages: [{ role: "user", content: input }],
      })
    )
  );
}
Tune the threshold to your traffic. For an app doing 1000 RPM, a threshold of 5 fires too eagerly on a normal 1 percent failure rate. For an app doing 10 RPM, a threshold of 50 takes 5 minutes to detect a real outage. A rolling-window failure rate (e.g. open if 50 percent of the last 20 requests failed) is more robust than absolute counts.
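A minimal sketch of that variant; the window size and failure ratio are assumptions to tune against your traffic, and record() would be called from the wrapper around each request.
class RollingWindowBreaker {
  private window: boolean[] = []; // true = the request failed

  record(failed: boolean) {
    this.window.push(failed);
    if (this.window.length > 20) this.window.shift();
  }

  shouldOpen(): boolean {
    if (this.window.length < 20) return false; // not enough signal yet
    const failures = this.window.filter(Boolean).length;
    return failures / this.window.length >= 0.5;
  }
}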
Retrying 429s is fine. Not hitting them is better.
Anthropic rate limits are organization-wide and measured in input tokens per minute, output tokens per minute, and requests per minute. The exact numbers depend on your usage tier. The mistake teams make is assuming the rate limit applies per-key or per-host - it does not.
Three patterns to stay under the limit.
Token bucket on your side. Estimate the input + output tokens for each call. Refill the bucket at your tier's rate. Block (or queue) when the bucket is empty. This is the single best rate limit prevention pattern.
Spread bursts. If you have 100 agent runs to launch, don't fire them all at once. Stagger over 60 seconds. Same total throughput, none of the burst pain.
Watch the headers. Anthropic returns anthropic-ratelimit-requests-remaining and similar headers on every response. When remaining drops below 10 percent of your limit, slow down. The headers are the only honest source of truth.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  // capacity: bucket size in tokens; refillPerMs: tokens restored per millisecond.
  constructor(private capacity: number, private refillPerMs: number) {
    this.tokens = capacity;
  }

  async take(amount: number): Promise<void> {
    // A request larger than the bucket would loop forever; fail fast instead.
    if (amount > this.capacity) throw new Error("request exceeds bucket capacity");
    while (true) {
      this.refill();
      if (this.tokens >= amount) {
        this.tokens -= amount;
        return;
      }
      // Sleep just long enough for the deficit to refill, then re-check.
      const needed = amount - this.tokens;
      const waitMs = Math.ceil(needed / this.refillPerMs);
      await sleep(waitMs);
    }
  }

  private refill() {
    const now = Date.now();
    const delta = (now - this.lastRefill) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, this.tokens + delta);
    this.lastRefill = now;
  }
}
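Wiring the bucket to real calls, and reading the headers from the third pattern, might look like this. The tier numbers are illustrative, the four-characters-per-token estimate is rough, and .withResponse() is how the TypeScript SDK exposes the raw response headers.
// Illustrative sizing: match the bucket to your tier's input-tokens-per-minute.
const inputBucket = new TokenBucket(80_000, 80_000 / 60_000);

async function politeCall(input: string) {
  // Rough estimate: ~4 characters per input token, plus the output budget.
  await inputBucket.take(Math.ceil(input.length / 4) + 1024);
  const { data, response } = await client.messages
    .create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      messages: [{ role: "user", content: input }],
    })
    .withResponse();
  const remaining = Number(response.headers.get("anthropic-ratelimit-requests-remaining"));
  if (Number.isFinite(remaining) && remaining < 10) await sleep(1_000); // ease off near the limit
  return data;
}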
Default SDK timeout is 10 minutes. That is too long for almost every production use case. Pick a real number based on your UX budget.
Interactive chat: 30 seconds total, but stream so the user sees tokens at 1s. If generation hasn't finished by 30s, cancel.
Agent step: 60-90 seconds. Long enough for a slow tool result to come back. Short enough that a stuck agent doesn't burn budget for an hour.
Batch job: whatever your batch SLA allows, but always finite.
const response = await client.messages.create(
  {
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  },
  { timeout: 30_000 }
);
For streaming requests, time-to-first-token and total-time are different timeouts. Time-to-first-token should be 5-10s. Total time depends on how long an answer you want.
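A sketch of enforcing both budgets on one call, using the SDK's messages.stream() helper; the 8-second and 60-second numbers are placeholders for your own UX budget.
async function streamWithDeadlines(input: string) {
  const stream = client.messages.stream(
    {
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      messages: [{ role: "user", content: input }],
    },
    { timeout: 60_000 } // total-time budget, enforced by the SDK
  );
  // Separate time-to-first-token budget: abort if nothing arrives in 8s.
  const ttft = setTimeout(() => stream.abort(), 8_000);
  let seenFirstToken = false;
  stream.on("text", () => {
    if (!seenFirstToken) {
      seenFirstToken = true;
      clearTimeout(ttft);
    }
  });
  return stream.finalMessage(); // rejects if the stream was aborted
}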
When Claude is down or slow or rate-limited and you cannot wait, the question is what to do instead. Options in increasing order of "effort to set up":
Cached response. If the same question was asked recently, return the prior answer with a "cached" badge. Free, fast, sometimes wrong.
A smaller model. Fall back from Sonnet to Haiku. Lower quality, much higher availability headroom, lower cost. For a lot of use cases the user won't notice.
A different provider. Have a parallel path through OpenAI or Google. Triggered only when the breaker is open. Be honest with yourself about what fraction of your prompts are portable - tool use schemas, prompt caching, and extended thinking will not transfer cleanly.
A static degraded UX. Show "AI is temporarily unavailable, here are some prewritten resources." Last resort. Better than a blank page.
The right choice depends on the business cost of latency versus the business cost of a wrong answer. For a customer support chatbot, cached or smaller-model is usually better. For a financial advisor, static degraded is better.
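Stitched together with the wrappers above, the chain might look like this; the Haiku model id and the static copy are placeholders for your own setup.
async function answerWithFallback(input: string): Promise<{ text: string; source: string }> {
  const extractText = (msg: Anthropic.Messages.Message) =>
    msg.content.map((b) => (b.type === "text" ? b.text : "")).join("");
  try {
    // Primary path: the breaker + retry wrapper from earlier.
    return { text: extractText(await callClaude(input)), source: "primary" };
  } catch {
    try {
      // Smaller-model fallback, still behind the retry wrapper.
      const msg = await withRetry(() =>
        client.messages.create({
          model: "claude-haiku-4-5", // assumption: substitute your fallback model id
          max_tokens: 1024,
          messages: [{ role: "user", content: input }],
        })
      );
      return { text: extractText(msg), source: "fallback-model" };
    } catch {
      // Last resort: static degraded UX.
      return { text: "AI is temporarily unavailable. Here are some prewritten resources.", source: "static" };
    }
  }
}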
The reliability work above does nothing if you don't know when it fires. Five metrics, every Claude call.
Latency p50, p95, p99. Error rate by status code. Tokens in, tokens out, cost per call. Cache hit rate (if you are using prompt caching). Retry count distribution.
Log structured. Every call gets request_id, status, latency, tokens, retries, model. The Anthropic response headers include a request ID, and that ID is what Anthropic's support team will ask for if you file a ticket. Save it.
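A sketch of what that looks like around the retry wrapper; console.log stands in for your logger, and _request_id is the property the TypeScript SDK uses to surface the request-id header on response objects.
const started = Date.now();
let retries = 0;
const res = await withRetry(
  () =>
    client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Hello" }],
    }),
  { onRetry: () => retries++ }
);
console.log(
  JSON.stringify({
    request_id: res._request_id,
    model: res.model,
    status: "ok",
    latency_ms: Date.now() - started,
    input_tokens: res.usage.input_tokens,
    output_tokens: res.usage.output_tokens,
    retries,
  })
);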
We run all of this on agent-finops for cost and rate-limit visibility, and replay the actual prompt/response pairs through tracetrail when something looks weird. The combination is the difference between "we had an incident yesterday" and "we know exactly what happened, here is the fix."
If you want to see the full reliability stack assembled in a working app, the DevDigest YouTube channel walks through the wrapper, the breaker, and the dashboards on a real service. The patterns are not exotic. The discipline of actually shipping them is what separates the apps that stay up from the ones that don't.
Reliability is not a feature you add at the end. It is the layer underneath every Claude call from request one. Build it once, build it right, and the model failures stop being your problem.