Claude API Reliability: Error Handling Best Practices

The reason your Claude app falls over

Most production incidents I have seen with Claude integrations are not about the model. They are about the network between you and the model.

Rate limits. Timeouts. Transient 5xx. The occasional auth blip when an API key gets rotated and not propagated. None of these are interesting bugs. All of them will take your app down if you don't have the defensive layer in place.

The good news: the defensive layer is small. Maybe 200 lines of code. Once it is in, your reliability story flips from "depends on whether Anthropic had a good day" to "depends on whether you wrote your retry loop correctly," which is a much better problem to have.

This post is the playbook. Error categorization, retry shape, rate limit handling, fallback strategy, and the monitoring you need to know any of it is working.

The five error categories that matter

Anthropic returns standard HTTP status codes plus a structured error body. The five buckets:

400 - bad request. Invalid JSON, malformed message structure, schema violation on tool input. Permanent. Do not retry. Log the request body and fix the call site. The error message in the response body usually points directly at the problem.

401 - authentication. API key is missing, wrong, or revoked. Permanent for this request. Do not retry. Alert immediately - this means your secrets pipeline is broken.

429 - rate limit. Transient. Retry with exponential backoff. Anthropic returns a retry-after header on rate limit responses. Respect it. Don't guess.

500/502/503/504/529 - server errors. Transient. Retry with backoff. Anthropic also returns 529 when the API is overloaded; treat it like the other server errors. The 504 specifically means a timeout on Anthropic's side, often during a slow generation. If it keeps happening on the same request, the prompt is probably too long or max_tokens is too high.

Network errors. ECONNRESET, ETIMEDOUT, DNS failures, TLS handshake failures. Transient. Retry with backoff. These are not from Anthropic, they are from the path between you and them.

The Anthropic SDK exposes these as typed errors, which makes the dispatch clean.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

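function isRetryable(err: unknown): boolean {
  // Check connection errors first. In the SDK they subclass APIError and carry
  // no status, so the status checks below would classify them as permanent.
  // APIConnectionTimeoutError is a subclass of APIConnectionError.
  if (err instanceof Anthropic.APIConnectionError) return true;
  if (err instanceof Anthropic.APIError) {
    if (err.status === 429) return true; // rate limited
    if (typeof err.status === "number" && err.status >= 500) return true; // server-side failure
    return false; // other 4xx: fix the request, don't retry
  }
  return false;
}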

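Two errors sit outside the SDK type hierarchy but still deserve handling: bare fetch failures from a flaky network and JSON parse errors from a truncated response. Both are transient and both want retries. Note that isRetryable above returns false for them, so add explicit checks if they show up in your logs.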
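As a rough sketch of what handling those two cases could look like, the helper below classifies errors that do not come from the SDK. The name isTransientNonSdkError, the list of error codes, and the decision to walk the cause chain are all assumptions for illustration; none of it is part of the Anthropic SDK.

```typescript
// Hypothetical helper for errors outside the SDK's typed hierarchy.
// The code list is an assumption; tune it against what your logs actually show.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "ECONNREFUSED", "EAI_AGAIN", "EPIPE"]);

function isTransientNonSdkError(err: unknown, depth = 0): boolean {
  if (!(err instanceof Error) || depth > 5) return false;

  // A truncated response body typically surfaces as a SyntaxError from JSON.parse.
  if (err instanceof SyntaxError) return true;

  // Node-style network errors carry a string `code`.
  const code = (err as { code?: unknown }).code;
  if (typeof code === "string" && TRANSIENT_CODES.has(code)) return true;

  // fetch implementations often wrap the network error in `cause`.
  const cause = (err as { cause?: unknown }).cause;
  if (cause !== undefined && cause !== err) return isTransientNonSdkError(cause, depth + 1);

  return false;
}
```

One way to wire this in is to treat an error as retryable when isRetryable(err) || isTransientNonSdkError(err) is true.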

Retry with exponential backoff and jitter

The naive retry is "wait one second, try again, wait two, try again." This works for one client. With a hundred concurrent clients all retrying on the same wall-clock schedule, you get a thundering herd that takes the upstream down a second time the moment it recovers.

The fix is jitter. Add random noise to the backoff so retries spread out instead of stacking. Full jitter (random between zero and the cap) is the safest variant.

async function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

function backoffMs(attempt: number, base = 500, cap = 30_000): number {
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  options: { maxAttempts?: number; onRetry?: (attempt: number, err: unknown) => void } = {}
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 5;
  let lastErr: unknown;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;

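      let delay = backoffMs(attempt);
      if (err instanceof Anthropic.APIError && err.status === 429) {
        // Newer SDK versions expose a Fetch Headers object, older ones a plain
        // record, so read the header in a way that works with both.
        const headers = err.headers as any;
        const retryAfter = typeof headers?.get === "function" ? headers.get("retry-after") : headers?.["retry-after"];
        const seconds = Number(retryAfter);
        // Use the header as a floor; ignore it when missing or not a number of seconds.
        if (retryAfter && Number.isFinite(seconds)) delay = Math.max(delay, seconds * 1000);
      }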

      options.onRetry?.(attempt, err);
      await sleep(delay);
    }
  }

  throw lastErr;
}

const response = await withRetry(() =>
  client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  })
);

A few things worth noting. The retry-after header sets a floor under the calculated backoff because Anthropic knows better than you do when to come back. The onRetry callback is where you push metrics - don't skip it. The max-attempts cap is five because beyond that you are throwing good money after bad, and the human at the other end of your API has been waiting too long anyway.
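To put a number on that waiting, here is a small worked calculation of the longest the wrapper above can sleep in total, assuming the same defaults (base 500 ms, cap 30 s, five attempts) and no retry-after header. The helper names are hypothetical and exist only for this arithmetic.

```typescript
// Upper bound of the full-jitter delay for one attempt, mirroring backoffMs
// (the real delay is drawn uniformly from 0 up to this bound).
function backoffBoundMs(attempt: number, base = 500, cap = 30_000): number {
  return Math.min(cap, base * 2 ** attempt);
}

// With maxAttempts = 5 there are at most 4 sleeps (after attempts 0..3);
// the fifth failure is thrown without sleeping.
function worstCaseWaitMs(maxAttempts = 5): number {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    total += backoffBoundMs(attempt);
  }
  return total;
}

console.log(worstCaseWaitMs());     // 7500 (500 + 1000 + 2000 + 4000)
console.log(worstCaseWaitMs() / 2); // 3750, roughly the average total under full jitter
```

These figures cover sleep time only. Request time comes on top, and a retry-after value from a 429 can push any single delay higher.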

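Note that the official Anthropic SDKs already retry under the hood with sensible defaults. You can configure this with maxRetries on the client. The wrapper above is for the cases where you want explicit control - logging every retry, custom backoff shape, integration with a circuit breaker. If you do use the wrapper, set maxRetries: 0 on the client. Otherwise the two layers multiply: with the TypeScript SDK's default of two retries, five wrapper attempts can turn into fifteen requests.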

Circuit breakers stop the bleeding

When Anthropic is having a real outage (rare, but it happens), retrying is counterproductive. You burn budget on doomed requests and prevent your app from falling back to a degraded mode that might actually serve users.

Circuit breakers solve this. Track recent failure rate. When it crosses a threshold, open the breaker - skip the upstream entirely and short-circuit to a fallback. Periodically try one request to see if upstream is back. Close the breaker when it recovers.

type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private resetMs = 30_000
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt > this.resetMs) this.state = "half-open";
      else throw new Error("circuit open");
    }

    try {
      const result = await fn();
      if (this.state === "half-open") {
        this.state = "closed";
        this.failures = 0;
      }
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

const breaker = new CircuitBreaker();

async function callClaude(input: string) {
  return breaker.run(() =>
    withRetry(() =>
      client.messages.create({
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        messages: [{ role: "user", content: input }],
      })
    )
  );
}

Tune the threshold to your traffic. The breaker above counts failures cumulatively while closed and never resets on success, so for an app doing 1000 RPM a threshold of 5 trips within a minute on a normal 1 percent failure rate. For an app doing 10 RPM, a threshold of 50 takes 5 minutes to detect a real outage. A rolling-window failure rate (e.g. open if 50 percent of the last 20 requests failed) is more robust than absolute counts.
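A rolling-window variant might look something like the sketch below. The class name, the defaults (window of 20, ratio of 0.5), and the injectable clock are illustrative assumptions, and it makes no attempt to limit concurrent probes in the half-open state.

```typescript
type WindowState = "closed" | "open" | "half-open";

class RollingWindowBreaker {
  private state: WindowState = "closed";
  private outcomes: boolean[] = []; // true = failure, newest last
  private openedAt = 0;

  constructor(
    private windowSize = 20,
    private failureRatio = 0.5,
    private resetMs = 30_000,
    private now: () => number = Date.now // injectable so tests can control time
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt <= this.resetMs) throw new Error("circuit open");
      this.state = "half-open"; // reset period elapsed: let a probe through
    }
    try {
      const result = await fn();
      if (this.state === "half-open") {
        this.state = "closed";
        this.outcomes = []; // start with a clean window after recovery
      }
      this.record(false);
      return result;
    } catch (err) {
      if (this.state === "half-open") this.trip(); // failed probe: reopen
      else this.record(true);
      throw err;
    }
  }

  private record(failed: boolean) {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    // Judge only on a failure and only once the window is full, so a single
    // early failure cannot open the breaker.
    if (!failed || this.outcomes.length < this.windowSize) return;
    const failures = this.outcomes.filter(Boolean).length;
    if (failures / this.windowSize >= this.failureRatio) this.trip();
  }

  private trip() {
    this.state = "open";
    this.openedAt = this.now();
  }
}
```

Because it looks at a ratio over recent requests, the same settings behave sensibly at both 10 RPM and 1000 RPM; only the time it takes to fill the window changes.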

Rate limits: prevent before retry

Retrying 429s is fine. Not hitting them is better.

Anthropic rate limits are organization-wide and measured in input tokens per minute, output tokens per minute, and requests per minute. The exact numbers depend on your usage tier. The mistake teams make is assuming the rate limit applies per-key or per-host - it does not.

Three patterns to stay under the limit.

Token bucket on your side. Estimate the input + output tokens for each call. Refill the bucket at your tier's rate. Block (or queue) when the bucket is empty. This is the single best rate limit prevention pattern.

Spread bursts. If you have 100 agent runs to launch, don't fire them all at once. Stagger over 60 seconds. Same total throughput, none of the burst pain.

Watch the headers. Anthropic returns anthropic-ratelimit-requests-remaining and similar headers on every response. When remaining drops below 10 percent of your limit, slow down. The headers are the only honest source of truth.
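The second pattern above, spreading bursts, can be sketched with a small scheduling helper. The function names and the even spacing are assumptions for illustration; a real launcher might add jitter or feed a queue instead.

```typescript
// Hypothetical helper: start offsets that spread `count` tasks evenly across `windowMs`.
function staggerOffsets(count: number, windowMs: number): number[] {
  if (count <= 0) return [];
  const step = windowMs / count;
  return Array.from({ length: count }, (_, i) => Math.round(i * step));
}

// Starts each task at its offset. Like Promise.all, this rejects on the first
// failure, so tasks should handle their own errors (for example via a retry wrapper).
async function launchStaggered<T>(tasks: Array<() => Promise<T>>, windowMs: number): Promise<T[]> {
  const offsets = staggerOffsets(tasks.length, windowMs);
  return Promise.all(
    tasks.map(async (task, i) => {
      await new Promise<void>((resolve) => setTimeout(resolve, offsets[i]));
      return task();
    })
  );
}

// 100 runs over 60 seconds means one start every 600 ms.
console.log(staggerOffsets(100, 60_000).slice(0, 3)); // [ 0, 600, 1200 ]
```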

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerMs: number) {
    this.tokens = capacity;
  }

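  async take(amount: number): Promise<void> {
    // A request larger than the bucket can never be satisfied; fail fast
    // instead of looping forever.
    if (amount > this.capacity) {
      throw new RangeError(`requested ${amount} tokens, capacity is ${this.capacity}`);
    }
    while (true) {
      this.refill();
      if (this.tokens >= amount) {
        this.tokens -= amount;
        return;
      }
      // Sleep just long enough for the shortfall to refill, then re-check.
      const needed = amount - this.tokens;
      const waitMs = Math.ceil(needed / this.refillPerMs);
      await sleep(waitMs);
    }
  }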

  private refill() {
    const now = Date.now();
    const delta = (now - this.lastRefill) * this.refillPerMs;
    this.tokens = Math.min(this.capacity, this.tokens + delta);
    this.lastRefill = now;
  }
}
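For the third pattern, watching the headers, a minimal check could look like the following. It assumes the limit and remaining headers come as a pair (anthropic-ratelimit-requests-limit and anthropic-ratelimit-requests-remaining, with matching input-tokens and output-tokens variants). The function name, the HeaderReader interface, and the choice to do nothing when headers are missing are assumptions made for this sketch.

```typescript
// Minimal shape shared by a Fetch `Headers` object and simple test doubles.
interface HeaderReader {
  get(name: string): string | null;
}

type LimitKind = "requests" | "input-tokens" | "output-tokens";

// Hypothetical helper: true when the remaining budget for `kind` has dropped
// below `minFraction` of the limit. Missing or malformed headers return false,
// so a header change never blocks traffic by itself.
function shouldSlowDown(headers: HeaderReader, kind: LimitKind = "requests", minFraction = 0.1): boolean {
  const limitRaw = headers.get(`anthropic-ratelimit-${kind}-limit`);
  const remainingRaw = headers.get(`anthropic-ratelimit-${kind}-remaining`);
  if (limitRaw === null || remainingRaw === null) return false;

  const limit = Number(limitRaw);
  const remaining = Number(remainingRaw);
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) return false;

  return remaining / limit < minFraction;
}
```

With the TypeScript SDK, calling .withResponse() on a request returns the raw response next to the parsed data, so response.headers can be passed in directly.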

Timeouts: pick a number and enforce it

Default SDK timeout is 10 minutes. That is too long for almost every production use case. Pick a real number based on your UX budget.

Interactive chat: 30 seconds total, but stream so the user sees the first tokens within about a second. If generation hasn't finished by 30s, cancel.

Agent step: 60-90 seconds. Long enough for a slow tool result to come back. Short enough that a stuck agent doesn't burn budget for an hour.

Batch job: whatever your batch SLA allows, but always finite.

const response = await client.messages.create(
  {
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  },
  { timeout: 30_000 }
);

For streaming requests, time-to-first-token and total-time are different timeouts. Time-to-first-token should be 5-10s. Total time depends on how long an answer you want.
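One way to enforce both deadlines is a small wrapper around whatever async iterable your streaming call returns. This is a sketch under assumptions: consumeWithDeadlines and its options are hypothetical names, and cancel stands in for whatever aborts the underlying request in your setup (an AbortController's abort(), for example).

```typescript
// Hypothetical helper: consumes a stream of chunks while enforcing a
// time-to-first-chunk deadline and a total deadline.
async function consumeWithDeadlines<T>(
  stream: AsyncIterable<T>,
  opts: { firstChunkMs: number; totalMs: number; onChunk: (chunk: T) => void; cancel?: () => void }
): Promise<void> {
  const start = Date.now();
  const it = stream[Symbol.asyncIterator]();
  let first = true;

  while (true) {
    // Before the first chunk the tighter of the two budgets applies;
    // afterwards only what is left of the total budget.
    const remainingTotal = opts.totalMs - (Date.now() - start);
    const budget = first ? Math.min(opts.firstChunkMs, remainingTotal) : remainingTotal;

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        opts.cancel?.(); // stop the upstream request so it stops costing tokens
        reject(new Error(first ? "first-token timeout" : "total timeout"));
      }, Math.max(0, budget));
    });

    try {
      const next = await Promise.race([it.next(), timeout]);
      if (next.done) return;
      first = false;
      opts.onChunk(next.value);
    } finally {
      clearTimeout(timer);
    }
  }
}
```

The helper does not call the iterator's return() on timeout, because that can wait on a pending read; cancellation is left to the cancel callback.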

Fallback strategies

When Claude is down or slow or rate-limited and you cannot wait, the question is what to do instead. Options in increasing order of "effort to set up":

Cached response. If the same question was asked recently, return the prior answer with a "cached" badge. Free, fast, sometimes wrong.

A smaller model. Fall back from Sonnet to Haiku. Lower quality, much higher availability headroom, lower cost. For a lot of use cases the user won't notice.

A different provider. Have a parallel path through OpenAI or Google. Triggered only when the breaker is open. Be honest with yourself about what fraction of your prompts are portable - tool use schemas, prompt caching, and extended thinking will not transfer cleanly.

A static degraded UX. Show "AI is temporarily unavailable, here are some prewritten resources." Last resort. Better than a blank page.

The right choice depends on the business cost of latency versus the business cost of a wrong answer. For a customer support chatbot, cached or smaller-model is usually better. For a financial advisor, where a wrong answer costs more than no answer, static degraded is better.
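Whichever options you pick, they compose naturally as an ordered chain: try each in turn and serve the first that succeeds. The sketch below is one possible shape; Strategy, withFallback, and the servedBy label are hypothetical names, and the strategies in the usage example are stand-ins rather than real calls.

```typescript
type Strategy<T> = { name: string; run: () => Promise<T> };

// Tries each strategy in order and reports which one served the request,
// so the UI can show a "cached" or "degraded" badge and metrics can count fallbacks.
async function withFallback<T>(strategies: Strategy<T>[]): Promise<{ value: T; servedBy: string }> {
  const errors: string[] = [];
  for (const strategy of strategies) {
    try {
      return { value: await strategy.run(), servedBy: strategy.name };
    } catch (err) {
      errors.push(`${strategy.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all fallbacks failed (${errors.join("; ")})`);
}

// Usage with stand-in strategies: the primary fails, the cache answers.
async function demo() {
  const result = await withFallback<string>([
    { name: "primary", run: async () => { throw new Error("circuit open"); } },
    { name: "cache", run: async () => "cached answer" },
    { name: "static", run: async () => "AI is temporarily unavailable" },
  ]);
  console.log(result.servedBy); // cache
}
```

Putting a static response last means the chain itself never throws in practice, which keeps the degraded path out of your error budget.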

Observability: you can't fix what you can't see

The reliability work above does nothing if you don't know when it fires. Five metrics, every Claude call.

Latency p50, p95, p99. Error rate by status code. Tokens in, tokens out, cost per call. Cache hit rate (if you are using prompt caching). Retry count distribution.

Log structured records. Every call gets request_id, status, latency, tokens, retries, model. Anthropic returns a request-id header on every response, and that ID is what their support team will ask for if you file a ticket. Save it.
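A per-call log record could be as simple as the sketch below. The field names follow the list above; the interface name, the event label, and the convention of status 0 for network errors are assumptions for illustration.

```typescript
interface ClaudeCallRecord {
  request_id: string | null; // value of the request-id response header, if any
  model: string;
  status: number;            // HTTP status; 0 for a network error (assumed convention)
  latency_ms: number;
  input_tokens: number;
  output_tokens: number;
  retries: number;
}

// Serializes one call as a single JSON line, ready for any log pipeline.
function toLogLine(record: ClaudeCallRecord, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), event: "claude_call", ...record });
}

console.log(
  toLogLine({
    request_id: "req_example", // placeholder, not a real ID
    model: "claude-sonnet-4-5",
    status: 200,
    latency_ms: 1840,
    input_tokens: 512,
    output_tokens: 230,
    retries: 1,
  })
);
```

Latency percentiles, error rate by status, and the retry distribution can all be derived from these lines downstream, so the call site only has to emit one record per request.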

We run all of this on agent-finops for cost and rate-limit visibility, and replay the actual prompt/response pairs through tracetrail when something looks weird. The combination is the difference between "we had an incident yesterday" and "we know exactly what happened, here is the fix."

If you want to see the full reliability stack assembled in a working app, the DevDigest YouTube channel walks through the wrapper, the breaker, and the dashboards on a real service. The patterns are not exotic. The discipline of actually shipping them is what separates the apps that stay up from the ones that don't.

Reliability is not a feature you add at the end. It is the layer underneath every Claude call from request one. Build it once, build it right, and the model failures stop being your problem.