
TL;DR
How to ship Claude's Batch API in production. 50% cost savings, TypeScript SDK code, JSONL request format, and the async architecture gotchas that bite at 100k requests.
If you have a Claude workload that doesn't need to answer in real time, you are probably overpaying by 2x. The Batch API charges half - half - the per-token rate of the synchronous API in exchange for results within 24 hours. For nightly reports, content classification, synthetic data generation, document analysis, and most agent evaluations, that trade is a no-brainer. And yet most teams I look at still run those workloads through the live API because nobody wants to refactor the request loop.
This is the version of the docs I wish I had the first time I moved a real workload to batch. We will cover the request format, the SDK code you should ship, the architecture changes async forces on your stack, and the failure modes you only learn about after a 100k-request batch silently drops 4% of its results.
We walked through a real migration in our Batch Processing for Scale: 100k Requests/Day video on YouTube. This post is the production-grade companion.
Batch is the right answer when:
- Nobody is watching a spinner - results can land minutes or hours later without anyone noticing.
- The workload is bulk and embarrassingly parallel: classification, report generation, synthetic data, evals, re-indexing.
- Each request is a one-shot call that doesn't depend on another request's output.

Batch is the wrong answer when:
- A user is waiting on the response.
- The call sits inside a multi-step tool loop that needs intermediate results.
- You need a completion guarantee tighter than the 24-hour window.
The hybrid pattern - synchronous for user-facing work, batch for everything else - is what mature deployments look like. Customer-asked-a-question goes live; nightly data labeling goes batch.
You package up to 100,000 message-create requests into a single request file. You submit it. Anthropic processes them in parallel on separate, lower-priority infrastructure. You poll for completion or subscribe to a webhook. When done, you download a results file and match results back to your inputs by custom_id.
Three properties to internalize:
- It is fully asynchronous. You submit, you wait, and results arrive any time inside the 24-hour window - usually much sooner.
- Rows succeed or fail individually. A batch can end with some rows errored or missing, so you reconcile; you don't assume.
- Results come back in arbitrary order, keyed only by custom_id. Anyone who ignores this and zips inputs to outputs gets bitten.

Here is a minimal but production-shaped batch submission using the official Anthropic SDK. It builds requests with stable IDs, submits, polls, and reconciles results.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

type ClassifyInput = { id: string; text: string };

const SYSTEM = "Classify the input as one of: bug, feature, question, spam. Return only the label.";

// One batch request per input row, keyed by the row's own primary key.
function buildRequests(inputs: ClassifyInput[]) {
  return inputs.map((input) => ({
    custom_id: input.id,
    params: {
      model: "claude-haiku-4-5" as const,
      max_tokens: 16,
      system: SYSTEM,
      messages: [{ role: "user" as const, content: input.text }],
    },
  }));
}

export async function classifyBatch(inputs: ClassifyInput[]) {
  const batch = await client.messages.batches.create({
    requests: buildRequests(inputs),
  });

  // Poll until the batch reaches its terminal state.
  let status = batch;
  while (status.processing_status !== "ended") {
    await new Promise((r) => setTimeout(r, 30_000));
    status = await client.messages.batches.retrieve(batch.id);
  }

  // Stream the JSONL results and key them back to inputs by custom_id.
  const results: Record<string, string> = {};
  for await (const result of await client.messages.batches.results(batch.id)) {
    if (result.result.type === "succeeded") {
      const block = result.result.message.content[0];
      results[result.custom_id] = block.type === "text" ? block.text.trim() : "";
    } else {
      results[result.custom_id] = `__error__:${result.result.type}`;
    }
  }
  return results;
}
A few non-obvious things this captures:
- custom_id is yours to define and must be unique per batch. We use whatever primary key the input row has in our database. This is how you reconcile results to records.
- The results() iterator streams a JSONL file. Do not try to load it into memory for large batches; iterate.
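If the batch is large, stream each result to disk (or straight into your database) as you iterate rather than accumulating a giant in-memory map. A minimal sketch of the disk variant, assuming the client from the snippet above and a local results.jsonl path:

import { createWriteStream } from "node:fs";

// Stream batch results to a local JSONL file instead of holding them in memory.
// Assumes `client` is the Anthropic instance from the snippet above.
export async function resultsToDisk(batchId: string, path = "results.jsonl") {
  const out = createWriteStream(path, { flags: "w" });
  for await (const r of await client.messages.batches.results(batchId)) {
    const line =
      r.result.type === "succeeded"
        ? { id: r.custom_id, ok: true, output: r.result.message.content }
        : { id: r.custom_id, ok: false, reason: r.result.type };
    out.write(JSON.stringify(line) + "\n");
  }
  out.end();
}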
Polling at scale is a trap. Each batch runs for an unknown duration, and a fleet of polling workers either churns CPU or sleeps too long. Webhooks are the right pattern.
Configure a webhook endpoint in your Anthropic console or via API, and Anthropic will POST a notification when a batch finishes. Your handler:
- acknowledges the event quickly and does the heavy lifting off the request path,
- re-fetches the batch by ID and confirms its processing_status is "ended",
- streams the results, reconciles them against the expected IDs, and marks the job complete.
Two operational notes from production: webhooks can fire more than once (idempotent handlers, please), and an event can arrive before the batch's own status field has caught up. Trust the batch's status when you fetch it, not the event payload alone.
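A minimal handler sketch that folds in both notes - it treats the event as a hint, re-fetches the batch, and is safe to run twice. The Express wiring, the event's batch-id field, and the persistence helpers are assumptions; `client` is the Anthropic instance from the first snippet and `reconcile` is the helper defined in the next section.

import express from "express";

// Hypothetical persistence helpers - wire these to your own store.
declare function alreadyProcessed(batchId: string): Promise<boolean>;
declare function markProcessed(batchId: string): Promise<void>;
declare function expectedIdsFor(batchId: string): Promise<Set<string>>;
declare function persistResults(
  batchId: string,
  ok: { id: string; output: string }[],
  errors: { id: string; reason: string }[],
  missing: string[],
): Promise<void>;

const app = express();
app.use(express.json());

app.post("/webhooks/anthropic-batch", async (req, res) => {
  res.sendStatus(200); // ack fast; do the work out of band

  // The batch-id field name here is an assumption - check your actual payload.
  const batchId: string | undefined = req.body?.data?.id ?? req.body?.batch_id;
  if (!batchId || (await alreadyProcessed(batchId))) return; // idempotency guard

  // Trust the batch's own status, not the event payload.
  const batch = await client.messages.batches.retrieve(batchId);
  if (batch.processing_status !== "ended") return;

  const { ok, errors, missing } = await reconcile(batchId, await expectedIdsFor(batchId));
  await persistResults(batchId, ok, errors, missing);
  await markProcessed(batchId);
});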
The bug you don't notice in dev and absolutely will hit in prod: a 99,500-row result file for a 100,000-row batch, and you ship the answer assuming all rows came back.
The reconciliation pattern that survives:
// Compare what came back against what we submitted.
// `client` is the Anthropic instance from the first snippet.
async function reconcile(batchId: string, expectedIds: Set<string>) {
  const seen = new Set<string>();
  const errors: { id: string; reason: string }[] = [];
  const ok: { id: string; output: string }[] = [];

  for await (const r of await client.messages.batches.results(batchId)) {
    seen.add(r.custom_id);
    if (r.result.type === "succeeded") {
      const block = r.result.message.content[0];
      ok.push({ id: r.custom_id, output: block.type === "text" ? block.text : "" });
    } else {
      errors.push({ id: r.custom_id, reason: r.result.type });
    }
  }

  // Rows we submitted but never saw in the results file.
  const missing = [...expectedIds].filter((id) => !seen.has(id));
  return { ok, errors, missing };
}
You always end up with three buckets: succeeded, errored, and missing. Each bucket needs a path. Errored rows are eligible for retry on the live API or a follow-up batch. Missing rows are the silent failure mode and almost always indicate a bug in how you constructed the batch (duplicate custom_ids, oversized rows, encoding problems).
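For the errored and missing buckets, a small follow-up pass on the synchronous API is usually enough. A sketch, assuming the client and SYSTEM prompt from the first snippet; getInputText is a hypothetical lookup from custom_id back to the original row text:

// Retry errored and missing rows one-by-one on the live API.
declare function getInputText(id: string): Promise<string>;

export async function retryLeftovers(ids: string[]) {
  const recovered: Record<string, string> = {};
  for (const id of ids) {
    const msg = await client.messages.create({
      model: "claude-haiku-4-5",
      max_tokens: 16,
      system: SYSTEM,
      messages: [{ role: "user", content: await getInputText(id) }],
    });
    const block = msg.content[0];
    recovered[id] = block.type === "text" ? block.text.trim() : "";
  }
  return recovered;
}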
The 50% discount applies to both input and output tokens. Cache reads, cache writes, thinking tokens - all 50% off in batch mode. There is no separate "batch tier" to opt into beyond using the batch endpoint.
The savings stack with caching. Take a nightly classification batch of 100k rows that shares a 4k-token instruction prefix; the back-of-the-envelope math looks like the sketch below, where per-million-token prices and per-row token counts are placeholders you should swap for current list prices:
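// Back-of-the-envelope comparison: live vs. batch+cache for 100k rows.
// All prices and token counts below are placeholders, not current list prices.
const PRICES = { inPerM: 1.0, outPerM: 5.0 }; // USD per million tokens (placeholder)
const CACHE_READ_MULT = 0.1;                  // cache reads billed at a fraction of the input rate (assumption)
const BATCH_MULT = 0.5;                       // batch discount on everything

const rows = 100_000;
const sharedPrefix = 4_000; // tokens, identical across rows
const perRowInput = 200;    // tokens unique to each row (assumption)
const perRowOutput = 16;    // tokens out per row

// Live API, no caching: every row pays full price for the whole prefix.
const live =
  (rows * (sharedPrefix + perRowInput) * PRICES.inPerM +
    rows * perRowOutput * PRICES.outPerM) / 1_000_000;

// Batch + cache: rows that hit the cache pay the read rate on the prefix,
// and everything gets the 50% batch discount.
const batchCached =
  ((rows * sharedPrefix * CACHE_READ_MULT + rows * perRowInput) * PRICES.inPerM +
    rows * perRowOutput * PRICES.outPerM) * BATCH_MULT / 1_000_000;

console.log({ live, batchCached, savings: 1 - batchCached / live });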
For workloads with stable instructions and high request counts, batch + cache is the cheapest configuration available on the Claude API today.
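Wiring the cache in is just a matter of marking the shared prefix on each batch request. A sketch of the request shape, reusing the classification prompt from earlier; note that whether a given row lands as a cache read or a cache write inside a batch is best-effort (more on that below):

// Same batch requests as before, but with the shared system prompt marked
// as a cacheable prefix so rows that hit the cache are billed at the read rate.
function buildCachedRequests(inputs: ClassifyInput[]) {
  return inputs.map((input) => ({
    custom_id: input.id,
    params: {
      model: "claude-haiku-4-5" as const,
      max_tokens: 16,
      system: [
        {
          type: "text" as const,
          text: SYSTEM,
          cache_control: { type: "ephemeral" as const },
        },
      ],
      messages: [{ role: "user" as const, content: input.text }],
    },
  }));
}

// usage: client.messages.batches.create({ requests: buildCachedRequests(inputs) })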
For monitoring batch spend over time and catching regressions, CodeBurn tracks per-batch tokens and per-job cost so you can spot the runaway batch before the invoice shows up.
1. Daily content classification. Tag a few hundred thousand rows of incoming user content with categories. Haiku in batch mode runs about $1 per 100k rows for short text. We have shipped this on three different products.
2. Scheduled report generation. Generate per-customer summaries overnight. The latency floor is fine because the report lands in the inbox at 7am anyway. Sonnet in batch mode is the right model.
3. Synthetic data generation. Generating eval datasets, training sets, or persona-conditioned variants. These are big, embarrassingly parallel, and not user-facing. Perfect batch fit.
4. Eval suites. Running a 10k-prompt eval against five models takes 50,000 calls. Batch all of them, reconcile, compute metrics. Cuts both cost and wall-clock time vs. naive serial calling.
5. Document re-indexing. Re-summarizing or re-extracting from a corpus when you change your prompt. Every team I know that runs RAG eventually has a "rebuild all summaries" job; this is what batch is for.
1. Per-batch row limits. Up to 100,000 requests or 256MB total, whichever hits first. For larger jobs, shard into multiple batches. A simple sharding scheme: hash custom_id mod N (see the sketch after this list).
2. Rate limits still apply. Batches share quota with synchronous calls in a way that surprised us the first time. A huge batch can cause your synchronous calls to hit overloaded_error. If your live traffic shares the API key with batch, watch the interaction.
3. Caching works in batches, but hits are best-effort. A cache write inside a batch creates an entry that other batch rows and live calls can read, but rows are processed in parallel, so you cannot control which rows write the cache and which read it - expect more writes and fewer reads than the same workload run serially. Set cache_control on the shared prefix anyway; the rows that do hit the cache get the read rate, and the discount stacks with the batch discount.
4. Tool use works, but is rarely useful in batch. Batch is for one-shot calls. If a row requires a tool result, you cannot loop within the batch - you would need to take the tool calls out, run them yourself, and submit a follow-up batch. Most batch workloads should not use tools at all.
5. The 24-hour SLA is a ceiling, not a target. Median completion in our measurements is 5 to 30 minutes. Plan for 24h, but don't design dashboards that show "still processing" for the first 12 hours and panic users.
6. Batch IDs are forever in your DB. You will want to query by batch ID six months later when an auditor asks where row 47,318 of a March classification job went. Store batch IDs alongside the records they generated, not in a side table that gets pruned.
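The sharding scheme from caveat 1, sketched: a stable string hash over custom_id so the same row always lands in the same shard, then one batch submission per shard. Assumes the client and buildRequests from the first snippet; pick N so each shard stays under the 100k-request / 256MB limits.

// Stable FNV-1a hash so a given custom_id always maps to the same shard.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Split inputs into N shards and submit one batch per shard.
export async function submitSharded(inputs: ClassifyInput[], shardCount: number) {
  const shards: ClassifyInput[][] = Array.from({ length: shardCount }, () => []);
  for (const input of inputs) {
    shards[fnv1a(input.id) % shardCount].push(input);
  }

  const batchIds: string[] = [];
  for (const shard of shards) {
    if (shard.length === 0) continue;
    const batch = await client.messages.batches.create({ requests: buildRequests(shard) });
    batchIds.push(batch.id);
  }
  return batchIds; // store these alongside the records they cover (see caveat 6)
}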
The pattern mature production stacks land on:
[user request]
  |- if interactive     -> [synchronous API w/ caching]
  |- if backgroundable  -> [enqueue]
                             |- [batch worker pulls every N min]
                             |- [submits batch]
                             |- [webhook -> reconcile -> notify]
The interesting decision is the boundary. We move workloads to batch the moment we can answer "is anyone watching the spinner" with "no." Reports, daily syncs, classification, evals, summaries, embeddings (when we still need them). Anything user-facing stays synchronous, often on Haiku, often with thinking off.
The 400-Dollar Overnight Bill post-mortem walks through what happens when this boundary blurs and a "should be batchable" job ends up on the live API at full price.
Before you ship a batch job, check:
- custom_id set to your stable record ID, unique per batch
- reconciliation that sorts results into succeeded, errored, and missing, with a path for each bucket
- idempotent webhook handlers that re-fetch the batch's own status
- batch IDs stored alongside the records they generated

The Batch API is the cheapest legitimate cost lever on the Claude platform after prompt caching. The tax is a real architecture change to async, and you should not pretend otherwise. But for teams running any kind of bulk classification, generation, or analysis, it is straightforwardly half the bill for the same answers.
For more on optimizing Claude in production, see our writeups on prompt caching and tool use patterns.