
TL;DR
How to ship Claude's Batch API in production. 50% cost savings, TypeScript SDK code, JSONL request format, and the async architecture gotchas that bite at 100k requests.
If you have a Claude workload that doesn't need to answer in real time, you are probably overpaying by 2x. The Batch API charges half - half - the per-token rate of the synchronous API in exchange for results within 24 hours. For nightly reports, content classification, synthetic data generation, document analysis, and most agent evaluations, that trade is a no-brainer. And yet most teams I look at still run those workloads through the live API because nobody wants to refactor the request loop.
This is the version of the docs I wish I had the first time I moved a real workload to batch. We will cover the request format, the SDK code you should ship, the architecture changes async forces on your stack, and the failure modes you only learn about after a 100k-request batch silently drops 4% of its results.
We walked through a real migration in our Batch Processing for Scale: 100k Requests/Day video on YouTube. This post is the production-grade companion.
Batch is the right answer when:
- Nobody is watching a spinner - results can land minutes or hours later without anyone noticing.
- The workload is bulk and embarrassingly parallel: classification, report generation, synthetic data, evals, re-indexing.
- Each request is a one-shot call that doesn't depend on another request's output.

Batch is the wrong answer when:
- A user is waiting on the response.
- The call sits inside a multi-step tool loop that needs intermediate results.
- You need a completion guarantee tighter than the 24-hour window.
The hybrid pattern - synchronous for user-facing work, batch for everything else - is what mature deployments look like. Customer-asked-a-question goes live; nightly data labeling goes batch.
You package up to 100,000 message-create requests into a single request file. You submit it. Anthropic processes them in parallel on separate, lower-priority infrastructure. You poll for completion or subscribe to a webhook. When done, you download a results file and match results back to your inputs by custom_id.
Three properties to internalize:
- It is fully asynchronous. You submit, you wait, and results arrive any time inside the 24-hour window - usually much sooner.
- Rows succeed or fail individually. A batch can end with some rows errored or missing, so you reconcile; you don't assume.
- Results come back in arbitrary order, keyed only by custom_id. Anyone who ignores this and zips inputs to outputs gets bitten.

Here is a minimal but production-shaped batch submission using the official Anthropic SDK. It builds requests with stable IDs, submits, polls, and reconciles results.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

type ClassifyInput = { id: string; text: string };

const SYSTEM = "Classify the input as one of: bug, feature, question, spam. Return only the label.";

// One batch request per input row, keyed by the row's own primary key.
function buildRequests(inputs: ClassifyInput[]) {
  return inputs.map((input) => ({
    custom_id: input.id,
    params: {
      model: "claude-haiku-4-5" as const,
      max_tokens: 16,
      system: SYSTEM,
      messages: [{ role: "user" as const, content: input.text }],
    },
  }));
}

export async function classifyBatch(inputs: ClassifyInput[]) {
  const batch = await client.messages.batches.create({
    requests: buildRequests(inputs),
  });

  // Poll until the batch reaches its terminal state.
  let status = batch;
  while (status.processing_status !== "ended") {
    await new Promise((r) => setTimeout(r, 30_000));
    status = await client.messages.batches.retrieve(batch.id);
  }

  // Stream the JSONL results and key them back to inputs by custom_id.
  const results: Record<string, string> = {};
  for await (const result of await client.messages.batches.results(batch.id)) {
    if (result.result.type === "succeeded") {
      const block = result.result.message.content[0];
      results[result.custom_id] = block.type === "text" ? block.text.trim() : "";
    } else {
      results[result.custom_id] = `__error__:${result.result.type}`;
    }
  }
  return results;
}
A few non-obvious things this captures:
- custom_id is yours to define and must be unique per batch. We use whatever primary key the input row has in our database. This is how you reconcile results to records.
- The results() iterator streams a JSONL file. Do not try to load it into memory for large batches; iterate.
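If the batch is large, stream each result to disk (or straight into your database) as you iterate rather than accumulating a giant in-memory map. A minimal sketch of the disk variant, assuming the client from the snippet above and a local results.jsonl path:

import { createWriteStream } from "node:fs";

// Stream batch results to a local JSONL file instead of holding them in memory.
// Assumes `client` is the Anthropic instance from the snippet above.
export async function resultsToDisk(batchId: string, path = "results.jsonl") {
  const out = createWriteStream(path, { flags: "w" });
  for await (const r of await client.messages.batches.results(batchId)) {
    const line =
      r.result.type === "succeeded"
        ? { id: r.custom_id, ok: true, output: r.result.message.content }
        : { id: r.custom_id, ok: false, reason: r.result.type };
    out.write(JSON.stringify(line) + "\n");
  }
  out.end();
}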
Polling at scale is a trap. Each batch runs for an unknown duration, and a fleet of polling workers either churns CPU or sleeps too long. Webhooks are the right pattern.
Configure a webhook endpoint in your Anthropic console or via API, and Anthropic will POST a notification when a batch finishes. Your handler:
- acknowledges the event quickly and does the heavy lifting off the request path,
- re-fetches the batch by ID and confirms its processing_status is "ended",
- streams the results, reconciles them against the expected IDs, and marks the job complete.
Two operational notes from production: webhooks can fire more than once (idempotent handlers, please), and an event can arrive before the batch's own status field has caught up. Trust the batch's status when you fetch it, not the event payload alone.
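A minimal handler sketch that folds in both notes - it treats the event as a hint, re-fetches the batch, and is safe to run twice. The Express wiring, the event's batch-id field, and the persistence helpers are assumptions; `client` is the Anthropic instance from the first snippet and `reconcile` is the helper defined in the next section.

import express from "express";

// Hypothetical persistence helpers - wire these to your own store.
declare function alreadyProcessed(batchId: string): Promise<boolean>;
declare function markProcessed(batchId: string): Promise<void>;
declare function expectedIdsFor(batchId: string): Promise<Set<string>>;
declare function persistResults(
  batchId: string,
  ok: { id: string; output: string }[],
  errors: { id: string; reason: string }[],
  missing: string[],
): Promise<void>;

const app = express();
app.use(express.json());

app.post("/webhooks/anthropic-batch", async (req, res) => {
  res.sendStatus(200); // ack fast; do the work out of band

  // The batch-id field name here is an assumption - check your actual payload.
  const batchId: string | undefined = req.body?.data?.id ?? req.body?.batch_id;
  if (!batchId || (await alreadyProcessed(batchId))) return; // idempotency guard

  // Trust the batch's own status, not the event payload.
  const batch = await client.messages.batches.retrieve(batchId);
  if (batch.processing_status !== "ended") return;

  const { ok, errors, missing } = await reconcile(batchId, await expectedIdsFor(batchId));
  await persistResults(batchId, ok, errors, missing);
  await markProcessed(batchId);
});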
The bug you don't notice in dev and absolutely will hit in prod: a 99,500-row result file for a 100,000-row batch, and you ship the answer assuming all rows came back.
The reconciliation pattern that survives:
// Compare what came back against what we submitted.
// `client` is the Anthropic instance from the first snippet.
async function reconcile(batchId: string, expectedIds: Set<string>) {
  const seen = new Set<string>();
  const errors: { id: string; reason: string }[] = [];
  const ok: { id: string; output: string }[] = [];

  for await (const r of await client.messages.batches.results(batchId)) {
    seen.add(r.custom_id);
    if (r.result.type === "succeeded") {
      const block = r.result.message.content[0];
      ok.push({ id: r.custom_id, output: block.type === "text" ? block.text : "" });
    } else {
      errors.push({ id: r.custom_id, reason: r.result.type });
    }
  }

  // Rows we submitted but never saw in the results file.
  const missing = [...expectedIds].filter((id) => !seen.has(id));
  return { ok, errors, missing };
}
You always end up with three buckets: succeeded, errored, and missing. Each bucket needs a path. Errored rows are eligible for retry on the live API or a follow-up batch. Missing rows are the silent failure mode and almost always indicate a bug in how you constructed the batch (duplicate custom_ids, oversized rows, encoding problems).
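For the errored and missing buckets, a small follow-up pass on the synchronous API is usually enough. A sketch, assuming the client and SYSTEM prompt from the first snippet; getInputText is a hypothetical lookup from custom_id back to the original row text:

// Retry errored and missing rows one-by-one on the live API.
declare function getInputText(id: string): Promise<string>;

export async function retryLeftovers(ids: string[]) {
  const recovered: Record<string, string> = {};
  for (const id of ids) {
    const msg = await client.messages.create({
      model: "claude-haiku-4-5",
      max_tokens: 16,
      system: SYSTEM,
      messages: [{ role: "user", content: await getInputText(id) }],
    });
    const block = msg.content[0];
    recovered[id] = block.type === "text" ? block.text.trim() : "";
  }
  return recovered;
}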
The 50% discount applies to both input and output tokens. Cache reads, cache writes, thinking tokens - all 50% off in batch mode. There is no separate "batch tier" to opt into beyond using the batch endpoint.
The savings stack with caching. Take a nightly classification batch of 100k rows that shares a 4k-token instruction prefix; the back-of-the-envelope math looks like the sketch below, where per-million-token prices and per-row token counts are placeholders you should swap for current list prices:
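// Back-of-the-envelope comparison: live vs. batch+cache for 100k rows.
// All prices and token counts below are placeholders, not current list prices.
const PRICES = { inPerM: 1.0, outPerM: 5.0 }; // USD per million tokens (placeholder)
const CACHE_READ_MULT = 0.1;                  // cache reads billed at a fraction of the input rate (assumption)
const BATCH_MULT = 0.5;                       // batch discount on everything

const rows = 100_000;
const sharedPrefix = 4_000; // tokens, identical across rows
const perRowInput = 200;    // tokens unique to each row (assumption)
const perRowOutput = 16;    // tokens out per row

// Live API, no caching: every row pays full price for the whole prefix.
const live =
  (rows * (sharedPrefix + perRowInput) * PRICES.inPerM +
    rows * perRowOutput * PRICES.outPerM) / 1_000_000;

// Batch + cache: rows that hit the cache pay the read rate on the prefix,
// and everything gets the 50% batch discount.
const batchCached =
  ((rows * sharedPrefix * CACHE_READ_MULT + rows * perRowInput) * PRICES.inPerM +
    rows * perRowOutput * PRICES.outPerM) * BATCH_MULT / 1_000_000;

console.log({ live, batchCached, savings: 1 - batchCached / live });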
For workloads with stable instructions and high request counts, batch + cache is the cheapest configuration available on the Claude API today.
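Wiring the cache in is just a matter of marking the shared prefix on each batch request. A sketch of the request shape, reusing the classification prompt from earlier; note that whether a given row lands as a cache read or a cache write inside a batch is best-effort (more on that below):

// Same batch requests as before, but with the shared system prompt marked
// as a cacheable prefix so rows that hit the cache are billed at the read rate.
function buildCachedRequests(inputs: ClassifyInput[]) {
  return inputs.map((input) => ({
    custom_id: input.id,
    params: {
      model: "claude-haiku-4-5" as const,
      max_tokens: 16,
      system: [
        {
          type: "text" as const,
          text: SYSTEM,
          cache_control: { type: "ephemeral" as const },
        },
      ],
      messages: [{ role: "user" as const, content: input.text }],
    },
  }));
}

// usage: client.messages.batches.create({ requests: buildCachedRequests(inputs) })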
For monitoring batch spend over time and catching regressions, CodeBurn tracks per-batch tokens and per-job cost so you can spot the runaway batch before the invoice shows up.
1. Daily content classification. Tag a few hundred thousand rows of incoming user content with categories. Haiku in batch mode runs about $1 per 100k rows for short text. We have shipped this on three different products.
2. Scheduled report generation. Generate per-customer summaries overnight. The latency floor is fine because the report lands in the inbox at 7am anyway. Sonnet in batch mode is the right model.
3. Synthetic data generation. Generating eval datasets, training sets, or persona-conditioned variants. These are big, embarrassingly parallel, and not user-facing. Perfect batch fit.
4. Eval suites. Running a 10k-prompt eval against five models takes 50,000 calls. Batch all of them, reconcile, compute metrics. Cuts both cost and wall-clock time vs. naive serial calling.
5. Document re-indexing. Re-summarizing or re-extracting from a corpus when you change your prompt. Every team I know that runs RAG eventually has a "rebuild all summaries" job; this is what batch is for.
1. Per-batch row limits. Up to 100,000 requests or 256MB total, whichever hits first. For larger jobs, shard into multiple batches. A simple sharding scheme: hash custom_id mod N (see the sketch after this list).
2. Rate limits still apply. Batches share quota with synchronous calls in a way that surprised us the first time. A huge batch can cause your synchronous calls to hit overloaded_error. If your live traffic shares the API key with batch, watch the interaction.
3. Caching works in batches, but hits are best-effort. A cache write inside a batch creates an entry that other batch rows and live calls can read, but rows are processed in parallel, so you cannot control which rows write the cache and which read it - expect more writes and fewer reads than the same workload run serially. Set cache_control on the shared prefix anyway; the rows that do hit the cache get the read rate, and the discount stacks with the batch discount.
4. Tool use works, but is rarely useful in batch. Batch is for one-shot calls. If a row requires a tool result, you cannot loop within the batch - you would need to take the tool calls out, run them yourself, and submit a follow-up batch. Most batch workloads should not use tools at all.
5. The 24-hour SLA is a ceiling, not a target. Median completion in our measurements is 5 to 30 minutes. Plan for 24h, but don't design dashboards that show "still processing" for the first 12 hours and panic users.
6. Batch IDs are forever in your DB. You will want to query by batch ID six months later when an auditor asks where row 47,318 of a March classification job went. Store batch IDs alongside the records they generated, not in a side table that gets pruned.
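The sharding scheme from caveat 1, sketched: a stable string hash over custom_id so the same row always lands in the same shard, then one batch submission per shard. Assumes the client and buildRequests from the first snippet; pick N so each shard stays under the 100k-request / 256MB limits.

// Stable FNV-1a hash so a given custom_id always maps to the same shard.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Split inputs into N shards and submit one batch per shard.
export async function submitSharded(inputs: ClassifyInput[], shardCount: number) {
  const shards: ClassifyInput[][] = Array.from({ length: shardCount }, () => []);
  for (const input of inputs) {
    shards[fnv1a(input.id) % shardCount].push(input);
  }

  const batchIds: string[] = [];
  for (const shard of shards) {
    if (shard.length === 0) continue;
    const batch = await client.messages.batches.create({ requests: buildRequests(shard) });
    batchIds.push(batch.id);
  }
  return batchIds; // store these alongside the records they cover (see caveat 6)
}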
The pattern mature production stacks land on:
[user request]
  |- if interactive     -> [synchronous API w/ caching]
  |- if backgroundable  -> [enqueue]
                             |- [batch worker pulls every N min]
                             |- [submits batch]
                             |- [webhook -> reconcile -> notify]
The interesting decision is the boundary. We move workloads to batch the moment we can answer "is anyone watching the spinner" with "no." Reports, daily syncs, classification, evals, summaries, embeddings (when we still need them). Anything user-facing stays synchronous, often on Haiku, often with thinking off.
The 400-Dollar Overnight Bill post-mortem walks through what happens when this boundary blurs and a "should be batchable" job ends up on the live API at full price.
Before you ship a batch job, check:
- custom_id set to your stable record ID, unique per batch
- reconciliation that sorts results into succeeded, errored, and missing, with a path for each bucket
- idempotent webhook handlers that re-fetch the batch's own status
- batch IDs stored alongside the records they generated

The Batch API is the cheapest legitimate cost lever on the Claude platform after prompt caching. The tax is a real architecture change to async, and you should not pretend otherwise. But for teams running any kind of bulk classification, generation, or analysis, it is straightforwardly half the bill for the same answers.
For more on optimizing Claude in production, see our writeups on prompt caching and tool use patterns.