
TL;DR
Cut Claude API spend by up to 90% with prompt caching. Real numbers, TypeScript SDK code, and the gotchas Anthropic's docs gloss over.
| Resource | Link |
|---|---|
| Prompt Caching Documentation | docs.anthropic.com/en/docs/build-with-claude/prompt-caching |
| Messages API Reference | docs.anthropic.com/en/api/messages |
| Anthropic SDK (TypeScript) | github.com/anthropics/anthropic-sdk-typescript |
| Anthropic Pricing | anthropic.com/pricing |
| Claude Models | docs.anthropic.com/en/docs/about-claude/models |
Every team I have looked at running Claude in production with a system prompt over 2k tokens is paying full freight on tokens they could be getting for a tenth of the price. Prompt caching has been generally available on the Anthropic API for over a year, and it is still the single biggest cost lever most apps have not pulled. Once you set it up correctly, cached input tokens cost about 10% of normal input tokens on read, and your time-to-first-token on a long-context call drops from seconds to a few hundred milliseconds.
This guide is the version of the docs I wish I had the first time I shipped a caching layer to production. We will cover what the cache actually does, when it pays off, the SDK code you should ship, the cache-invalidation footguns, and how to monitor hit rate so you actually realize the savings instead of assuming them.
We covered the basics in our Prompt Caching Explained video on YouTube. This post is the long-form, production-grade companion.
Anthropic's prompt cache is a server-side, ephemeral cache keyed on the exact byte sequence of your prompt prefix. When you mark a block with cache_control: { type: "ephemeral" }, the API stores the model's internal state at that breakpoint. The next request that arrives within the TTL with the same prefix up to that breakpoint hits the cache.
There are two TTL options: the default 5-minute TTL, and a 1-hour TTL whose cache writes cost 2x the normal input price.
Cache writes are slightly more expensive than a normal call. Cache reads are dramatically cheaper. The math is simple: if you reuse the same prefix more than once or twice within the TTL window, caching wins. If you do not, it loses.
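To make that break-even concrete, here is a minimal sketch of the arithmetic. The multipliers are assumptions expressed relative to the normal input price (1.25x for a 5-minute write, 2x for a 1-hour write, 0.1x for a read), so confirm them against the pricing page for your model. `cachedCost`, `uncachedCost`, and `breakEvenRequests` are hypothetical helpers, not part of the SDK.

```typescript
// Cost multipliers relative to the normal input-token price.
// Assumed values; confirm against current pricing for your model.
const WRITE_5M = 1.25;
const WRITE_1H = 2.0;
const READ = 0.1;

/**
 * Relative cost of sending the same prefix `requests` times inside one TTL
 * window, in units of "one uncached send of the prefix".
 * The first request writes the cache; every later request reads it.
 */
function cachedCost(requests: number, writeMultiplier: number): number {
  if (requests <= 0) return 0;
  return writeMultiplier + (requests - 1) * READ;
}

/** Without caching, every request pays full price for the prefix. */
function uncachedCost(requests: number): number {
  return Math.max(requests, 0);
}

/** Smallest request count at which caching is strictly cheaper. */
function breakEvenRequests(writeMultiplier: number): number {
  let n = 1;
  while (cachedCost(n, writeMultiplier) >= uncachedCost(n)) n++;
  return n;
}

// Print uncached vs cached (5-minute TTL) relative cost for a few request counts.
for (const n of [1, 2, 5, 20]) {
  console.log(n, uncachedCost(n), cachedCost(n, WRITE_5M).toFixed(2));
}
```

Under these assumed multipliers, a single request is cheaper uncached, and caching starts to win from the second or third request in the window, depending on the TTL.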
What the docs do not loudly say:
The break-even is roughly: if a prefix is reused more than once or twice within five minutes, cache it. If you are running Fable 5, our post on Fable 5 prompt caching economics works through this break-even at that model's specific cache-write and cache-hit rates. Specific scenarios where the ROI is obvious:
Where caching is overhead:
Here is a minimal but production-shaped example using the official Anthropic SDK. It caches a long system prompt and a static knowledge-base block, leaving the user message uncached.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior support engineer for Acme Cloud.
Answer using only the provided knowledge base. Cite section IDs.
[... 3000 more tokens of policy, tone, and examples ...]`;

const KNOWLEDGE_BASE = await loadKnowledgeBase(); // ~15k tokens, stable per deploy

export async function answer(userMessage: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: KNOWLEDGE_BASE,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });

  // Inspect cache usage on every response
  console.log({
    cache_creation: response.usage.cache_creation_input_tokens,
    cache_read: response.usage.cache_read_input_tokens,
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  });

  return response;
}
```
Two breakpoints, two cache layers. The first request after a deploy pays the write cost on both blocks. Every subsequent request inside five minutes reads both at ~10% of input price.
For a multi-turn chat, you want a third breakpoint on the last assistant message so the growing conversation history caches as it goes:
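```typescript
// Find the most recent assistant turn, whether or not the new user
// message has already been appended to `history`.
const lastAssistantIndex = history.map((m) => m.role).lastIndexOf("assistant");

const messages = history.map((m, i) => ({
  role: m.role,
  // Mark only that turn so a single breakpoint covers the whole history so far.
  content:
    i === lastAssistantIndex
      ? [{ type: "text" as const, text: m.content, cache_control: { type: "ephemeral" as const } }]
      : m.content,
}));
```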
This is a common pattern in production chat apps built on Claude. It keeps the rolling conversation cached with a single moving breakpoint instead of burning one per turn.
1. Whitespace and JSON ordering. If your system prompt is built from a template that re-serializes a JSON object, key ordering can change between requests and silently kill the cache. Lock down your serializer or feed strings, not objects.
2. Timestamps in system prompts. "Today's date is 2026-04-29" at the top of the system prompt is a cache killer for every request from the next day onward. Move it into the user message or a separate uncached block at the end of the system array.
3. Tool definitions count as part of the prefix. If you reorder tools, you invalidate the cache. Sort tools deterministically.
4. The 5-minute TTL is rolling. Each cache hit refreshes the TTL. A high-traffic prompt stays warm forever. A low-traffic one dies between requests, and you pay write cost again. For prompts you call less than once every five minutes but want kept warm, use the 1-hour TTL even though the write cost is 2x - the math still wins above ~12 reads in an hour.
5. Streaming responses still report cache stats. They arrive in the final message_delta event. Do not assume cached calls skip usage reporting.
6. Cache misses on "obviously identical" prompts. Almost always one of: trailing whitespace, a Unicode normalization difference, a model version change, or a changed tool_choice. The cache key includes more than just the text.
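Several of these footguns come down to building the prefix deterministically. The sketch below is one way to do that, under stated assumptions: `stableStringify`, `sortTools`, `normalizeText`, and `firstDifference` are hypothetical helpers written for this post, and the `ToolDef` shape is a simplified stand-in for your real tool definitions.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

/** JSON.stringify with object keys sorted, so equal data always yields equal text. */
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value as { [key: string]: Json };
  const parts = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k] as Json)}`);
  return `{${parts.join(",")}}`;
}

// Simplified tool shape, assumed for illustration.
interface ToolDef {
  name: string;
  description: string;
  input_schema: Json;
}

/** Return a copy of the tools sorted by name, so upstream reordering cannot change the prefix. */
function sortTools(tools: ToolDef[]): ToolDef[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

/** Normalize Unicode to NFC and strip trailing spaces and tabs from every line. */
function normalizeText(text: string): string {
  return text.normalize("NFC").replace(/[ \t]+$/gm, "");
}

/** Index of the first differing UTF-16 code unit between two prefixes, or -1 if identical. */
function firstDifference(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : shared;
}
```

When two requests that should share a prefix do not hit the cache, logging the serialized prefix from each and running `firstDifference` on the pair usually points straight at the offending field.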
The mistake teams make with RAG plus caching is caching the wrong thing. The retrieved chunks are usually the most dynamic part of the prompt. The instructions and tool list are the most static. Cache those.
A clean layering for a RAG agent:
You get two breakpoints on Layer 1 and Layer 2. Layer 3 is fresh per query, no breakpoint. Layer 4 is just the user message. This pattern routinely takes a 25k-token RAG call from $0.075 input cost to about $0.012 on cache hits, with sub-second time-to-first-token.
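One way to express that layering in code is sketched below. The layer contents are assumptions for illustration: instructions as Layer 1, a static reference block as Layer 2, retrieved chunks as Layer 3, and the user question as Layer 4. `buildRagRequest` and its option names are hypothetical. It only builds a request body, which you would then pass to `client.messages.create`.

```typescript
type CacheControl = { type: "ephemeral" };

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

interface RagRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user"; content: TextBlock[] }[];
}

function buildRagRequest(opts: {
  model: string;
  instructions: string;      // Layer 1: static, cached
  staticReference: string;   // Layer 2: static per deploy, cached
  retrievedChunks: string[]; // Layer 3: fresh per query, no breakpoint
  question: string;          // Layer 4: the user message
}): RagRequest {
  const ephemeral: CacheControl = { type: "ephemeral" };

  // Dynamic content goes after both breakpoints so it never touches the cached prefix.
  const dynamic: TextBlock[] = [];
  if (opts.retrievedChunks.length > 0) {
    dynamic.push({ type: "text", text: opts.retrievedChunks.join("\n\n") });
  }
  dynamic.push({ type: "text", text: opts.question });

  return {
    model: opts.model,
    max_tokens: 1024,
    system: [
      { type: "text", text: opts.instructions, cache_control: ephemeral },
      { type: "text", text: opts.staticReference, cache_control: ephemeral },
    ],
    messages: [{ role: "user", content: dynamic }],
  };
}
```

Keeping the retrieved chunks in the user turn is a design choice, not a requirement. What matters is that nothing query-specific appears before the second breakpoint.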
The most common failure mode is shipping caching, declaring victory, and never noticing your hit rate is 30% because some request path is breaking the prefix. You need observability from day one. Every response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens - log both, then aggregate.
A useful metric is cache hit ratio:
hit_ratio = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens)
A healthy production prompt should sit above 0.85. Below 0.6, something is breaking the prefix and you should investigate. We built CodeBurn specifically to surface this metric across runs, and the same pattern works inside any FinOps dashboard you already have. The 400-Dollar Overnight Bill post-mortem walks through what happens when you do not.
Set an alert when hit ratio drops below a threshold for any prompt template. Almost every cache regression we have shipped was caught this way: a deploy added a new dynamic field to the system prompt, and the alert fired within an hour.
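A minimal sketch of that aggregation is below. It assumes you tag each call with a template name of your own choosing; `CacheHitTracker` is a hypothetical in-memory helper, and in production you would more likely emit these counters to whatever metrics system you already run. The `Usage` fields mirror the ones on the API response, where `input_tokens` counts only the uncached input.

```typescript
// Subset of the response `usage` object that matters for cache accounting.
interface Usage {
  input_tokens: number; // uncached input tokens
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface Totals {
  read: number;
  created: number;
  uncached: number;
}

class CacheHitTracker {
  private totals: Record<string, Totals> = {};

  /** Add one response's usage to the running totals for a prompt template. */
  record(template: string, usage: Usage): void {
    const t = this.totals[template] ?? { read: 0, created: 0, uncached: 0 };
    t.read += usage.cache_read_input_tokens ?? 0;
    t.created += usage.cache_creation_input_tokens ?? 0;
    t.uncached += usage.input_tokens;
    this.totals[template] = t;
  }

  /** hit_ratio = read / (read + created + uncached), or null if nothing was recorded. */
  hitRatio(template: string): number | null {
    const t = this.totals[template];
    if (!t) return null;
    const total = t.read + t.created + t.uncached;
    return total === 0 ? null : t.read / total;
  }

  /** Templates whose hit ratio is below the threshold; candidates for an alert. */
  templatesBelow(threshold: number): string[] {
    return Object.keys(this.totals).filter((name) => {
      const ratio = this.hitRatio(name);
      return ratio !== null && ratio < threshold;
    });
  }
}
```

Call `record` after every response and run `templatesBelow(0.6)` on a schedule. Resetting the totals per time window keeps an old warm period from masking a fresh regression.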
The cache is scoped to your API organization, not per-user. Two users hitting the same prompt prefix share the cache. This is great for shared system prompts and tool definitions. It is dangerous for anything that should be user-isolated.
Concrete patterns:
Prompt caching is the closest thing to a free lunch in the Claude API. It is also the easiest optimization to ship broken and never notice. Get the breakpoints right, monitor the hit ratio, and you will pay roughly an order of magnitude less for the same workload.
For more on optimizing Claude in production, see our writeups on tool use patterns and building MCP servers.
Cached input tokens cost approximately 10% of the normal input token price when read from cache. For a typical production workload with a 15k-token system prompt that achieves an 85%+ cache hit rate, you can expect to save 70-90% on input token costs for those cached portions. The exact savings depend on your hit rate, prompt size, and whether you use the 5-minute or 1-hour TTL option.
The minimum cacheable block is 1024 tokens for Claude Sonnet and Opus models, and 2048 tokens for Claude Haiku. The threshold varies by model version, so confirm it in the prompt caching documentation for the model you run. If your system prompt or cached block is smaller than this threshold, the cache control annotation is silently ignored and you pay full price. This is why caching is most effective for longer system prompts, knowledge bases, and tool definitions.
The most common causes of low cache hit rates are: timestamps or request IDs appearing before cache breakpoints, non-deterministic JSON serialization changing key order between requests, trailing whitespace differences, tool definitions being reordered between calls, or Unicode normalization differences. The cache key is an exact byte-sequence match, so even invisible differences invalidate the cache. Log cache_creation_input_tokens and cache_read_input_tokens on every response to diagnose.
You can use up to four cache breakpoints per request. Most production applications need two: one after the system prompt and one after static document context or tool definitions. For multi-turn chat, a third breakpoint on the last assistant message lets the conversation history cache as it grows. Cache hits are not all-or-nothing - a request can hit the first breakpoint, miss the second, and you pay read price for the first chunk plus write price for the second.
Use the 5-minute TTL (default) for high-traffic prompts called more than once per 5 minutes. A 5-minute cache write costs about 1.25x normal input, and reads cost about 10%. Use the 1-hour TTL for lower-traffic prompts that still benefit from caching - it costs 2x normal input on write but still 10% on read. If you get more than roughly 12 reads per hour, the 1-hour TTL pays for itself even with the higher write cost. The 5-minute TTL is rolling and refreshes on each hit, so consistently accessed prompts stay warm indefinitely.
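As a rough sketch, that choice can be encoded as a helper that picks a `cache_control` value from the typical gap between requests for a given prefix. `cacheControlForGap` is hypothetical. The `ttl: "1h"` field follows the prompt caching documentation, but whether your API or SDK version needs a beta header to accept it is something to verify rather than assume.

```typescript
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };

const FIVE_MINUTES_S = 5 * 60;
const ONE_HOUR_S = 60 * 60;

/**
 * Pick a cache_control value from the typical gap (in seconds) between
 * requests that share a prefix. Returns undefined when requests are so
 * sparse that the entry would expire between them.
 */
function cacheControlForGap(typicalGapSeconds: number): CacheControl | undefined {
  if (typicalGapSeconds < FIVE_MINUTES_S) {
    return { type: "ephemeral" }; // default 5-minute TTL, refreshed on every hit
  }
  if (typicalGapSeconds < ONE_HOUR_S) {
    return { type: "ephemeral", ttl: "1h" }; // higher write cost, survives longer gaps
  }
  return undefined; // caching would only add write cost
}
```

Spread the result into a block only when it is defined, for example `{ type: "text", text, ...(cc ? { cache_control: cc } : {}) }`.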
Yes, prompt caching works with streaming. Streaming responses report cache statistics in the final message_delta event. You get the same cache_creation_input_tokens and cache_read_input_tokens fields as with non-streaming calls. The cost savings are identical regardless of whether you stream the output.
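If you consume raw stream events, a defensive way to collect usage is to merge whatever usage fields appear on `message_start` and `message_delta`, rather than depending on a single event carrying all of them. The sketch below works on already-parsed events. The `StreamEvent` type is a simplified assumption covering only the fields used here, and `collectUsage` is a hypothetical helper.

```typescript
const USAGE_KEYS = [
  "input_tokens",
  "output_tokens",
  "cache_creation_input_tokens",
  "cache_read_input_tokens",
] as const;

type UsageKey = (typeof USAGE_KEYS)[number];
type UsageFields = Partial<Record<UsageKey, number | null>>;

// Simplified event shape, assumed for illustration.
interface StreamEvent {
  type: string;
  message?: { usage?: UsageFields }; // present on message_start
  usage?: UsageFields;               // present on message_delta
}

/** Merge usage from a finished stream; later events overwrite earlier counts. */
function collectUsage(events: StreamEvent[]): Record<UsageKey, number> {
  const totals: Record<UsageKey, number> = {
    input_tokens: 0,
    output_tokens: 0,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 0,
  };
  for (const event of events) {
    const usage =
      event.type === "message_start" ? event.message?.usage
      : event.type === "message_delta" ? event.usage
      : undefined;
    if (!usage) continue;
    for (const key of USAGE_KEYS) {
      const value = usage[key];
      // Stream usage counts are cumulative, so keep the latest value instead of summing.
      if (typeof value === "number") totals[key] = value;
    }
  }
  return totals;
}
```

With the official SDK's streaming helper, awaiting the final message and reading its `usage` should give you the same numbers with less code. This reducer is mainly useful when you proxy or log raw events.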
Yes, the cache is shared across users. It is scoped to your API organization, not per-user or per-request. Two different users hitting the same prompt prefix share the cache, which is excellent for maximizing hit rates on shared system prompts and tool definitions. For multi-tenant applications, put the org-wide system prompt in the first cache block and per-user context in the second - the shared layer achieves a much higher hit rate across all users.
Monitor the usage.cache_creation_input_tokens and usage.cache_read_input_tokens fields returned in every API response. Calculate your cache hit ratio as cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens). A healthy production prompt should have a hit ratio above 0.85. Set an alert for when hit ratio drops below 0.6 to catch regressions early. The first request after a deploy or cache expiry will show cache creation tokens; subsequent requests within the TTL should show cache read tokens.