
TL;DR
Cut Claude API spend by up to 90% with prompt caching. Real numbers, TypeScript SDK code, and the gotchas Anthropic's docs gloss over.
Every team I have looked at running Claude in production with a system prompt over 2k tokens is paying full freight on tokens they could be getting for a tenth of the price. Prompt caching has been generally available on the Anthropic API for over a year, and it is still the single biggest cost lever most apps have not pulled. Once you set it up correctly, cached input tokens cost about 10% of normal input tokens on read, and your time-to-first-token on a long-context call drops from seconds to a few hundred milliseconds.
This guide is the version of the docs I wish I had the first time I shipped a caching layer to production. We will cover what the cache actually does, when it pays off, the SDK code you should ship, the cache-invalidation footguns, and how to monitor hit rate so you actually realize the savings instead of assuming them.
We covered the basics in our Prompt Caching Explained video on YouTube. This post is the long-form, production-grade companion.
Anthropic's prompt cache is a server-side, ephemeral cache keyed on the exact byte sequence of your prompt prefix. When you mark a block with cache_control: { type: "ephemeral" }, the API stores the model's internal state at that breakpoint. The next request that arrives within the TTL with the same prefix up to that breakpoint hits the cache.
There are two TTL options:
- Five minutes, the default. Writes cost 1.25x the base input price, and the TTL refreshes on every hit.
- One hour. Writes cost 2x the base input price, for prefixes reused too infrequently to stay warm on the five-minute clock.
Cache writes are slightly more expensive than a normal call. Cache reads are dramatically cheaper. The math is simple: if you reuse the same prefix more than once or twice within the TTL window, caching wins. If you do not, it loses.
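As a back-of-the-envelope check, here is what that math looks like in code. The per-token prices are assumptions based on published Claude Sonnet rates at the time of writing (roughly $3 per million input tokens, 1.25x for a 5-minute cache write, 0.1x for a read); substitute your own model's pricing.

// Break-even sketch. Prices are assumptions; check current Claude pricing.
const INPUT_PRICE = 3.0;   // $ per million input tokens (Sonnet-class, assumed)
const CACHE_WRITE = 3.75;  // 1.25x base, 5-minute cache write (assumed)
const CACHE_READ = 0.3;    // 0.1x base, cache read (assumed)

function uncachedCost(prefixTokens: number, calls: number): number {
  return (prefixTokens / 1_000_000) * INPUT_PRICE * calls;
}

function cachedCost(prefixTokens: number, calls: number): number {
  // One write on the first call, a read on every call after it.
  return (prefixTokens / 1_000_000) * (CACHE_WRITE + CACHE_READ * (calls - 1));
}

// A 10k-token prefix called twice inside the TTL:
// uncached ~$0.060, cached ~$0.041. Caching already wins on the second call.
console.log(uncachedCost(10_000, 2), cachedCost(10_000, 2));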
What the docs do not loudly say:
- There is a minimum cacheable prefix length (1,024 tokens on most models, higher on Haiku); below that, your breakpoint is silently ignored and you pay full price.
- You get at most four cache breakpoints per request, so spend them on the largest, most stable blocks.
- The match is exact. A single changed byte anywhere before a breakpoint is a full miss, not a partial hit.
The break-even is roughly: if a prefix is reused more than once or twice within five minutes, cache it. Specific scenarios where the ROI is obvious:
- Chat apps with a long system prompt, where every turn replays the same instructions.
- RAG and support bots that prepend the same knowledge base or policy document to every request.
- Agents with large, stable tool definitions that are identical across calls.
- Batch or evaluation jobs that run many inputs against one fixed instruction block.
Where caching is overhead:
- One-off prompts, or prefixes that are not reused within the TTL; you pay the write premium and never earn it back.
- Prompts that change on every request, such as per-user data or timestamps woven into the system prompt.
- Prefixes below the minimum cacheable length, where the breakpoint does nothing.
Here is a minimal but production-shaped example using the official Anthropic SDK. It caches a long system prompt and a static knowledge-base block, leaving the user message uncached.
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const SYSTEM_PROMPT = `You are a senior support engineer for Acme Cloud.
Answer using only the provided knowledge base. Cite section IDs.
[... 3000 more tokens of policy, tone, and examples ...]`;
const KNOWLEDGE_BASE = await loadKnowledgeBase(); // ~15k tokens, stable per deploy
export async function answer(userMessage: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: KNOWLEDGE_BASE,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });

  // Inspect cache usage on every response
  console.log({
    cache_creation: response.usage.cache_creation_input_tokens,
    cache_read: response.usage.cache_read_input_tokens,
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  });

  return response;
}
Two breakpoints, two cache layers. The first request after a deploy pays the write cost on both blocks. Every subsequent request inside five minutes reads both at ~10% of input price.
For a multi-turn chat, you want a third breakpoint on the last assistant message so the growing conversation history caches as it goes:
// history holds prior turns only; the new user message is appended afterwards,
// so the last element is expected to be the assistant's most recent reply.
const messages = history.map((m, i) => {
  const isLastAssistant =
    m.role === "assistant" && i === history.length - 1;
  return {
    role: m.role,
    content: isLastAssistant
      ? [{ type: "text", text: m.content, cache_control: { type: "ephemeral" } }]
      : m.content,
  };
});
This is the standard pattern for production chat apps on Claude. It keeps the rolling conversation cached without burning a breakpoint per turn.
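To make the full request concrete, here is a sketch of how that mapped history might be combined with the fresh user turn. It reuses the SYSTEM_PROMPT and KNOWLEDGE_BASE placeholders from the earlier example; newUserMessage is a hypothetical variable holding the incoming turn.

// Sketch: one turn of a cached multi-turn chat, with three cache layers:
// system prompt, knowledge base, and conversation history.
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    { type: "text", text: KNOWLEDGE_BASE, cache_control: { type: "ephemeral" } },
  ],
  messages: [
    ...messages,                               // cached up to the last assistant turn
    { role: "user", content: newUserMessage }, // the fresh turn stays uncached
  ],
});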
1. Whitespace and JSON ordering. If your system prompt is built from a template that re-serializes a JSON object, key ordering can change between requests and silently kill the cache. Lock down your serializer or feed strings, not objects; a sketch of the fix follows this list.
2. Timestamps in system prompts. "Today's date is 2026-04-29" at the top of the system prompt is a cache killer for every request from the next day onward. Move it into the user message or a separate uncached block at the end of the system array.
3. Tool definitions count as part of the prefix. If you reorder tools, you invalidate the cache. Sort tools deterministically.
4. The 5-minute TTL is rolling. Each cache hit refreshes the TTL. A high-traffic prompt stays warm forever. A low-traffic one dies between requests, and you pay the write cost again. For prompts you call less than once every five minutes but want kept warm, use the 1-hour TTL even though the write cost is 2x: at 0.1x reads, it still beats paying full input price after only a few calls in the hour.
5. Streaming responses still report cache stats. They arrive in the final message_delta event. Do not assume cached calls skip usage reporting.
6. Cache misses on "obviously identical" prompts. Almost always one of: trailing whitespace, a Unicode normalization difference, a model version change, or a different max_tokens. The cache key includes more than just the text.
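For gotchas 1 and 3, the fix is to make everything that feeds the prefix deterministic before it leaves your code. A minimal sketch, assuming a hypothetical policyConfig object interpolated into the system prompt and a registeredTools array of tool definitions:

// Deterministic JSON: sort keys recursively so re-serialization never
// reorders the prompt text between requests.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Deterministic tool order: sort by name before every request so the cached
// prefix never changes just because registration order did.
const tools = [...registeredTools].sort((a, b) => a.name.localeCompare(b.name));

const systemPrompt = `You are a senior support engineer for Acme Cloud.
Policy config: ${stableStringify(policyConfig)}`;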
The mistake teams make with RAG plus caching is caching the wrong thing. The retrieved chunks are usually the most dynamic part of the prompt. The instructions and tool list are the most static. Cache those.
A clean layering for a RAG agent:
Layer 1: system instructions, persona, and output rules; static per deploy, cache breakpoint.
Layer 2: tool definitions and any other per-deploy static context; cache breakpoint.
Layer 3: the retrieved chunks for this query; fresh every time, no breakpoint.
Layer 4: the user's question.
You get two breakpoints on Layer 1 and Layer 2. Layer 3 is fresh per query, no breakpoint. Layer 4 is just the user message. This pattern routinely takes a 25k-token RAG call from $0.075 input cost to about $0.012 on cache hits, with sub-second time-to-first-token.
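Here is what that layering might look like as a single request. RAG_SYSTEM_PROMPT, searchTools, retrieveChunks, and question are hypothetical placeholders for your own instructions, tool schemas, and retriever; the client is the same SDK instance as above.

// Layers 1 and 2 are static per deploy and carry the two breakpoints.
// Layers 3 and 4 stay uncached and change on every query.
const chunks = await retrieveChunks(question); // Layer 3: fresh per query

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  tools: searchTools.map((tool, i) =>
    i === searchTools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } } // Layer 2: breakpoint on the last tool
      : tool
  ),
  system: [
    {
      type: "text",
      text: RAG_SYSTEM_PROMPT, // Layer 1: instructions, persona, output rules
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: `Context:\n${chunks.join("\n---\n")}` }, // Layer 3
        { type: "text", text: question },                             // Layer 4
      ],
    },
  ],
});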
The most common failure mode is shipping caching, declaring victory, and never noticing your hit rate is 30% because some request path is breaking the prefix. You need observability from day one. Every response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens — log both, then aggregate.
A useful metric is cache hit ratio:
hit_ratio = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens)
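In code, that is a running aggregation over the usage block of every response. A minimal sketch; the local Usage type simply mirrors the fields on response.usage, and how you collect the usage objects (log pipeline, metrics counter) is up to you.

// Minimal local type mirroring the usage block on each Messages API response.
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Aggregate usage across responses for one prompt template and
// compute the hit ratio defined above.
function cacheHitRatio(usages: Usage[]): number {
  let read = 0;
  let written = 0;
  let uncached = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens ?? 0;
    written += u.cache_creation_input_tokens ?? 0;
    uncached += u.input_tokens; // input_tokens excludes cached and created tokens
  }
  const total = read + written + uncached;
  return total === 0 ? 0 : read / total;
}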
A healthy production prompt should sit above 0.85. Below 0.6, something is breaking the prefix and you should investigate. We built CodeBurn specifically to surface this metric across runs, and the same pattern works inside any FinOps dashboard you already have. The 400-Dollar Overnight Bill post-mortem walks through what happens when you do not.
Set an alert when hit ratio drops below a threshold for any prompt template. Almost every cache regression we have shipped was caught this way: a deploy added a new dynamic field to the system prompt, and the alert fired within an hour.
The cache is scoped to your API organization, not per-user. Two users hitting the same prompt prefix share the cache. This is great for shared system prompts and tool definitions. It is dangerous for anything that should be user-isolated.
Concrete patterns:
- Shared system prompts, tool definitions, and knowledge bases belong in the cached prefix; every user benefits from the same warm cache.
- Per-user context (documents, account data) can still be cached, but each user's distinct prefix is its own cache entry with its own write cost, so the reuse has to happen within that user's traffic.
- Keep per-user or per-session values out of blocks that are meant to be shared, or the shared cache fragments into thousands of single-use entries.
Prompt caching is the closest thing to a free lunch in the Claude API. It is also the easiest optimization to ship broken and never notice. Get the breakpoints right, monitor the hit ratio, and you will pay roughly an order of magnitude less for the same workload.
For more on optimizing Claude in production, see our writeups on tool use patterns and building MCP servers.