
TL;DR
A production guide to Claude's extended thinking mode. Real cost math, TypeScript SDK code, and the tasks where reasoning tokens are worth 3x the spend.
Extended thinking is the Claude feature that most teams either ignore or turn on for everything. Both are wrong. It is a budget item - every reasoning token is billed, and a single thinking call can spend 3 to 10x the tokens of a normal completion. Used on the wrong workload, it is a slow, expensive way to get the same answer. Used on the right workload, it is the difference between a model that ships a buggy refactor and one that catches the off-by-one before it merges.
This is the version of the docs I wish I had the first time I plugged thinking into a real product. We will cover what the mode actually does under the hood, the cost-benefit math at the token level, the SDK code you should ship, and the task patterns where the ROI is obvious versus the ones where you are setting cash on fire.
We walked through several live examples in our Extended Thinking Real-World Examples video on YouTube. This post is the long-form, production-grade companion.
Extended thinking gives Claude a private scratchpad. When you enable it, the model emits a stream of internal thinking blocks before it produces the user-visible response. Those blocks are real tokens - they go through the same transformer, they cost real money, and they are returned to you in the API response so you can inspect them. They are not shown to your user unless you choose to surface them.
The mechanism is closer to learned chain-of-thought than to a separate reasoning model. The same Claude model is doing the reasoning; you are just paying for the room to do it. Three things change versus a normal call:
- Latency goes up: the visible answer does not start until the thinking phase ends.
- Cost goes up: every thinking token is billed at the output rate.
- Quality goes up only when the task actually benefits from deliberation.
That last point is the whole game. If your task does not benefit from deliberation, thinking is pure overhead.
Let's put numbers to it. Assume Sonnet pricing of roughly $3 per million input tokens and $15 per million output tokens, with thinking tokens billed at the output rate.
A typical RAG-flavored coding-help call, with illustrative numbers: 5,000 input tokens and a 1,000-token answer comes to about $0.015 + $0.015 = $0.03.
Same call with a 4k thinking budget, fully spent: the 4,000 thinking tokens bill as output, so you pay $0.015 for input plus $0.075 for output - about $0.09.
Thinking tripled the bill. That is fine if it caught a bug that would have cost a developer 30 minutes of debugging. It is a disaster if the model was going to give the same answer anyway.
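The arithmetic is easy to script. A minimal sketch, assuming for illustration a 5,000-token input, a 1,000-token answer, and the Sonnet rates quoted above, with thinking billed at the output rate:

```typescript
// Illustrative cost model using the rates from the text.
const INPUT_PER_MTOK = 3; // $ per million input tokens
const OUTPUT_PER_MTOK = 15; // $ per million output tokens (thinking bills here too)

function callCost(inputTokens: number, outputTokens: number, thinkingTokens = 0): number {
  const input = (inputTokens / 1_000_000) * INPUT_PER_MTOK;
  const output = ((outputTokens + thinkingTokens) / 1_000_000) * OUTPUT_PER_MTOK;
  return input + output;
}

// Hypothetical RAG-style call: 5k in, 1k out.
const plain = callCost(5_000, 1_000); // $0.03
// Same call with a 4k thinking budget fully spent.
const withThinking = callCost(5_000, 1_000, 4_000); // $0.09 - 3x the plain call
```

The token counts are assumptions for the example; plug in your own route's averages before trusting the ratio.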
The break-even question we ask before enabling it: is a better answer on this request plausibly worth more than the marginal thinking spend? A few cents per call is noise; at thousands of calls a day it is a line item.
For teams running thousands of these per day, eyeballing this math is not enough. We built CodeBurn precisely to surface thinking-token spend per route so you can see which prompts are paying for reasoning they don't need.
Here is a minimal but production-shaped extended-thinking call using the official Anthropic SDK. Note the explicit budget_tokens and the response handling for thinking blocks.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function deepReason(userQuestion: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 8000,
    thinking: {
      type: "enabled",
      budget_tokens: 4000,
    },
    messages: [
      {
        role: "user",
        content: userQuestion,
      },
    ],
  });

  let thinking = "";
  let answer = "";
  for (const block of response.content) {
    if (block.type === "thinking") {
      thinking += block.thinking;
    } else if (block.type === "text") {
      answer += block.text;
    }
  }

  return {
    thinking,
    answer,
    usage: response.usage,
  };
}
```
A few non-obvious things this code captures:
- budget_tokens is a soft cap on the thinking phase. The model can stop sooner if it concludes early. It will not exceed the budget.
- max_tokens must be greater than budget_tokens. If you set them equal, you get a thinking-only response with no user-visible answer. Yes, people ship this bug.
- Iterate over content and dispatch on block.type. A single response.content[0].text will break at runtime when thinking is enabled, because the first block is a thinking block.

The biggest mistake is assuming "harder prompt = more thinking helps." That is roughly true, but the shape of the task matters more than its difficulty.
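The budget_tokens versus max_tokens rule is cheap to guard before the call ever leaves your process. A minimal sketch with a hypothetical helper name; the 1024 floor is the documented API minimum for budget_tokens:

```typescript
// Hypothetical guard: fail fast instead of shipping the
// "thinking-only response" bug described above.
function validateThinkingBudget(maxTokens: number, budgetTokens: number): void {
  if (budgetTokens >= maxTokens) {
    throw new Error(
      `budget_tokens (${budgetTokens}) must be strictly less than max_tokens (${maxTokens}), ` +
        "or the response will contain thinking blocks and no user-visible text."
    );
  }
  if (budgetTokens < 1024) {
    // The API enforces a minimum thinking budget of 1024 tokens.
    throw new Error(`budget_tokens must be at least 1024, got ${budgetTokens}`);
  }
}
```

Call it once wherever you build request params, and the equal-values bug becomes a unit-test failure instead of a production incident.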
Tasks where thinking dramatically lifts quality: multi-step debugging, non-trivial refactors, architecture and design decisions - anything where the model must hold several constraints in tension before committing to an answer.
Tasks where thinking is overhead: summarization, formatting, conversion, extraction, simple lookups - anything with a short deterministic path from input to output.
A useful heuristic: if you can write a deterministic test that checks the answer in under five lines of code, you probably do not need thinking.
Thinking and tool use are designed to work together, and this is where the real production wins live. The model can reason about which tool to call, call it, see the result inside a thinking block, reason about the result, and call another tool. This is exactly what an agent loop should do.
The mechanics are slightly subtle. When you append a tool_result to the conversation and re-invoke the model, the previous thinking block is preserved in the assistant turn. You must include it in the messages array on the next call, or the model loses its reasoning chain. The SDK helpfully returns thinking blocks with a signature field; pass them back unchanged.
```typescript
async function agentTurn(messages: Anthropic.MessageParam[]) {
  return client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 8000,
    thinking: { type: "enabled", budget_tokens: 4000 },
    tools: TOOLS,
    messages,
  });
}
```
Strip thinking blocks before showing the conversation to a user. Keep them when you re-invoke the model. Mixing those up is the most common bug we see.
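That split can be made mechanical with two small helpers. This is a sketch with hypothetical names, and the block shapes are trimmed to the fields used:

```typescript
// Trimmed shapes of the SDK's content blocks.
type ContentBlock =
  | { type: "thinking"; thinking: string; signature: string }
  | { type: "text"; text: string };

// For the UI: keep only user-visible text.
function forDisplay(blocks: ContentBlock[]): string {
  return blocks.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("");
}

// For the next model call: every block goes back, thinking included,
// with its signature untouched. Do not edit or re-order.
function forModel(blocks: ContentBlock[]): ContentBlock[] {
  return blocks;
}
```

Having two named functions makes the asymmetry impossible to miss in code review: if a UI path ever calls forModel, or a messages array ever gets built from forDisplay, something is wrong.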
1. Thinking tokens count toward your output rate limits. A 4k thinking budget plus a 1k answer is a 5k output call for rate-limit purposes. Plan capacity accordingly.
2. Thinking content is not deterministic. Two calls with the same prompt can produce wildly different reasoning. If you log thinking for debugging, do not assert on it in tests.
3. You cannot stream a thinking block as plain text. Streaming returns event types that include thinking_delta. If your frontend assumes all deltas are user text, you will leak the model's internal monologue into the UI. Filter on event type.
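That filter is easiest to get right as a pure function over the delta types. A sketch, with the event shapes trimmed to the fields used here:

```typescript
// Trimmed shapes of the SDK's content_block_delta payloads.
type StreamDelta =
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; thinking: string }
  | { type: "signature_delta"; signature: string };

// Route each delta: only text_delta is safe to render to the user.
function routeDelta(delta: StreamDelta): { channel: "user" | "debug" | "ignore"; payload: string } {
  switch (delta.type) {
    case "text_delta":
      return { channel: "user", payload: delta.text };
    case "thinking_delta":
      return { channel: "debug", payload: delta.thinking }; // log it, never render it
    case "signature_delta":
      return { channel: "ignore", payload: "" }; // opaque; only passed back to the API
  }
}
```

Wire this into your stream loop and the "internal monologue in the UI" failure mode becomes structurally impossible rather than a thing you remember to avoid.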
4. The 1-hour prompt cache and thinking interact cleanly. Thinking blocks themselves are not cached, but the input prefix is. A long system prompt plus tool definitions cached, then a thinking call on top, is the cheapest deep-reasoning configuration we have shipped.
5. Temperature matters less than you think. People crank temperature up to "encourage creativity" in thinking mode. In our A/B tests, temperature near zero with extended thinking outperforms higher-temperature thinking on every measurable axis except surface variety. Default to 0.
6. Empty thinking blocks happen. On easy questions the model sometimes emits a tiny or empty thinking block and then answers. This is normal. You still pay the request overhead but not the budget.
The pattern that delivers ROI in real apps is selective thinking - a router that decides whether the incoming request deserves reasoning. The cheapest router is a Haiku call with a one-line classifier. The most accurate is a small static heuristic plus a Haiku fallback.
```typescript
async function shouldThink(userInput: string): Promise<boolean> {
  if (userInput.length < 80) return false;
  if (/^(summarize|format|convert|extract)/i.test(userInput)) return false;

  const classifier = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 5,
    messages: [
      {
        role: "user",
        content: `Does this request require multi-step reasoning, debugging, or design? Answer "yes" or "no" only.\n\n${userInput}`,
      },
    ],
  });

  const text = classifier.content[0].type === "text" ? classifier.content[0].text : "";
  return /yes/i.test(text);
}
```
In production, this routing pattern typically cuts thinking-token spend by 60 to 80% with zero detectable quality loss on the easy bucket. The classifier costs fractions of a cent. The savings are in dollars per request.
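The savings fall straight out of the arithmetic. A sketch with illustrative numbers - a $0.06 thinking budget per call, a 70% easy-bucket rate, and a $0.0005 classifier call, all assumptions for the example:

```typescript
// Expected per-request thinking spend with the router in place.
function routedSpend(thinkingCost: number, easyShare: number, classifierCost: number): number {
  // Every request pays the classifier; only the hard bucket pays for thinking.
  return classifierCost + (1 - easyShare) * thinkingCost;
}

const always = 0.06; // thinking enabled on every request
const routed = routedSpend(0.06, 0.7, 0.0005); // ≈ $0.0185 per request
const savings = 1 - routed / always; // ≈ 69% - inside the 60-80% range above
```

The break-even point for the router itself is trivially low: the classifier pays for itself as soon as it diverts even one percent of requests away from thinking.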
Three metrics you should chart from day one: thinking tokens per request, the share of requests the router sends to thinking, and total thinking spend per route or feature.
A useful side effect: thinking output is amazing debugging material. When a customer reports a weird answer, the thinking block usually tells you exactly which step the model got wrong. We log it (with PII scrubbing) and review failed requests against it. The 400-Dollar Overnight Bill post-mortem covers the flip side: thinking turned on for the wrong workload, no one watching the meter.
Before you ship, check:
- budget_tokens set explicitly, and strictly less than max_tokens
- response handling that iterates over content and dispatches on block.type
- streaming code that filters thinking_delta events instead of rendering them

Extended thinking is one of the highest-leverage features in the Claude API on the right workload, and one of the easiest ways to set fire to your budget on the wrong one. Get the routing right, monitor the spend, and treat the thinking output as the gold mine of debugging data it is.
For more on optimizing Claude in production, see our writeups on prompt caching and tool use patterns.