
TL;DR
A production guide to Claude's extended thinking mode. Real cost math, TypeScript SDK code, and the tasks where reasoning tokens are worth 3x the spend.
Extended thinking is the Claude feature that most teams either ignore or turn on for everything. Both are wrong. It is a budget item - every reasoning token is billed, and a single thinking call can spend 3 to 10x the tokens of a normal completion. Used on the wrong workload, it is a slow, expensive way to get the same answer. Used on the right workload, it is the difference between a model that ships a buggy refactor and one that catches the off-by-one before it merges.
This is the version of the docs I wish I had the first time I plugged thinking into a real product. We will cover what the mode actually does under the hood, the cost-benefit math at the token level, the SDK code you should ship, and the task patterns where the ROI is obvious versus the ones where you are setting cash on fire.
We walked through several live examples in our Extended Thinking Real-World Examples video on YouTube. This post is the long-form, production-grade companion.
Extended thinking gives Claude a private scratchpad. When you enable it, the model emits a stream of internal thinking blocks before it produces the user-visible response. Those blocks are real tokens - they go through the same transformer, they cost real money, and they are returned to you in the API response so you can inspect them. They are not shown to your user unless you choose to surface them.
The mechanism is closer to learned chain-of-thought than to a separate reasoning model. The same Claude model is doing the reasoning; you are just paying for the room to do it. Three things change versus a normal call:
- Latency goes up: the visible answer does not start until the thinking phase ends.
- Cost goes up: every thinking token is billed at the output rate.
- Quality goes up only when the task actually benefits from deliberation.
That last point is the whole game. If your task does not benefit from deliberation, thinking is pure overhead.
Let's put numbers to it. Assume Sonnet pricing of roughly $3 per million input tokens and $15 per million output tokens, with thinking tokens billed at the output rate.
A typical RAG-flavored coding-help call, with illustrative numbers: 5,000 input tokens and a 1,000-token answer comes to about $0.015 + $0.015 = $0.03.
Same call with a 4k thinking budget, fully spent: the 4,000 thinking tokens bill as output, so you pay $0.015 for input plus $0.075 for output - about $0.09.
Thinking tripled the bill. That is fine if it caught a bug that would have cost a developer 30 minutes of debugging. It is a disaster if the model was going to give the same answer anyway.
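The arithmetic is easy to script. A minimal sketch, assuming for illustration a 5,000-token input, a 1,000-token answer, and the Sonnet rates quoted above, with thinking billed at the output rate:

```typescript
// Illustrative cost model using the rates from the text.
const INPUT_PER_MTOK = 3; // $ per million input tokens
const OUTPUT_PER_MTOK = 15; // $ per million output tokens (thinking bills here too)

function callCost(inputTokens: number, outputTokens: number, thinkingTokens = 0): number {
  const input = (inputTokens / 1_000_000) * INPUT_PER_MTOK;
  const output = ((outputTokens + thinkingTokens) / 1_000_000) * OUTPUT_PER_MTOK;
  return input + output;
}

// Hypothetical RAG-style call: 5k in, 1k out.
const plain = callCost(5_000, 1_000); // $0.03
// Same call with a 4k thinking budget fully spent.
const withThinking = callCost(5_000, 1_000, 4_000); // $0.09 - 3x the plain call
```

The token counts are assumptions for the example; plug in your own route's averages before trusting the ratio.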
The break-even question we ask before enabling it: is a better answer on this request plausibly worth more than the marginal thinking spend? A few cents per call is noise; at thousands of calls a day it is a line item.
For teams running thousands of these per day, eyeballing this math is not enough. We built CodeBurn precisely to surface thinking-token spend per route so you can see which prompts are paying for reasoning they don't need.
Here is a minimal but production-shaped extended-thinking call using the official Anthropic SDK. Note the explicit budget_tokens and the response handling for thinking blocks.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function deepReason(userQuestion: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 8000,
    thinking: {
      type: "enabled",
      budget_tokens: 4000,
    },
    messages: [
      {
        role: "user",
        content: userQuestion,
      },
    ],
  });

  let thinking = "";
  let answer = "";
  for (const block of response.content) {
    if (block.type === "thinking") {
      thinking += block.thinking;
    } else if (block.type === "text") {
      answer += block.text;
    }
  }

  return {
    thinking,
    answer,
    usage: response.usage,
  };
}
```
A few non-obvious things this code captures:
- budget_tokens is a soft cap on the thinking phase. The model can stop sooner if it concludes early. It will not exceed the budget.
- max_tokens must be greater than budget_tokens. If you set them equal, you get a thinking-only response with no user-visible answer. Yes, people ship this bug.
- Iterate over content and dispatch on block.type. A single response.content[0].text will break at runtime when thinking is enabled, because the first block is a thinking block.

The biggest mistake is assuming "harder prompt = more thinking helps." That is roughly true, but the shape of the task matters more than its difficulty.
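The budget_tokens versus max_tokens rule is cheap to guard before the call ever leaves your process. A minimal sketch with a hypothetical helper name; the 1024 floor is the documented API minimum for budget_tokens:

```typescript
// Hypothetical guard: fail fast instead of shipping the
// "thinking-only response" bug described above.
function validateThinkingBudget(maxTokens: number, budgetTokens: number): void {
  if (budgetTokens >= maxTokens) {
    throw new Error(
      `budget_tokens (${budgetTokens}) must be strictly less than max_tokens (${maxTokens}), ` +
        "or the response will contain thinking blocks and no user-visible text."
    );
  }
  if (budgetTokens < 1024) {
    // The API enforces a minimum thinking budget of 1024 tokens.
    throw new Error(`budget_tokens must be at least 1024, got ${budgetTokens}`);
  }
}
```

Call it once wherever you build request params, and the equal-values bug becomes a unit-test failure instead of a production incident.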
Tasks where thinking dramatically lifts quality: multi-step debugging, non-trivial refactors, architecture and design decisions - anything where the model must hold several constraints in tension before committing to an answer.
Tasks where thinking is overhead: summarization, formatting, conversion, extraction, simple lookups - anything with a short deterministic path from input to output.
A useful heuristic: if you can write a deterministic test that checks the answer in under five lines of code, you probably do not need thinking.
Thinking and tool use are designed to work together, and this is where the real production wins live. The model can reason about which tool to call, call it, see the result inside a thinking block, reason about the result, and call another tool. This is exactly what an agent loop should do.
The mechanics are slightly subtle. When you append a tool_result to the conversation and re-invoke the model, the previous thinking block is preserved in the assistant turn. You must include it in the messages array on the next call, or the model loses its reasoning chain. The SDK helpfully returns thinking blocks with a signature field; pass them back unchanged.
```typescript
async function agentTurn(messages: Anthropic.MessageParam[]) {
  return client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 8000,
    thinking: { type: "enabled", budget_tokens: 4000 },
    tools: TOOLS,
    messages,
  });
}
```
Strip thinking blocks before showing the conversation to a user. Keep them when you re-invoke the model. Mixing those up is the most common bug we see.
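That split can be made mechanical with two small helpers. This is a sketch with hypothetical names, and the block shapes are trimmed to the fields used:

```typescript
// Trimmed shapes of the SDK's content blocks.
type ContentBlock =
  | { type: "thinking"; thinking: string; signature: string }
  | { type: "text"; text: string };

// For the UI: keep only user-visible text.
function forDisplay(blocks: ContentBlock[]): string {
  return blocks.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("");
}

// For the next model call: every block goes back, thinking included,
// with its signature untouched. Do not edit or re-order.
function forModel(blocks: ContentBlock[]): ContentBlock[] {
  return blocks;
}
```

Having two named functions makes the asymmetry impossible to miss in code review: if a UI path ever calls forModel, or a messages array ever gets built from forDisplay, something is wrong.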
1. Thinking tokens count toward your output rate limits. A 4k thinking budget plus a 1k answer is a 5k output call for rate-limit purposes. Plan capacity accordingly.
2. Thinking content is not deterministic. Two calls with the same prompt can produce wildly different reasoning. If you log thinking for debugging, do not assert on it in tests.
3. You cannot stream a thinking block as plain text. Streaming returns event types that include thinking_delta. If your frontend assumes all deltas are user text, you will leak the model's internal monologue into the UI. Filter on event type.
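That filter is easiest to get right as a pure function over the delta types. A sketch, with the event shapes trimmed to the fields used here:

```typescript
// Trimmed shapes of the SDK's content_block_delta payloads.
type StreamDelta =
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; thinking: string }
  | { type: "signature_delta"; signature: string };

// Route each delta: only text_delta is safe to render to the user.
function routeDelta(delta: StreamDelta): { channel: "user" | "debug" | "ignore"; payload: string } {
  switch (delta.type) {
    case "text_delta":
      return { channel: "user", payload: delta.text };
    case "thinking_delta":
      return { channel: "debug", payload: delta.thinking }; // log it, never render it
    case "signature_delta":
      return { channel: "ignore", payload: "" }; // opaque; only passed back to the API
  }
}
```

Wire this into your stream loop and the "internal monologue in the UI" failure mode becomes structurally impossible rather than a thing you remember to avoid.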
4. The 1-hour prompt cache and thinking interact cleanly. Thinking blocks themselves are not cached, but the input prefix is. A long system prompt plus tool definitions cached, then a thinking call on top, is the cheapest deep-reasoning configuration we have shipped.
5. Temperature matters less than you think. People crank temperature up to "encourage creativity" in thinking mode. In our A/B tests, temperature near zero with extended thinking outperforms higher-temperature thinking on every measurable axis except surface variety. Default to 0.
6. Empty thinking blocks happen. On easy questions the model sometimes emits a tiny or empty thinking block and then answers. This is normal. You still pay the request overhead but not the budget.
The pattern that delivers ROI in real apps is selective thinking - a router that decides whether the incoming request deserves reasoning. The cheapest router is a Haiku call with a one-line classifier. The most accurate is a small static heuristic plus a Haiku fallback.
```typescript
async function shouldThink(userInput: string): Promise<boolean> {
  if (userInput.length < 80) return false;
  if (/^(summarize|format|convert|extract)/i.test(userInput)) return false;

  const classifier = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 5,
    messages: [
      {
        role: "user",
        content: `Does this request require multi-step reasoning, debugging, or design? Answer "yes" or "no" only.\n\n${userInput}`,
      },
    ],
  });

  const text = classifier.content[0].type === "text" ? classifier.content[0].text : "";
  return /yes/i.test(text);
}
```
In production, this routing pattern typically cuts thinking-token spend by 60 to 80% with zero detectable quality loss on the easy bucket. The classifier costs fractions of a cent. The savings are in dollars per request.
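The savings fall straight out of the arithmetic. A sketch with illustrative numbers - a $0.06 thinking budget per call, a 70% easy-bucket rate, and a $0.0005 classifier call, all assumptions for the example:

```typescript
// Expected per-request thinking spend with the router in place.
function routedSpend(thinkingCost: number, easyShare: number, classifierCost: number): number {
  // Every request pays the classifier; only the hard bucket pays for thinking.
  return classifierCost + (1 - easyShare) * thinkingCost;
}

const always = 0.06; // thinking enabled on every request
const routed = routedSpend(0.06, 0.7, 0.0005); // ≈ $0.0185 per request
const savings = 1 - routed / always; // ≈ 69% - inside the 60-80% range above
```

The break-even point for the router itself is trivially low: the classifier pays for itself as soon as it diverts even one percent of requests away from thinking.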
Three metrics you should chart from day one: thinking tokens per request, the share of requests the router sends to thinking, and total thinking spend per route or feature.
A useful side effect: thinking output is amazing debugging material. When a customer reports a weird answer, the thinking block usually tells you exactly which step the model got wrong. We log it (with PII scrubbing) and review failed requests against it. The 400-Dollar Overnight Bill post-mortem covers the flip side: thinking turned on for the wrong workload, no one watching the meter.
Before you ship, check:
- budget_tokens set explicitly, and strictly less than max_tokens
- response handling that iterates over content and dispatches on block.type
- streaming code that filters thinking_delta events instead of rendering them

Extended thinking is one of the highest-leverage features in the Claude API on the right workload, and one of the easiest ways to set fire to your budget on the wrong one. Get the routing right, monitor the spend, and treat the thinking output as the gold mine of debugging data it is.
For more on optimizing Claude in production, see our writeups on prompt caching and tool use patterns.