TL;DR
Everything you need to ship Claude Fable 5 in production - from the API surface changes and adaptive thinking defaults to rate limit strategy, streaming latency, and the June 15 deprecation deadline for older models.
Claude Fable 5 launched on June 9, 2026 as Anthropic's most capable widely released model. The model ID is claude-fable-5, pricing is $10 per million input tokens and $50 per million output tokens, and the API surface has a handful of genuine breaking changes that will bite you silently if you miss them.
This guide covers what actually changed compared to Opus 4.8, the gotchas production teams are running into, and a concrete migration checklist before the June 15 deprecation deadline for older Claude 4 models.
Last updated: June 10, 2026
The headline specs are straightforward: 1M token context window, 128k max output tokens per request, $10/$50 per million input/output tokens. According to the official models overview, Fable 5 is available on the Claude API, Claude Platform on AWS, Amazon Bedrock, Vertex AI, and Microsoft Foundry from day one.
The table below summarizes where Fable 5 diverges from Opus 4.8:
| Parameter | Opus 4.8 | Fable 5 |
|---|---|---|
| Model ID | claude-opus-4-8 | claude-fable-5 |
| Pricing (input / output) | $5 / $25 per MTok | $10 / $50 per MTok |
| Context window | 1M tokens | 1M tokens |
| Max output | 128k tokens | 128k tokens |
| `thinking: {"type": "disabled"}` | Accepted | 400 error |
| `thinking: {"type": "enabled", "budget_tokens": N}` | 400 error | 400 error |
| `temperature`, `top_p`, `top_k` | Rejected (400) | Rejected (400) |
| Adaptive thinking | Optional (omit to disable) | Always on |
| Raw thinking content | Optional | Never returned |
| Data retention | Standard | 30-day minimum (Covered Model) |
The key API-level change: on Fable 5, adaptive thinking is always on and cannot be disabled. If your existing code passes thinking: {"type": "disabled"}, that call will return a 400 on Fable 5. The fix is to remove the thinking parameter entirely rather than passing disabled. This is documented on the Fable 5 introduction page.
Everything else from Opus 4.7/4.8 carries over unchanged: budget_tokens is still rejected, sampling parameters still 400, assistant-turn prefills still 400.
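If request arguments travel through your code as plain dicts, the parameter changes above can be applied in one place. The sketch below is one way to do that; the helper name `sanitize_for_fable5` is hypothetical and not part of the Anthropic SDK, and the list of rejected parameters is taken from the comparison table.

```python
# Hypothetical helper (not part of the Anthropic SDK): strip request arguments
# that Fable 5 rejects with a 400, per the comparison table above.
REJECTED_SAMPLING_PARAMS = ("temperature", "top_p", "top_k")


def sanitize_for_fable5(params: dict) -> dict:
    """Return a copy of params that targets claude-fable-5 without rejected keys."""
    cleaned = {k: v for k, v in params.items() if k not in REJECTED_SAMPLING_PARAMS}

    thinking = cleaned.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") in ("disabled", "enabled"):
        # Both "disabled" and "enabled" + budget_tokens return a 400.
        # Omitting the parameter leaves adaptive thinking on.
        del cleaned["thinking"]

    cleaned["model"] = "claude-fable-5"
    return cleaned
```

Silently dropping `temperature` changes behavior for callers that relied on it, so in a real migration you would probably log each removed key rather than discard it quietly.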
Adaptive thinking is mandatory on Fable 5 - there is no way to turn it off. This matters for two reasons.
First: thinking content is omitted by default. The thinking blocks still appear in the stream but their text is empty unless you explicitly set thinking: {"type": "adaptive", "display": "summarized"}. If you stream reasoning to users or log it for debugging, you will see what looks like a long pause before output begins - the model is thinking, but the text is empty. Add display: "summarized" to restore visible progress.
```python
import anthropic

client = anthropic.Anthropic()

# Opus 4.8 - you could disable thinking
client.messages.create(
    model="claude-opus-4-8",
    thinking={"type": "disabled"},  # worked fine
    # ... max_tokens, messages, and the rest of your request
)

# Fable 5 - thinking cannot be disabled; omit the parameter entirely
client.messages.create(
    model="claude-fable-5",
    # No thinking parameter needed - adaptive is the default
    output_config={"effort": "high"},
    # ... max_tokens, messages, and the rest of your request
)

# If you surface reasoning to users, opt into summarized display
client.messages.create(
    model="claude-fable-5",
    thinking={"type": "adaptive", "display": "summarized"},
    # ... max_tokens, messages, and the rest of your request
)
```
Second: what this means for your prompts. Because thinking is always running, aggressive "think step by step" instructions that were written to elicit reasoning on earlier models are now redundant. More importantly, the effort parameter matters more than on any prior model. Fable 5 calibrates deeply to effort level - start at high as your default, use xhigh for coding and agentic work, and reserve max for genuinely hard tasks where correctness justifies the cost.
The model self-moderates how much to think based on task complexity under adaptive mode. Thinking-token spend is therefore not uniform across requests: simpler requests use less thinking compute. This is a cost win compared to models where you had to manually tune budget_tokens per route.
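One way to keep effort settings reviewable is a single per-route table. This is a minimal sketch: the route names are invented for illustration, and the effort values follow the guidance above (high by default, xhigh for coding and agentic work, max for the hardest tasks).

```python
# Hypothetical route names; swap in the routes your service actually has.
EFFORT_BY_ROUTE = {
    "chat": "high",
    "code_review": "xhigh",
    "agent_loop": "xhigh",
    "proof_check": "max",
}


def request_options(route: str) -> dict:
    """Build the model and effort keyword arguments for a Fable 5 request."""
    effort = EFFORT_BY_ROUTE.get(route, "high")  # unknown routes fall back to high
    return {"model": "claude-fable-5", "output_config": {"effort": effort}}
```

The result can be splatted into a call, for example `client.messages.create(**request_options("agent_loop"), max_tokens=16000, messages=messages)`. Keeping the table in one place makes it easier to sweep values against an eval set before locking them.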
Fable 5 pricing at $10/$50 per MTok is 2x the cost of Opus 4.8. If you are running concurrent agent pipelines, that multiplier compounds fast. A few patterns that hold up in production:
Tier your models by task complexity. Not every agent call needs Fable 5. Route simple classification, extraction, and summarization to Haiku 4.5 ($1/$5 per MTok) or Sonnet 4.6 ($3/$15 per MTok). Reserve Fable 5 for long-horizon reasoning, complex code generation, and the final synthesis step in multi-step pipelines.
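A tiering rule can be as small as a lookup. In the sketch below the task labels, the split between the cheap and mid tiers, and the Haiku model ID `claude-haiku-4-5` are assumptions made for illustration; only the overall shape (cheap models for bounded tasks, Fable 5 for everything hard) comes from the advice above.

```python
# Assumed task labels and tier split; adjust to your own workload.
CHEAP_TASKS = {"classification", "extraction"}
MID_TASKS = {"summarization"}


def pick_model(task: str) -> str:
    """Choose the cheapest model tier that is likely to handle the task."""
    if task in CHEAP_TASKS:
        return "claude-haiku-4-5"  # assumed ID for Haiku 4.5
    if task in MID_TASKS:
        return "claude-sonnet-4-6"
    # Long-horizon reasoning, complex code generation, final synthesis
    return "claude-fable-5"
```

Defaulting unknown tasks to Fable 5 favors quality over cost; flipping the default to the mid tier is the safer choice if unexpected spend is the bigger risk for you.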
Use the Batch API for non-latency-sensitive work. The Message Batches API gives a 50% cost reduction across all models including Fable 5. On the Batch API, Fable 5 also supports up to 300k output tokens using the output-300k-2026-03-24 beta header. If your pipeline can tolerate async processing (most reporting, analysis, and enrichment flows can), batch is the right default.
Handle 429s defensively. The Anthropic SDK auto-retries 429 and 5xx with exponential backoff (max_retries=2 by default). For high-concurrency agent loops, you may want to set max_retries=5 and implement queue-level backpressure rather than hammering retries per-call. The retry-after header on 429 responses tells you exactly how long to wait.
```python
import anthropic

# For agent pipelines: increase retries and implement backpressure
client = anthropic.Anthropic(max_retries=5)

# Check rate limit headers on any response
response = client.messages.with_raw_response.create(
    model="claude-fable-5",
    max_tokens=16000,
    messages=[{"role": "user", "content": prompt}],
)
message = response.parse()  # the usual Message object

# Anthropic's rate limit headers use the anthropic-ratelimit-* prefix
remaining = response.headers.get("anthropic-ratelimit-tokens-remaining")
limit = response.headers.get("anthropic-ratelimit-tokens-limit")
```
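For queue-level backpressure you need a wait time that a whole worker pool can honor, not just a per-call retry. The function below is a sketch of that calculation: it prefers the server's retry-after value and otherwise falls back to exponential backoff with full jitter. The function name and the `base` and `cap` defaults are assumptions, and it assumes retry-after arrives as a number of seconds.

```python
import random
from typing import Optional


def wait_seconds(attempt: int, retry_after: Optional[str],
                 base: float = 1.0, cap: float = 60.0) -> float:
    """How long to pause before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server-provided wait wins
        except ValueError:
            pass  # non-numeric header value: fall through to backoff
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)]
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

A queue consumer would sleep for this long before pulling the next job, which slows the whole pipeline down instead of letting every in-flight call retry independently.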
Token counting before expensive calls. For long-context use cases approaching the 1M window, run client.messages.count_tokens() before the main call to catch context overflows before they happen and to estimate cost. Fable 5 and Opus 4.8 count tokens differently from each other - re-baseline against claude-fable-5 rather than reusing Opus 4.8 estimates.
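Once you have a token count, the pre-flight check itself is simple arithmetic. This sketch uses the window size and input price quoted in this guide; the function name is hypothetical, and it assumes that input tokens plus max_tokens must fit inside the context window.

```python
CONTEXT_WINDOW = 1_000_000   # Fable 5 context window, from the specs above
INPUT_PRICE_PER_MTOK = 10.0  # USD per million input tokens, Fable 5


def preflight(input_tokens: int, max_tokens: int) -> dict:
    """Check that a request fits the window and estimate its input cost."""
    return {
        "fits": input_tokens + max_tokens <= CONTEXT_WINDOW,
        "input_cost_usd": input_tokens * INPUT_PRICE_PER_MTOK / 1_000_000,
    }
```

In practice the first argument would come from `client.messages.count_tokens(model="claude-fable-5", messages=messages).input_tokens`, so the estimate reflects Fable 5's own tokenization rather than an Opus 4.8 baseline.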
The SDK enforces streaming for high-max_tokens requests. Practically, anything above roughly 16k output tokens requires stream=True or the SDK will raise a ValueError before even hitting the API. With 128k max output, you will almost always be streaming on Fable 5 for substantive tasks.
The practical cost: streamed output on Fable 5 at $50 per million output tokens runs about 5 cents per thousand output tokens. A 10k-token code generation costs roughly $0.50 in output alone. That is not a reason to avoid Fable 5, but it is worth instrumenting per-call output token counts so you can see where spend is concentrated.
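Per-call instrumentation does not need much machinery. The tracker below is a sketch that aggregates output tokens by route and converts them to dollars at the $50 per million rate quoted above; the class name and the idea of a string route label are assumptions.

```python
from collections import defaultdict

OUTPUT_PRICE_PER_MTOK = 50.0  # USD per million output tokens, Fable 5


class SpendTracker:
    """Accumulate output tokens per route and report the output cost."""

    def __init__(self) -> None:
        self.output_tokens = defaultdict(int)

    def record(self, route: str, output_tokens: int) -> None:
        self.output_tokens[route] += output_tokens

    def output_cost_usd(self, route: str) -> float:
        return self.output_tokens[route] * OUTPUT_PRICE_PER_MTOK / 1_000_000
```

After each call you would record `message.usage.output_tokens` under the route that made the request, then export the totals to whatever metrics system you already run.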
When non-streaming wins. For short, latency-sensitive queries (classification, entity extraction, short answers), non-streaming with a modest max_tokens cap (256-1024) avoids stream overhead. The latency difference is real at low token counts. If max_tokens is under 16k and the task is genuinely bounded, non-streaming is fine.
The get_final_message() pattern. You do not need to handle individual stream events to get timeout protection from streaming. Use .stream() with .get_final_message() - you get streaming's timeout benefits without the event loop complexity:
```python
with client.messages.stream(
    model="claude-fable-5",
    max_tokens=64000,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    message = stream.get_final_message()

print(message.usage.output_tokens)
```
For user-facing applications where you want to show output as it arrives, iterate stream.text_stream instead. For internal pipelines where only the final result matters, .get_final_message() is simpler.
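For the user-facing case, a sketch of the text_stream loop might look like this. The function name is hypothetical, and the client is passed in as an argument so the function can be exercised with a stub; in real use it would be an `anthropic.Anthropic()` instance.

```python
import sys


def stream_to_stdout(client, prompt: str, max_tokens: int = 64000) -> str:
    """Write text deltas to stdout as they arrive and return the full text."""
    chunks = []
    with client.messages.stream(
        model="claude-fable-5",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            sys.stdout.write(text)
            sys.stdout.flush()  # show each delta immediately
            chunks.append(text)
    return "".join(chunks)
```

In a web application the `sys.stdout.write` call would be replaced by whatever pushes bytes to the browser, such as a server-sent events response.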
Fable 5 is a Covered Model under Anthropic's data retention policy. According to Anthropic's documentation, this means a 30-day data retention minimum applies and zero data retention (ZDR) is not available for Fable 5 requests.
For teams with strict data handling requirements - healthcare, finance, legal - this matters. Zero data retention was the way to ensure Anthropic did not retain any request or response data, even transiently. Fable 5 removes that option. If ZDR is a hard requirement, you need to stay on Opus 4.8 or earlier models that support it, or route sensitive traffic to those models while using Fable 5 for non-sensitive workloads.
For enterprise contracts, check your data processing agreement. The 30-day minimum applies to Anthropic-operated platforms. If you are running via Amazon Bedrock or Vertex AI, those platforms have their own data handling terms that may differ.
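If you split traffic by sensitivity, the routing decision can live in one small guard. This sketch assumes your requests already carry a boolean flag saying whether ZDR is required; how you derive that flag, and the function name, are assumptions.

```python
def model_for(requires_zdr: bool, preferred: str = "claude-fable-5") -> str:
    """Route ZDR-bound traffic away from Fable 5, which is a Covered Model."""
    if requires_zdr and preferred == "claude-fable-5":
        return "claude-opus-4-8"
    return preferred
```

Putting the check at the point where the model ID is chosen means a sensitive request cannot reach Fable 5 just because a caller forgot to override the default.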
Fable 5 introduces a first-class fallback mechanism in the API. When Fable 5's safety classifiers decline a request, the response comes back as a successful HTTP 200 with stop_reason: "refusal" - not a 4xx error. Anthropic's documentation notes that most refused requests can be served by another Claude model, and the API supports two fallback patterns.
Server-side fallback (beta on the Claude API and Claude Platform on AWS): pass a fallbacks parameter and the API retries automatically on a specified model if Fable 5 refuses.
Client-side fallback: the Python, TypeScript, Go, Java, and C# SDKs include middleware that detects stop_reason: "refusal" and retries on a fallback model. This works on all platforms including Bedrock and Vertex AI.
The practical implication for your error handling: if you built your refusal handling around catching HTTP errors, you need to add a check for stop_reason == "refusal" on successful responses from Fable 5. This change affects any code path that currently branches only on end_turn and tool_use.
```python
response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    messages=[{"role": "user", "content": prompt}],
)

if response.stop_reason == "refusal":
    # Fable 5 declined - try fallback model
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=16000,
        messages=[{"role": "user", "content": prompt}],
    )
elif response.stop_reason == "end_turn":
    # Normal completion
    pass
```
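The two-model branch above generalizes to an ordered chain. The helper below is a hand-rolled sketch, not the SDK middleware or the server-side fallbacks parameter mentioned earlier; its name and signature are assumptions, and the client is passed in so it can be tested with a stub.

```python
def create_with_fallback(client, models, **params):
    """Try each model in order, moving on only when the model refuses."""
    response = None
    for model in models:
        response = client.messages.create(model=model, **params)
        if response.stop_reason != "refusal":
            break  # any non-refusal result is returned to the caller
    return response
```

If every model in the chain refuses, the last refusal is returned, so callers still need to check `stop_reason` on the result. A hand-rolled chain like this would not go through the SDK middleware or server-side fallback, which are the paths the fallback credit is described for.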
When you use the SDK middleware or server-side fallback, Anthropic provides a fallback credit to offset the prompt-cache cost of switching models mid-request. See Fallback credit in the docs.
Prompt caching works on Fable 5 the same as on earlier models - still a prefix match, still up to 4 cache_control breakpoints, still a minimum cacheable prefix of roughly 2048 tokens (Fable 5 sits in the same tier as Sonnet 4.6). But with a 1M context window, a few new failure modes appear.
Cache invalidation at scale. If your system prompt references a timestamp, UUIDs, or any per-request dynamic content early in the prompt, you lose all caching downstream. With 1M context and $10/MTok input pricing, a single un-cached read of a 500k-token context costs $5 in input alone. Before scaling up context window usage, audit your prompt assembly pipeline for silent invalidators: datetime.now() in system prompts, unsorted JSON serialization, and per-user content spliced into the stable prefix are the most common culprits.
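Two of those invalidators, unsorted JSON and dynamic values in the prefix, can be designed out when the prompt is assembled. The sketch below builds the stable prefix deterministically and produces a short fingerprint you can log per request; both function names are hypothetical.

```python
import hashlib
import json


def stable_prefix(instructions: str, tools: list) -> str:
    """Serialize the cacheable part of the prompt deterministically."""
    # sort_keys removes key-order drift between processes. Timestamps, UUIDs,
    # and per-user data do not belong in this string.
    return instructions + "\n" + json.dumps(tools, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log with each request; a changing value means cache misses."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint changes between two requests that should share a cache entry, something dynamic has leaked into the prefix, and you can find it by diffing the two prefixes.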
Token counting shifted from Opus 4.8. Fable 5 and Opus 4.8 tokenize differently. The same input produces a different token count. If you have cost estimators, rate-limit thresholds, or compaction triggers calibrated against Opus 4.8 token counts, re-baseline them by running client.messages.count_tokens(model="claude-fable-5", ...) against a representative sample before going to production.
300k output on the Batch API. The Batch API supports up to 300k output tokens for Fable 5 with the output-300k-2026-03-24 beta header. This is substantially more than the 128k cap on the synchronous API. For document generation, long-form analysis, or bulk code output that can tolerate async processing, the Batch API gives you both a larger output ceiling and the 50% cost discount.
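Building the batch payload is plain data shaping. The sketch below turns a mapping of IDs to prompts into request entries using the `custom_id` and `params` fields of the Message Batches request format; the function name and the 300k default are assumptions based on the limit described above.

```python
def build_batch_requests(prompts: dict, max_tokens: int = 300_000) -> list:
    """Turn {custom_id: prompt} into Message Batches request entries."""
    return [
        {
            "custom_id": custom_id,
            "params": {
                "model": "claude-fable-5",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for custom_id, prompt in prompts.items()
    ]
```

One way to submit the result, assuming the SDK's beta namespace accepts this flag, is `client.beta.messages.batches.create(requests=build_batch_requests(prompts), betas=["output-300k-2026-03-24"])`. Check the Batch API docs for the exact way to pass the beta header in your SDK version.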
According to Anthropic's deprecation page, claude-opus-4-20250514 and claude-sonnet-4-20250514 are deprecated as of April 14, 2026, with a retirement date of June 15, 2026. That is the near-term deadline that affects production code. Fable 5 is a separate, newer model - but if you are still running the Claude 4 models from the May 2025 snapshot, June 15 is your hard cutoff.
- Remove `thinking: {"type": "disabled"}` - this returns a 400 on Fable 5. Omit the thinking parameter instead.
- Add `stop_reason == "refusal"` handling - Fable 5 safety classifiers can decline requests as successful 200 responses. Code that only checks for end_turn will silently miss refusals.
- Stream any request with `max_tokens` above 16k - the SDK enforces this for both Fable 5 and Opus 4.8.
- Check the model ID - `claude-fable-5` is the correct ID. Do not append date suffixes.
- Set `thinking.display: "summarized"` if you surface reasoning to users or log it - the default "omitted" returns empty thinking text on Fable 5 and Opus 4.7/4.8.
- Tune `effort` per route - Fable 5 calibrates deeply to effort level. Start at high, use xhigh for coding/agentic, sweep your eval set before locking values.

| Model | Retirement | Replacement |
|---|---|---|
| `claude-opus-4-20250514` | June 15, 2026 | `claude-opus-4-8` |
| `claude-sonnet-4-20250514` | June 15, 2026 | `claude-sonnet-4-6` |
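A quick scan of config files and source code can confirm nothing still points at the retiring IDs. This sketch encodes the retirement table above as a dict; the function name is hypothetical, and it does plain substring matching rather than parsing any particular config format.

```python
# Retiring model IDs and their replacements, from the table above.
REPLACEMENTS = {
    "claude-opus-4-20250514": "claude-opus-4-8",
    "claude-sonnet-4-20250514": "claude-sonnet-4-6",
}


def find_retiring_models(text: str) -> dict:
    """Return {retiring_id: replacement} for every retiring ID found in text."""
    return {old: new for old, new in REPLACEMENTS.items() if old in text}
```

Run it over each file's contents in CI and fail the build when the result is non-empty, so a stray reference cannot survive past June 15.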
For broader migration context, the Anthropic migration guide covers breaking changes from every prior model version.
| Resource | URL |
|---|---|
| Fable 5 introduction | platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5 |
| Models overview and pricing | platform.claude.com/docs/en/about-claude/models/overview |
| Model deprecations | platform.claude.com/docs/en/about-claude/model-deprecations |
| Migration guide | platform.claude.com/docs/en/about-claude/models/migration-guide |
| Adaptive thinking | platform.claude.com/docs/en/build-with-claude/adaptive-thinking |
| Refusals and fallback | platform.claude.com/docs/en/build-with-claude/refusals-and-fallback |
| Rate limits | platform.claude.com/docs/en/api/rate-limits |
| Prompt caching | platform.claude.com/docs/en/build-with-claude/prompt-caching |
If you are building production integrations on top of the Claude API, see the related guides on error handling and reliability, the Batch API for high-volume workloads, and prompt caching strategy.
What is the Claude Fable 5 model ID?
The model ID is claude-fable-5. Do not append date suffixes - the ID is complete as-is. Claude Fable 5 launched on June 9, 2026.
Can I disable thinking on Fable 5?
No. Adaptive thinking is always on for Fable 5. Passing thinking: {"type": "disabled"} returns a 400 error. To keep thinking content out of your responses, simply omit the thinking parameter (thinking will still occur but its text will be empty). To receive summarized thinking output, set thinking: {"type": "adaptive", "display": "summarized"}.
How much does Fable 5 cost?
Fable 5 is priced at $10 per million input tokens and $50 per million output tokens, according to the official pricing page. The Batch API gives a 50% discount on these rates.
Does Fable 5 support zero data retention?
No. Fable 5 is a Covered Model with a 30-day data retention minimum. Zero data retention is not available for Fable 5 requests. If ZDR is required, use Claude Opus 4.8 or earlier models that support it.
What happens on June 15, 2026?
June 15, 2026 is the retirement date for claude-opus-4-20250514 and claude-sonnet-4-20250514. Requests to these models will fail after that date. Migrate to claude-opus-4-8 and claude-sonnet-4-6 respectively.
Does prompt caching work on Fable 5?
Yes. Prompt caching works on Fable 5 with the same mechanics as earlier models - prefix match, up to 4 breakpoints, minimum ~2048 cacheable tokens. The input cost at $10/MTok makes caching more valuable than on earlier models, but also makes cache invalidation bugs more expensive.