The Fable 5 Moment
TL;DR
Fable 5 1M context workflows that actually work: whole-repo reviews, log archaeology, multi-doc synthesis - plus the honest math on when RAG still wins.
Last updated: June 11, 2026
Claude Fable 5 launched on June 9 with a 1M token context window as the default, not an opt-in beta, and no long-context price premium - a 900K-token request bills at the same per-token rate as a 9K one. That combination changes which workflows are practical. But "1M tokens" is a spec, not a workflow, and the gap between the two is where teams will waste money this month. This guide works through three patterns that hold up in practice, with cost math attached, and is honest about where retrieval still beats stuffing the window.
The specs, from Anthropic's models overview: 1M token context, 128K max output, $10 per million input tokens and $50 per million output. The context windows documentation confirms the 1M maximum is also the default on the Claude API, and a single request can include up to 600 images or PDF pages.
Two caveats before you size anything:
The tokenizer counts differently. Fable 5 uses the tokenizer introduced with Opus 4.7, which produces roughly 30% more tokens for the same text than pre-4.7 models (the pricing docs say up to 35%). Anthropic's own tooltip pegs 1M tokens at roughly 555K words or about 2.5M unicode characters. If your sizing intuition was built on older Claude models, re-baseline it.
The usable envelope is smaller than the spec. In Claude Code specifically, Verdent's analysis of the 1M window puts the practical budget at about 830K tokens once the auto-compaction buffer and usage thresholds are accounted for. Plan around 800K, not a million.
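A rough sizing helper makes the two caveats concrete. This is a sketch, not a tokenizer: `CHARS_PER_TOKEN` and `USABLE_TOKENS` are the approximations quoted above (about 2.5 characters per token, about 830K usable tokens in Claude Code), and the function names are hypothetical.

```python
# Rough sizing only: both constants are approximations quoted in this article,
# not values returned by any API.
CHARS_PER_TOKEN = 2.5      # ~2.5M characters per 1M tokens on the new tokenizer
USABLE_TOKENS = 830_000    # practical Claude Code budget per Verdent's analysis


def estimate_tokens(num_chars: int) -> int:
    """Estimate a token count from a character count (bytes, for ASCII text)."""
    return round(num_chars / CHARS_PER_TOKEN)


def fits_usable_window(num_chars: int, reserve_tokens: int = 0) -> bool:
    """True if the text plus a reserve for questions and output fits the budget."""
    return estimate_tokens(num_chars) + reserve_tokens <= USABLE_TOKENS


if __name__ == "__main__":
    repo_bytes = 1_500_000  # hypothetical repo size after exclusions
    print(estimate_tokens(repo_bytes), fits_usable_window(repo_bytes))
```

The `reserve_tokens` parameter is there because the window also has to hold your questions and the model's output, not just the loaded material.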
Simon Willison's release-day assessment sets the temperament: "slow, expensive" but "the challenge is finding tasks that it can't do." The long-context tier is an async tool. Every workflow below assumes you are not sitting there watching a spinner.
The classic 1M-context pitch is loading an entire codebase and asking questions that span it: dead code, dependency cycles, inconsistent error handling across modules. This works, but only if you size and cache deliberately.
Step 1: Measure before you load. Using the documented ~2.5 characters per token on the new tokenizer:
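git ls-files -z \
  | grep -zvE 'node_modules|dist|build|\.lock|-lock\.|\.svg|fixtures' \
  | xargs -0 cat | wc -c
# -z/-0 handle odd filenames; cat | wc -c yields one total even when xargs splits the file list
# divide total bytes by ~2.5 for a rough token estimate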
Verdent's exclusion list matches what we have seen: never load node_modules/, build output, lockfiles, generated protobuf or GraphQL code, or large test fixtures. They are token-dense and signal-poor.
Step 2: Do the cache math. Say the repo lands at 600K tokens and you want a ten-question architecture review session. Per the prompt caching docs, a 1-hour cache write runs 2x base input and reads run 0.1x. Uncached, ten sends of a $6.00 context cost $60.00 in input. Cached, one $12.00 write plus nine $0.60 reads costs $17.40.
That is roughly a 70% input-cost reduction for the session, before output tokens either way. The docs note the 1-hour cache pays for itself after two reads; the 5-minute tier ($12.50/MTok write) pays off after one read. Fable 5's minimum cacheable prompt is 512 tokens on the Claude API, so even small stable preambles cache.
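The session arithmetic can be rerun for other repo sizes and question counts with a few lines of Python. This is a minimal sketch under stated assumptions: the first question pays the 1-hour cache write and every later question is a cache read, the cache never expires mid-session, and the tokens of the questions themselves and all output tokens are ignored. The function name is hypothetical; the prices and multipliers are the ones quoted above.

```python
BASE_INPUT_PER_MTOK = 10.00   # USD, Fable 5 input price quoted above
CACHE_WRITE_1H_MULT = 2.0     # 1-hour cache write multiplier
CACHE_READ_MULT = 0.1         # cache read multiplier


def session_input_cost(context_tokens: int, questions: int, cached: bool) -> float:
    """Input cost in USD for sending the same context once per question."""
    if questions <= 0:
        return 0.0
    per_send = context_tokens / 1_000_000 * BASE_INPUT_PER_MTOK
    if not cached:
        return per_send * questions
    # Assumption: one cache write on the first question, reads for the rest.
    write = per_send * CACHE_WRITE_1H_MULT
    reads = per_send * CACHE_READ_MULT * (questions - 1)
    return write + reads


uncached = session_input_cost(600_000, 10, cached=False)
cached = session_input_cost(600_000, 10, cached=True)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, savings {savings:.0%}")
```

Note that a single-question session is more expensive with the 1-hour cache than without it, because the write costs 2x and there are no reads to recover it.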
Step 3: Use Batch for the overnight pass. A non-interactive full-repo review (run the audit, file the report) qualifies for the Batch API's 50% discount - $5/$25 per MTok - which turns the $6.00 cold load into $3.00. For recurring scheduled reviews, batch plus caching is the only configuration where whole-repo passes stay cheap enough to run routinely. We covered the per-task framing in more depth in the Fable 5 cost-per-task analysis.
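To see what the Batch discount does to a whole request rather than just the input side, a small calculator helps. This is a sketch using only the prices quoted in this article; `request_cost` is a hypothetical helper, it models an uncached request, and the 20K-token report size in the usage lines is an illustrative assumption.

```python
INPUT_PER_MTOK = 10.00    # USD, standard input price quoted in this article
OUTPUT_PER_MTOK = 50.00   # USD, standard output price quoted in this article
BATCH_DISCOUNT = 0.5      # Batch API bills at 50% of standard rates


def request_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """USD cost of one uncached request, optionally at Batch API rates."""
    cost = (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


# Cold 600K-token load, input only, standard versus batch.
print(request_cost(600_000, 0), request_cost(600_000, 0, batch=True))
# Same load plus a hypothetical 20K-token audit report.
print(request_cost(600_000, 20_000), request_cost(600_000, 20_000, batch=True))
```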
Log archaeology is the workflow where intuitions are most wrong. A million tokens sounds like "all the logs." It is not. At ~2.5 characters per token, the window holds roughly 2.5MB of raw text - on the order of 16,000 to 17,000 typical 150-byte log lines. A busy service emits that in minutes.
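The capacity estimate is easy to check for your own line lengths and log volume. A minimal sketch, assuming the article's ~2.5 characters per token and treating bytes as characters; the function names and the 100 lines per second rate in the usage line are hypothetical.

```python
CONTEXT_TOKENS = 1_000_000
CHARS_PER_TOKEN = 2.5  # article's rough estimate, not a tokenizer


def lines_that_fit(avg_line_bytes: int, tokens: int = CONTEXT_TOKENS) -> int:
    """How many log lines of a given average size fit in the token budget."""
    return int(tokens * CHARS_PER_TOKEN // avg_line_bytes)


def seconds_to_fill(lines_per_second: float, avg_line_bytes: int = 150) -> float:
    """How long a service at a given log rate takes to fill the whole window."""
    return lines_that_fit(avg_line_bytes) / lines_per_second


print(lines_that_fit(150))          # lines of 150 bytes in 1M tokens
print(seconds_to_fill(100) / 60)    # minutes to fill at a hypothetical 100 lines/s
```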
So the pattern is filter-then-load, not load-everything:
# cut to the incident window and suspect services first
grep -h "2026-06-10T2[2-3]" logs/api-*.log logs/worker-*.log \
| grep -vE 'healthz|heartbeat|GET /metrics' > incident.log
wc -c incident.log # bytes / 2.5 ~= tokens
What the long window buys you is not capacity for everything - it is that the entire filtered slice fits in one shot. Cross-service timeline reconstruction ("the worker retries started 40 seconds before the API 502s, and both correlate with this deploy marker") is exactly the reasoning that chunked retrieval breaks, because the causal chain spans chunks no single query retrieves together. Anthropic's launch post claims Fable 5 "stays focused across millions of tokens in long-running tasks" - a vendor claim, but incident reconstruction is where that focus is most visibly useful.
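Timeline reconstruction is easier for the model if the filtered slice arrives as one chronologically ordered stream with each line tagged by service. The sketch below shows one way to build that. It assumes every line starts with an ISO 8601 timestamp in the same format and timezone, and that each source is already in time order (which `heapq.merge` requires); the function names and the `[service]` prefix are hypothetical, and the noise patterns mirror the filter used earlier.

```python
import heapq
from typing import Iterable, Iterator

NOISE = ("healthz", "heartbeat", "GET /metrics")


def incident_lines(
    lines: Iterable[str], service: str, start: str, end: str
) -> Iterator[tuple[str, str, str]]:
    """Yield (timestamp, service, line) for non-noise lines with start <= ts < end."""
    for line in lines:
        ts = line.split(" ", 1)[0]  # assumes the timestamp is the first field
        if start <= ts < end and not any(n in line for n in NOISE):
            yield ts, service, line.rstrip("\n")


def merged_timeline(sources: dict[str, Iterable[str]], start: str, end: str) -> list[str]:
    """Merge per-service streams into one timestamp-ordered, service-tagged list."""
    streams = [incident_lines(lines, name, start, end) for name, lines in sources.items()]
    return [f"[{svc}] {line}" for _, svc, line in heapq.merge(*streams)]
```

In practice the values in `sources` would be open file handles, so the merge streams through the logs without holding whole files in memory.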
One honest limit: this is a single-session pattern. If your incident review spans days and the log corpus keeps growing, you are back in retrieval territory (more on that below).
The third pattern is loading a document set whole: an RFC plus the four design docs it supersedes, a vendor contract stack, or six to eight research papers (the pricing docs estimate a 500KB research PDF at ~125K tokens, so eight of those is the whole window). The 600-page-per-request PDF cap is the operative limit before tokens are.
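A quick pre-flight check shows which limit a document set hits first. This is a sketch under stated assumptions: `PDF_BYTES_PER_TOKEN` is extrapolated from the single estimate quoted above (500KB for about 125K tokens), which will not hold for every PDF, and the `Pdf` class, function name, and page counts are hypothetical.

```python
from dataclasses import dataclass

PDF_BYTES_PER_TOKEN = 4.0      # extrapolated from "500KB PDF ~ 125K tokens"
MAX_PAGES_PER_REQUEST = 600    # per-request image/PDF page cap quoted earlier
MAX_TOKENS = 1_000_000         # spec window; use ~830K for Claude Code


@dataclass
class Pdf:
    name: str
    size_bytes: int
    pages: int


def check_document_set(docs: list[Pdf], max_tokens: int = MAX_TOKENS) -> dict:
    """Estimate tokens and pages for a document set and test both limits."""
    tokens = round(sum(d.size_bytes for d in docs) / PDF_BYTES_PER_TOKEN)
    pages = sum(d.pages for d in docs)
    return {
        "tokens": tokens,
        "pages": pages,
        "fits_tokens": tokens <= max_tokens,
        "fits_pages": pages <= MAX_PAGES_PER_REQUEST,
    }


papers = [Pdf(f"paper{i}", 500_000, 90) for i in range(8)]  # hypothetical set
print(check_document_set(papers))
```

With eight hypothetical 90-page papers, the estimate lands exactly on the token spec while the page count is already over the cap, which is the situation the page limit warning describes.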
Anthropic's own guidance here predates Fable 5 and still holds. The contextual retrieval post says it plainly: if your knowledge base is under 200K tokens (about 500 pages), skip RAG entirely and put it all in the prompt, with caching making that "significantly faster and more cost-effective" - they cite latency improvements over 2x and cost reductions up to 90% from caching alone.
What Fable 5's tier adds is the 200K-to-830K band: document sets that were previously forced into RAG purely by capacity now fit in one request. Synthesis tasks ("where do these five specs contradict each other") benefit most, because contradictions are relational facts that live between documents, and retrieval pipelines surface documents one relevance-ranked chunk at a time. For the mechanics of structuring large prompts, our context engineering guide covers ordering, stable-prefix layout, and cache breakpoint placement.
Latency is real. Willison called the model "a beast" that is slow and expensive, and long-input requests compound that: processing 800K input tokens takes meaningfully longer than 8K regardless of model. Treat full-window calls as async jobs with retries, not interactive turns. If you need fast iterative chat over a big codebase, Opus 4.8 with tighter context is often the better tool.
Context rot is acknowledged, measured, and not solved. Anthropic's own docs state that "as token count grows, accuracy and recall degrade," and their context engineering post frames attention as a finite budget that every token depletes. Verdent's roundup of independent testing found degradation starting around 400K tokens and retrieval becoming unreliable past 600K on Sonnet 4.6, with a working heuristic of about 2% effectiveness loss per 100K tokens. The best published long-context retrieval number in that set is Opus 4.6 at 78.3% on 8-needle MRCR at 1M tokens - strong, and still means missed needles. No equivalent independent Fable 5 figure existed in anything we could fetch this week; until one does, assume the back half of the window is softer than the front and put your highest-value content early.
Cost compounds per turn. Every conversational turn re-sends the whole context. An uncached 800K-token context costs $8.00 of input per turn. Caching is not an optimization here; it is the difference between viable and absurd. And if generation overruns the window, requests on 4.5+ models stop with stop_reason: "model_context_window_exceeded" rather than erroring - handle it.
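Handling the stop reason can be as simple as branching on it before trusting the output. A minimal sketch, assuming the response has already been parsed into a dict with a `stop_reason` key as in the Messages API; the action names returned here are hypothetical placeholders for whatever your job runner does next.

```python
def next_action(response: dict) -> str:
    """Map a response's stop_reason to a (hypothetical) follow-up action."""
    reason = response.get("stop_reason")
    if reason == "model_context_window_exceeded":
        # Generation ran into the window limit: the output is incomplete,
        # so shrink the input (or split the task) before retrying.
        return "trim_context_and_retry"
    if reason == "max_tokens":
        # Output hit the requested max_tokens cap, not the window itself.
        return "raise_max_tokens_or_continue"
    return "accept"


print(next_action({"stop_reason": "model_context_window_exceeded"}))
```

The point of the branch is that a truncated response arrives as a normal, successful reply, so error handling alone will never catch it.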
The honest decision table, synthesized from Anthropic's guidance and Verdent's analysis:
| Situation | Load the window | Use retrieval |
|---|---|---|
| Corpus size | Under ~830K tokens | Over ~830K tokens |
| Query pattern | One deep session, many questions | High-frequency queries over weeks |
| Content change rate | Static snapshot | Frequently updated (cache invalidation kills you) |
| Question shape | Cross-cutting, relational | Pointy, lookup-style |
| Session shape | Single session | Multi-session, many users |
The per-query economics are the part people skip. A cached full-window read still costs about $0.80 in input per query at 800K tokens. A RAG pipeline retrieving 5K relevant tokens costs about $0.05 uncached. At hundreds of queries a day, retrieval wins by an order of magnitude even before latency. And retrieval itself has gotten better: Anthropic's contextual retrieval technique cuts retrieval failure rates by 49% (combined with BM25) and 67% with reranking. If your workload is lookup-shaped, a well-built RAG pipeline is not the legacy option - it is the right one.
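The per-query comparison scales linearly, so a daily figure is a one-liner. This sketch uses only the prices quoted in this article and deliberately ignores cache writes, question and output tokens, and the cost of running the retrieval pipeline itself; the function names and the 300 queries per day figure are hypothetical.

```python
INPUT_PER_MTOK = 10.00   # USD, standard input price quoted in this article
CACHE_READ_MULT = 0.1    # cache read multiplier


def cached_window_query_cost(context_tokens: int) -> float:
    """Input cost of one query that re-reads a fully cached context."""
    return context_tokens / 1_000_000 * INPUT_PER_MTOK * CACHE_READ_MULT


def rag_query_cost(retrieved_tokens: int) -> float:
    """Input cost of one query that sends only retrieved chunks, uncached."""
    return retrieved_tokens / 1_000_000 * INPUT_PER_MTOK


def daily_costs(queries_per_day: int, context_tokens: int = 800_000,
                retrieved_tokens: int = 5_000) -> tuple[float, float]:
    """(full-window, retrieval) input cost per day in USD."""
    return (queries_per_day * cached_window_query_cost(context_tokens),
            queries_per_day * rag_query_cost(retrieved_tokens))


window, rag = daily_costs(300)  # hypothetical volume
print(f"window ${window:.2f}/day, retrieval ${rag:.2f}/day, ratio {window / rag:.0f}x")
```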
The hybrid that Anthropic's engineering team actually recommends is just-in-time loading: keep lightweight identifiers in context, pull full content on demand, and use sub-agents that return condensed summaries instead of raw exploration. That pattern, plus compaction (in beta for Fable 5), is how the long-horizon agent harnesses get multi-day runs out of a finite window.
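The first two parts of that pattern, identifiers in context and content on demand, can be sketched in a few lines. This is an illustration only, not Anthropic's implementation: the `JustInTimeContext` class, its methods, and the budget check are hypothetical, token counts use the article's rough 2.5 characters per token, and sub-agents and compaction are out of scope here.

```python
from typing import Callable

CHARS_PER_TOKEN = 2.5  # article's rough estimate, not a tokenizer


class JustInTimeContext:
    """Keep cheap identifiers in the prompt; fetch full documents only when asked."""

    def __init__(self, loaders: dict[str, Callable[[], str]], budget_tokens: int) -> None:
        self._loaders = loaders          # doc id -> function that returns its text
        self._budget = budget_tokens
        self._loaded: dict[str, str] = {}

    def manifest(self) -> str:
        """Lightweight identifiers that stay in the prompt at all times."""
        return "\n".join(f"- {doc_id}" for doc_id in sorted(self._loaders))

    def tokens_loaded(self) -> int:
        """Estimated tokens of everything pulled into context so far."""
        return round(sum(len(t) for t in self._loaded.values()) / CHARS_PER_TOKEN)

    def load(self, doc_id: str) -> str:
        """Fetch full content on demand, refusing loads that would exceed the budget."""
        if doc_id in self._loaded:
            return self._loaded[doc_id]
        text = self._loaders[doc_id]()
        if self.tokens_loaded() + round(len(text) / CHARS_PER_TOKEN) > self._budget:
            raise ValueError(f"loading {doc_id!r} would exceed the context budget")
        self._loaded[doc_id] = text
        return text
```

In an agent loop, `manifest()` would go in the stable prompt prefix and `load()` would back a tool the model calls when it decides it needs a specific file.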
There is no long-context surcharge. Anthropic's pricing docs state that Fable 5, Opus 4.8, and Sonnet 4.6 include the full 1M window at standard per-token rates, with caching and batch discounts applying across the whole window. The cost driver is volume, not a surcharge: 800K input tokens is $8.00 at the base rate regardless.
Anthropic's docs estimate 1M tokens at roughly 555K words or 2.5M characters on the new tokenizer. In Claude Code, plan for ~830K usable tokens after the auto-compaction buffer. As a rough rule, divide your repo's byte count (after excluding vendored deps, lockfiles, and generated code) by 2.5.
Quality does not hold uniformly across the full window. Anthropic's docs acknowledge context rot, and independent testing collected by Verdent found measurable degradation from ~400K tokens with a ~2% effectiveness loss per 100K added. Put critical content early, and verify long-context outputs the same way you would verify a junior engineer's first pass.
The 1M window replaces RAG only for corpora under ~830K tokens that are static, queried in concentrated sessions, and benefit from cross-document reasoning. High-frequency lookup workloads over large or changing corpora stay cheaper and faster on retrieval, especially with contextual retrieval cutting failure rates by up to 67%.
Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans only through June 22, 2026; from June 23 it requires usage credits. If you are mid-evaluation, the June 22 deadline post covers what changes, and the migration guide covers the API-side moves.