
TL;DR
1M context, 128K output, a memory tool, compaction, and task budgets change what a single agent run can cover. Here is what is verified, what is plausible, and six projects builders can try now.
This is Part 4 of the Fable 5 agent fleets series. Part 1 covered what Fable 5 is and why the government switched it off. Part 3 is the applied guide to handling refusals at fleet scale so a run does not silently fail. This post covers the possibilities: what a day-long autonomous run actually looks like when a single model holds a million tokens and can write state that outlives its own context.
Most agent frameworks today are built around a hard constraint: the model forgets. Context fills up, you compress or drop history, and the agent loses the thread on anything long. A lot of orchestration complexity - retrieval layers, summarization passes, handoff protocols - exists to work around that single limitation.
Fable 5 changes the constraint's dimensions. It ships with a 1M token context window, up to 128K output tokens per request, and a set of API primitives aimed squarely at runs that last longer than one prompt. That does not make the forgetting problem disappear, but it moves the ceiling high enough that a different class of task becomes a single coherent run instead of a fragile pipeline of stitched-together calls.
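Concretely, the building blocks are a 1M token context window, up to 128K output tokens per request, a memory tool, code execution, programmatic tool calling, context editing and compaction (both beta), and task budgets (beta).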
Note that adaptive thinking is always on and cannot be disabled; you tune its depth with the effort parameter. For long-horizon work that matters, because deeper reasoning on a hard step is often the difference between a run that stays on track and one that drifts. It also means cost control lives in effort and task budgets, not in a thinking on/off switch.
The headline example, and the one worth being precise about, is vendor-reported. In the Fable 5 launch post, Anthropic reports that Stripe used Fable 5 to migrate a roughly 50-million-line Ruby codebase in about a day. Anthropic also reports top scores on FrontierCode and CursorBench, and that giving the model file-based memory roughly tripled its gains on long-horizon tasks compared with Opus 4.8, with the lead growing as tasks get longer and more complex.
Treat all of that as partner-reported and benchmark-reported, because it is. It is a strong signal from a credible source, not an independently reproduced result you should quote as a guarantee to your stakeholders. What it establishes is directional: a single model run holding a very large codebase in context, writing state to memory as it goes, and emitting large volumes of changed code is a real workload the vendor is demonstrating, not a hypothetical.
The honest framing for a builder is this. The primitives are verified: 1M context, 128K output, a memory tool, compaction, task budgets, code execution. The magnitude of what Stripe reports is plausible given those primitives but is a vendor claim under conditions you do not control. Your mileage will depend on your codebase's structure, your test coverage, your prompts, and how much of the work is genuinely mechanical versus judgment-heavy. Design your own pilot to find out, rather than assuming the demo transfers one-to-one.
The practical win of a million tokens is not "more history." It is that the agent can reason over a whole system at once. A migration or a repo-wide refactor no longer has to be chunked into files the model sees in isolation, losing cross-file invariants at every boundary. When the entire codebase plus the task's running history fit in context, the model can catch the call site three directories away that your chunked pipeline would have missed. The failure mode of chunked agents - locally correct edits that are globally inconsistent - is exactly what a large context is positioned to reduce.
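Before committing to a whole-repo run, it helps to check whether the repo plausibly fits in the window at all. The sketch below is a rough estimate only: the four-characters-per-token ratio is an assumed heuristic rather than a real tokenizer, and the function names, file suffixes, and the 200K headroom figure are all hypothetical choices for illustration.

```python
from pathlib import Path

CONTEXT_WINDOW = 1_000_000  # context size in tokens, as stated in this post
CHARS_PER_TOKEN = 4         # ASSUMPTION: crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def repo_token_estimate(root: str, suffixes=(".py", ".rb", ".ts")) -> int:
    """Sum estimated tokens over source files under `root`."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


def fits_in_window(repo_tokens: int, reserve_for_history: int = 200_000) -> bool:
    """True if the repo still leaves `reserve_for_history` tokens of headroom."""
    return repo_tokens + reserve_for_history <= CONTEXT_WINDOW
```

If the estimate lands near the limit, treat that as a sign to pilot on a subsystem first, since the run's own history also has to share the window.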
Large output per request means the deliverable can be the artifact itself. A full test suite, a complete migration diff for a module, a long structured report - emitted in one coherent pass rather than assembled from many partial calls that each lose a little context at the seam. Fewer seams means fewer places for inconsistency to creep in.
The memory tool is the primitive that changes the character of a run. A context window, however large, is still finite and still turns over on a long enough job. A file-based memory tool lets the agent write down what it has learned - decisions made, conventions discovered, files already handled - so that knowledge persists even after the raw tokens that produced it have scrolled out of context. Anthropic's own reported result, that file memory roughly tripled long-horizon gains over Opus 4.8, points at this being the load-bearing primitive for genuinely long runs, not the raw window size.
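To make the idea of durable state concrete, here is a minimal sketch of a notes file that survives context turnover. It is not the memory tool's API: the `RunMemory` class, its method names, and the JSON layout are all hypothetical, and it only illustrates the kind of state (decisions, conventions, files handled) the paragraph above describes.

```python
import json
from pathlib import Path


class RunMemory:
    """Hypothetical notes file that persists across context turnovers."""

    def __init__(self, path: str):
        self.path = Path(path)
        if self.path.exists():
            # Resume from whatever an earlier part of the run wrote down.
            self.state = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.state = {"decisions": [], "conventions": [], "files_done": []}

    def _save(self) -> None:
        # Write to a temp file, then rename, so a crash cannot leave half a file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state, indent=2), encoding="utf-8")
        tmp.replace(self.path)

    def record_decision(self, note: str) -> None:
        self.state["decisions"].append(note)
        self._save()

    def mark_done(self, file_path: str) -> None:
        if file_path not in self.state["files_done"]:
            self.state["files_done"].append(file_path)
            self._save()

    def is_done(self, file_path: str) -> bool:
        return file_path in self.state["files_done"]
```

Saving after every write is deliberately conservative: on a day-long run, losing a few milliseconds per note is cheaper than losing the record of which files were already handled.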
Even a million tokens runs out on a large enough job. Compaction and context editing (both beta) are the mechanisms for continuing past that ceiling: condensing what is in context and pruning what is no longer needed without ending the run. Combined with memory, this is what turns "a very long single request" into "a genuinely long-horizon agent" - one that can keep working after its context has been reshaped several times.
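The general shape of compaction can be sketched in a few lines. This is an illustration of the concept, not the beta feature itself: `compact_history`, the character-based token count, and the caller-supplied `summarize` function are all assumptions made for the example.

```python
from typing import Callable, List


def compact_history(
    turns: List[str],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
    count_tokens: Callable[[str], int] = lambda t: len(t) // 4 + 1,
) -> List[str]:
    """Condense older turns into one summary once the history exceeds max_tokens.

    `summarize` is a stand-in for whatever produces the condensed text;
    `count_tokens` defaults to a crude character-based estimate (ASSUMPTION).
    """
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(t) for t in turns)
    if total <= max_tokens or len(turns) <= keep_recent:
        return turns  # still under the ceiling, nothing to do
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Older turns collapse into one summary; recent turns stay verbatim.
    return ["[summary] " + summarize(old)] + recent
```

Keeping the most recent turns verbatim is a design choice: the freshest context is usually what the next step depends on, while older material is where a summary loses the least.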
The catch with day-long autonomy is that a runaway agent can spend a lot of money before anyone notices. Task budgets (beta) cap the cost and depth of a run so autonomy stays bounded. This is the primitive that makes long-horizon runs safe to actually turn loose, because "let it work overnight" only makes sense if "it" cannot burn an unbounded amount while you sleep. For the scheduling side of recurring, unattended runs, Claude Code loops covers the native primitive.
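Independently of the API-side task budget, a harness can keep its own hard stop. The sketch below is a client-side guard only and does not represent how task budgets work; `BudgetGuard`, `BudgetExceeded`, and the dollar and step limits are hypothetical names and values.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run goes past its local spend or step limit."""


class BudgetGuard:
    """Hypothetical client-side cap on spend and step count for one run."""

    def __init__(self, max_usd: float, max_steps: int):
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, usd: float) -> None:
        """Record one model call and stop the run if either limit is crossed."""
        self.spent_usd += usd
        self.steps += 1
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.2f} of ${self.max_usd:.2f}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"took {self.steps} of {self.max_steps} steps")
```

Calling `charge` after every model call means the worst-case overshoot is a single call, which is the property that makes an overnight run tolerable.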
It is worth being blunt about the line between what is verified and what is only reported, because a lot of Fable 5 commentary blurs it.
Verified (documented by Anthropic): 1M token context; up to 128K output per request; a memory tool; code execution; programmatic tool calling; context editing and compaction (beta); task budgets (beta); always-on adaptive thinking tuned by effort; pricing of $10 per 1M input and $50 per 1M output; text plus high-resolution vision input; a 30-day retention requirement.
Vendor or benchmark reported (credible, not independently reproduced here): the Stripe ~50M-line migration in about a day; top FrontierCode and CursorBench scores; file-based memory roughly tripling long-horizon gains over Opus 4.8; the lead growing with task length and complexity.
Plausible but unproven for your workload: that a day-long autonomous run will hold coherence across your specific codebase; that the memory tool will retain the right state for your task without careful prompt design; that costs will land where you expect before you have measured them. These are the things a pilot answers, not a blog post.
Build on the verified primitives. Use the vendor claims as reasons to run a pilot, not as numbers to promise upward.
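The documented pricing makes a back-of-envelope cost estimate straightforward. The sketch below uses only the two flat rates listed above ($10 per 1M input tokens, $50 per 1M output tokens); the function name is hypothetical, and reading "128K" as 128,000 tokens is an assumption.

```python
INPUT_USD_PER_MTOK = 10.0   # $ per 1M input tokens, as listed in this post
OUTPUT_USD_PER_MTOK = 50.0  # $ per 1M output tokens, as listed in this post


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request at the two flat rates above."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# A single request that fills the window and emits a maximum-length output:
# 1,000,000 * $10/1M = $10.00, plus 128,000 * $50/1M = $6.40.
full_call = call_cost_usd(1_000_000, 128_000)
print(f"${full_call:.2f}")  # prints $16.40
```

A long run makes many such calls, so measured token counts from a pilot matter far more than this per-call ceiling.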
If you have access to Fable 5 and you want to pressure-test long-horizon agents on real work, here are concrete starting points. Each one leans on a different combination of the primitives above.
A codebase-wide migration. Pick a mechanical but sprawling change - a framework version bump, an API rename, a language idiom shift - and let a single run hold the whole repo in context while writing progress to memory. This is the direct analog of the Stripe example. Start on a subsystem, measure coherence and cost, then decide whether to scale.
Repo-wide test authoring. Point the agent at an under-tested codebase and have it generate a coherent test suite in large 128K output passes, using memory to track which modules are covered so it does not duplicate or drift as it works across the repo.
A multi-day research agent. Combine memory and compaction to run a research task that spans far more material than fits in one window - a literature sweep, a competitive teardown, a standards review - where the agent's notes file becomes the durable artifact and the context is repeatedly reshaped around it.
Full documentation regeneration. Have the agent read an entire codebase in context and regenerate docs that stay consistent with the actual implementation, emitting long structured output and using memory to keep terminology and structure uniform across hundreds of pages.
Dependency upgrades at scale. Task a bounded run (task budgets on) with upgrading a dependency across a large monorepo, resolving the cascade of breaking changes with the whole tree visible in context rather than one package at a time.
A long-lived maintenance agent. Give an agent a memory file as its persistent brain and a task budget as its leash, and let it work a backlog over a long session - triaging issues, drafting fixes, updating notes - so its accumulated context survives across many context-window turnovers.
For every one of these, the discipline from Part 3 still applies: check stop_reason on each call, keep an Opus 4.8 fallback, and treat refusal rate as a health metric. A long-horizon run has more surface area for a silent refusal to corrupt the whole job, so the reliability work is not optional just because the model is more capable.
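That discipline can be sketched as a thin wrapper around each call. Everything here is illustrative: `primary` and `fallback` are hypothetical callables standing in for your own client code, responses are modeled as plain dicts, and the `"refusal"` value for `stop_reason` is an assumption you should check against the API documentation.

```python
from typing import Callable, Dict

REFUSAL = "refusal"  # ASSUMPTION: the stop_reason value that marks a refusal


class RefusalTracker:
    """Counts calls and refusals so the rate can be watched as a health metric."""

    def __init__(self) -> None:
        self.calls = 0
        self.refusals = 0

    def rate(self) -> float:
        return self.refusals / self.calls if self.calls else 0.0


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], Dict],
    fallback: Callable[[str], Dict],
    tracker: RefusalTracker,
) -> Dict:
    """Call the primary model, check stop_reason, and fall back on a refusal."""
    tracker.calls += 1
    response = primary(prompt)
    if response.get("stop_reason") != REFUSAL:
        return response
    # The refusal arrived as a normal response, so it must be caught here.
    tracker.refusals += 1
    return fallback(prompt)
```

In a fleet, an alert on `tracker.rate()` crossing a threshold is what turns a silent failure into a visible one.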
Fable 5's long-horizon story is real where it counts: the primitives that make day-long, large-context, memory-backed runs possible are documented and available. The most eye-catching number, the Stripe migration, is a vendor claim that tells you the ceiling is high, not that your run will hit it. The right move is to build on the verified primitives, run a scoped pilot on one of the projects above, measure coherence and cost on your own workload, and let the results - not the launch post - tell you how far to push.
A 1M context window does not remove the need for memory or retrieval. A larger context reduces how much stitching and retrieval you need for a given job, but a million tokens still turns over on a long enough run. The memory tool exists precisely because durable state has to outlive the window. Anthropic's own reported result, that file memory roughly tripled long-horizon gains over Opus 4.8, suggests memory is the load-bearing primitive for long runs, not raw window size.
Treat the Stripe migration claim as vendor-reported. It is a credible signal from Anthropic's launch post that a very large single-run migration is a real workload, but it was performed under conditions you do not control. Use it as a reason to pilot on a subsystem of your own codebase and measure, not as a number to promise to stakeholders.
To keep a long run from overspending, use task budgets (beta) to cap the cost and depth of the run, and tune reasoning depth with the effort parameter. Adaptive thinking is always on and cannot be switched off, so cost control lives in budgets and effort, not in a thinking toggle. For fleets, also apply the retry-budget discipline from Part 3 so refusal-driven fallbacks do not quietly multiply spend.
Verified and documented: 1M context, 128K output, the memory tool, code execution, programmatic tool calling, compaction and context editing (beta), and task budgets (beta). Vendor or benchmark reported: the Stripe migration, top FrontierCode and CursorBench scores, and the file-memory gains over Opus 4.8. Plausible but unproven for your case: that a run will stay coherent and land on-budget for your specific codebase. That last category is what a pilot answers.