Long-Horizon Agents: What Fable 5's 1M Context and Memory Actually Unlock

This is Part 4 of the Fable 5 agent fleets series. Part 1 covered what Fable 5 is and why the government switched it off. Part 3 is the applied guide to handling refusals at fleet scale so a run does not silently fail. This post covers the possibilities: what a day-long autonomous run actually looks like when a single model holds a million tokens and can write state that outlives its own context.

The shape of a long-horizon run

Most agent frameworks today are built around a hard constraint: the model forgets. Context fills up, you compress or drop history, and the agent loses the thread on anything long. A lot of orchestration complexity - retrieval layers, summarization passes, handoff protocols - exists to work around that single limitation.

Fable 5 changes the constraint's dimensions. It ships with a 1M token context window, up to 128K output tokens per request, and a set of API primitives aimed squarely at runs that last longer than one prompt. That does not make the forgetting problem disappear, but it moves the ceiling high enough that a different class of task becomes a single coherent run instead of a fragile pipeline of stitched-together calls.

Concretely, the building blocks are:

1M token context to hold an entire codebase, a long task history, and the working state of a multi-step job at the same time.
128K output per request to emit large artifacts - a full migration, a generated test suite, a long report - without chopping them across dozens of calls.
A memory tool to write durable state to files that survive beyond the context window, so a run's knowledge is not lost when the window turns over.
Compaction and context editing (beta) to keep a run going when it outgrows even 1M tokens, by condensing or pruning what is in context without ending the run.
Task budgets (beta) to cap the cost and depth of a long autonomous run so "day-long" does not mean "unbounded spend."
Programmatic tool calling and code execution so the agent can act, not just describe.

Note that adaptive thinking is always on and cannot be disabled; you tune its depth with the effort parameter. For long-horizon work that matters, because deeper reasoning on a hard step is often the difference between a run that stays on track and one that drifts. It also means cost control lives in effort and task budgets, not in a thinking on/off switch.

The concrete anchor: a codebase migration in about a day

The headline example, and the one worth being precise about, is vendor-reported. In the Fable 5 launch post, Anthropic reports that Stripe used Fable 5 to migrate a roughly 50-million-line Ruby codebase in about a day. Anthropic also reports top scores on FrontierCode and CursorBench, and that giving the model file-based memory roughly tripled its gains on long-horizon tasks compared with Opus 4.8, with the lead growing as tasks get longer and more complex.

Treat all of that as partner-reported and benchmark-reported, because it is. It is a strong signal from a credible source, not an independently reproduced result you should quote as a guarantee to your stakeholders. What it establishes is directional: a single model run holding a very large codebase in context, writing state to memory as it goes, and emitting large volumes of changed code is a real workload the vendor is demonstrating, not a hypothetical.

The honest framing for a builder is this. The primitives are verified: 1M context, 128K output, a memory tool, compaction, task budgets, code execution. The magnitude of what Stripe reports is plausible given those primitives but is a vendor claim under conditions you do not control. Your mileage will depend on your codebase's structure, your test coverage, your prompts, and how much of the work is genuinely mechanical versus judgment-heavy. Design your own pilot to find out, rather than assuming the demo transfers one-to-one.

What each primitive actually buys you

1M context: the whole thing in the room

The practical win of a million tokens is not "more history." It is that the agent can reason over a whole system at once. A migration or a repo-wide refactor no longer has to be chunked into files the model sees in isolation, losing cross-file invariants at every boundary. When the entire codebase plus the task's running history fit in context, the model can catch the call site three directories away that your chunked pipeline would have missed. The failure mode of chunked agents - locally correct edits that are globally inconsistent - is exactly what a large context is positioned to reduce.
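Before betting on "the whole thing in the room," it helps to check whether the subsystem you picked plausibly fits. The sketch below is a rough pre-flight estimate, not a real tokenizer: the 4-characters-per-token ratio, the 200K headroom for history and output, and every function name here are assumptions for illustration.

```python
import os

CONTEXT_WINDOW_TOKENS = 1_000_000  # window size stated in this post
CHARS_PER_TOKEN = 4                # assumption: crude heuristic, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".rb", ".ts")) -> int:
    """Sum the rough token estimate over source files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += estimate_tokens(fh.read())
    return total


def fits_in_context(repo_tokens: int, reserved_tokens: int = 200_000) -> bool:
    """True if the repo fits while leaving headroom for task history and output."""
    return repo_tokens + reserved_tokens <= CONTEXT_WINDOW_TOKENS
```

If `fits_in_context(estimate_repo_tokens("path/to/subsystem"))` comes back false, that is a signal to scope the pilot to a smaller subsystem or to lean harder on memory and compaction, not a precise measurement.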

128K output: artifacts, not fragments

Large output per request means the deliverable can be the artifact itself. A full test suite, a complete migration diff for a module, a long structured report - emitted in one coherent pass rather than assembled from many partial calls that each lose a little context at the seam. Fewer seams means fewer places for inconsistency to creep in.

The memory tool: state that outlives the window

This is the one that changes the character of a run. A context window, however large, is still finite and still turns over on a long enough job. A file-based memory tool lets the agent write down what it has learned - decisions made, conventions discovered, files already handled - so that knowledge persists even after the raw tokens that produced it have scrolled out of context. Anthropic's own reported result, that file memory roughly tripled long-horizon gains over Opus 4.8, points at this being the load-bearing primitive for genuinely long runs, not the raw window size.
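To make "state that outlives the window" concrete, here is a minimal local sketch of the idea. It is not the memory tool's API: `FileMemory`, its JSON layout, and the three note categories are hypothetical, chosen to mirror the decisions, conventions, and handled files mentioned above.

```python
import json
from pathlib import Path


class FileMemory:
    """Durable notes persisted as JSON on disk (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.state = {"decisions": [], "conventions": [], "files_done": []}
        if self.path.exists():
            # A resumed run starts from whatever an earlier run wrote down.
            self.state = json.loads(self.path.read_text(encoding="utf-8"))

    def record(self, kind: str, note: str) -> None:
        """Append a note under a category, skipping exact duplicates."""
        notes = self.state.setdefault(kind, [])
        if note not in notes:
            notes.append(note)
        self._flush()

    def pending(self, all_files: list) -> list:
        """Files not yet marked done, in their original order."""
        done = set(self.state["files_done"])
        return [f for f in all_files if f not in done]

    def _flush(self) -> None:
        # Write to a temp file and rename, so a crash mid-write
        # does not leave a half-written memory file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state, indent=2), encoding="utf-8")
        tmp.replace(self.path)
```

The point of the design is that `pending()` gives the same answer after the context has turned over, or after a restart, because the answer lives on disk rather than in the conversation.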

Compaction and context editing: runs that outgrow 1M

Even a million tokens runs out on a large enough job. Compaction and context editing (both beta) are the mechanisms for continuing past that ceiling: condensing what is in context and pruning what is no longer needed without ending the run. Combined with memory, this is what turns "a very long single request" into "a genuinely long-horizon agent" - one that can keep working after its context has been reshaped several times.
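The condensing idea can be sketched client-side in a few lines. This illustrates the concept only and says nothing about how the beta features are actually invoked: `Message`, `summarize`, and `compact` are hypothetical names, and the summary here is a stub where a real run would have the model write one.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    tokens: int


def summarize(messages: list) -> str:
    """Stub summary; a real run would ask the model to condense these."""
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list, limit: int, keep_recent: int = 4) -> list:
    """Replace older messages with one summary when history exceeds limit."""
    total = sum(m.tokens for m in history)
    if total <= limit or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    text = summarize(old)
    # Assumption: ~4 characters per token for the summary's own size.
    summary = Message("user", text, tokens=max(1, len(text) // 4))
    return [summary] + recent
```

Anything the run cannot afford to lose in a summary (decisions, conventions, progress) belongs in the memory file before compaction happens, which is why the two primitives are described together.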

Task budgets: bounded autonomy

The catch with day-long autonomy is that a runaway agent can spend a lot of money before anyone notices. Task budgets (beta) cap the cost and depth of a run so autonomy stays bounded. This is the primitive that makes long-horizon runs safe to actually turn loose, because "let it work overnight" only makes sense if "it" cannot burn an unbounded amount while you sleep. For the scheduling side of recurring, unattended runs, Claude Code loops covers the native primitive.
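Independently of the beta feature, a client-side spend guard is cheap insurance. The sketch below is not the task budgets API; `SpendGuard` and `BudgetExceeded` are hypothetical names. It uses the documented pricing of $10 per 1M input tokens and $50 per 1M output tokens, and assumes your client reports token counts per call.

```python
from dataclasses import dataclass

INPUT_USD_PER_MTOK = 10.0   # pricing as listed in this post
OUTPUT_USD_PER_MTOK = 50.0


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the cap."""


@dataclass
class SpendGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Record one call's cost; return remaining budget or raise."""
        self.spent_usd += call_cost_usd(input_tokens, output_tokens)
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f}")
        return self.cap_usd - self.spent_usd
```

As a sanity check on scale: one request that fills the window (1,000,000 input tokens) and emits 128,000 output tokens costs $10.00 + $6.40 = $16.40 at these prices, so a loop that resends a full context on every step adds up quickly.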

Verified versus plausible

It is worth being blunt about the line, because a lot of Fable 5 commentary blurs it.

Verified (documented by Anthropic): 1M token context; up to 128K output per request; a memory tool; code execution; programmatic tool calling; context editing and compaction (beta); task budgets (beta); always-on adaptive thinking tuned by effort; pricing of $10 per 1M input and $50 per 1M output; text plus high-resolution vision input; a 30-day retention requirement.

Vendor or benchmark reported (credible, not independently reproduced here): the Stripe ~50M-line migration in about a day; top FrontierCode and CursorBench scores; file-based memory roughly tripling long-horizon gains over Opus 4.8; the lead growing with task length and complexity.

Plausible but unproven for your workload: that a day-long autonomous run will hold coherence across your specific codebase; that the memory tool will retain the right state for your task without careful prompt design; that costs will land where you expect before you have measured them. These are the things a pilot answers, not a blog post.

Build on the verified primitives. Use the vendor claims as reasons to run a pilot, not as numbers to promise upward.

Six projects to try now

If you have access to Fable 5 and you want to pressure-test long-horizon agents on real work, here are concrete starting points. Each one leans on a different combination of the primitives above.

A codebase-wide migration. Pick a mechanical but sprawling change - a framework version bump, an API rename, a language idiom shift - and let a single run hold the whole repo in context while writing progress to memory. This is the direct analog of the Stripe example. Start on a subsystem, measure coherence and cost, then decide whether to scale.
Repo-wide test authoring. Point the agent at an under-tested codebase and have it generate a coherent test suite in large 128K output passes, using memory to track which modules are covered so it does not duplicate or drift as it works across the repo.
A multi-day research agent. Combine memory and compaction to run a research task that spans far more material than fits in one window - a literature sweep, a competitive teardown, a standards review - where the agent's notes file becomes the durable artifact and the context is repeatedly reshaped around it.
Full documentation regeneration. Have the agent read an entire codebase in context and regenerate docs that stay consistent with the actual implementation, emitting long structured output and using memory to keep terminology and structure uniform across hundreds of pages.
Dependency upgrades at scale. Task a bounded run (task budgets on) with upgrading a dependency across a large monorepo, resolving the cascade of breaking changes with the whole tree visible in context rather than one package at a time.
A long-lived maintenance agent. Give an agent a memory file as its persistent brain and a task budget as its leash, and let it work a backlog over a long session - triaging issues, drafting fixes, updating notes - so its accumulated context survives across many context-window turnovers.

For every one of these, the discipline from Part 3 still applies: check stop_reason on each call, keep an Opus 4.8 fallback, and treat refusal rate as a health metric. A long-horizon run has more surface area for a silent refusal to corrupt the whole job, so the reliability work is not optional just because the model is more capable.
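That checklist can be sketched as a thin wrapper around whatever client you use. Everything here is an assumption for illustration: responses are treated as plain dicts with a `stop_reason` key, the `"refusal"` value is assumed rather than taken from documentation, and `primary` and `fallback` are caller-supplied functions standing in for Fable 5 and Opus 4.8 calls.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "refusal"  # assumption: the stop_reason value that marks a refusal


@dataclass
class FleetHealth:
    calls: int = 0
    refusals: int = 0

    @property
    def refusal_rate(self) -> float:
        """Share of primary calls that ended in a refusal."""
        return self.refusals / self.calls if self.calls else 0.0


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], dict],
    fallback: Callable[[str], dict],
    health: FleetHealth,
) -> dict:
    """Call the primary model; on refusal, count it and use the fallback."""
    health.calls += 1
    response = primary(prompt)
    if response.get("stop_reason") == REFUSAL:
        health.refusals += 1
        response = fallback(prompt)
    return response
```

Alerting on `health.refusal_rate` turns a silent failure into a visible one, and routing every call through one function keeps the fallback path from being forgotten in step 400 of a long run.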

The honest bottom line

Fable 5's long-horizon story is real where it counts: the primitives that make day-long, large-context, memory-backed runs possible are documented and available. The most eye-catching number, the Stripe migration, is a vendor claim that tells you the ceiling is high, not that your run will hit it. The right move is to build on the verified primitives, run a scoped pilot on one of the projects above, measure coherence and cost on your own workload, and let the results - not the launch post - tell you how far to push.

Frequently Asked Questions

Does 1M context mean I no longer need retrieval or memory layers?

No. A larger context reduces how much stitching and retrieval you need for a given job, but a million tokens still turns over on a long enough run. The memory tool exists precisely because durable state has to outlive the window. Anthropic's own reported result, that file memory roughly tripled long-horizon gains over Opus 4.8, suggests memory is the load-bearing primitive for long runs, not raw window size.

Is the Stripe 50-million-line migration something I can rely on?

Treat it as vendor-reported. It is a credible signal from Anthropic's launch post that a very large single-run migration is a real workload, but it was performed under conditions you do not control. Use it as a reason to pilot on a subsystem of your own codebase and measure, not as a number to promise to stakeholders.

How do I keep a day-long run from spending an unbounded amount?

Use task budgets (beta) to cap the cost and depth of a run, and tune reasoning depth with the effort parameter. Adaptive thinking is always on and cannot be switched off, so cost control lives in budgets and effort, not in a thinking toggle. For fleets, also apply the retry-budget discipline from Part 3 so refusal-driven fallbacks do not quietly multiply spend.

What is verified versus just plausible about Fable 5's long-horizon claims?

Verified and documented: 1M context, 128K output, the memory tool, code execution, programmatic tool calling, compaction and context editing (beta), and task budgets (beta). Vendor or benchmark reported: the Stripe migration, top FrontierCode and CursorBench scores, and the file-memory gains over Opus 4.8. Plausible but unproven for your case: that a run will stay coherent and land on-budget for your specific codebase. That last category is what a pilot answers.
