AI Agent Containment Needs a Capability Ledger

Anthropic published a useful engineering post in late May 2026 on how it contains Claude across products, and the Hacker News thread immediately turned into the right argument: sandboxing helps, but it does not magically solve prompt injection, egress, credential scope, or the weird trust boundary between "the agent saw a thing" and "the agent can now act on it."

That is the real story for developers building with Claude Code, Codex, MCP tools, background agents, and automated review loops.

The old security model was:

Ask the model to behave. Ask the user for approval. Log what happened.

The new model needs to be:

Give the agent a deterministic capability ledger. Every file, token, network path, tool, identity, and escalation has to be scoped, recorded, revocable, and reviewable.

The post is worth reading because it moves the conversation away from "Claude is safer because it says no" and toward something closer to operating-system design. A model instruction is a preference. A sandbox boundary is a fact. A scoped credential is a fact. A network egress rule is a fact. A short-lived per-session token is a fact.

AI agent security gets better when more of the safety story becomes factual.

The Important Shift: Containment Before Behavior

Anthropic's framing is simple: contain the environment first, then steer the model. That sounds obvious until you look at how most developers actually run agents.

They install a terminal agent in a real repo. It runs as their user. It can read the files the user can read. It can often see local environment variables. It can run package installers. It can call GitHub, Slack, Linear, Gmail, or a database through an MCP server. It can ingest untrusted issue text, docs, webpages, test output, dependency readmes, and CI logs. Then the user approves commands one by one until approval fatigue kicks in.

That is not containment. That is vibes with a confirmation dialog.

This is the same operational theme behind prompt injection in open source, agent memory as a context ledger, and long-running agent harnesses. The agent is not dangerous because it can write code. It is dangerous because code execution, private context, and external communication can land in the same session without a durable policy object in the middle.

Simon Willison describes a closely related combination as the lethal trifecta for AI agents: access to private data, exposure to untrusted content, and the ability to communicate externally. Anthropic's post is basically a product-engineering answer to that trifecta.

The HN pushback sharpened the point. Commenters raised domain fronting, steganography in commits, timing side channels, malicious artifacts that cross from a low-privilege VM into a high-privilege local workflow, and the fact that Docker is not always the boundary people think it is. That does not make the containment work useless. It means "sandboxed" is not a binary label.

Containment has dimensions.

The Capability Ledger

A capability ledger is the missing product primitive for agent runtimes.

It is not just a permission screen. It is a structured record of what the agent is allowed to touch and why:

Filesystem scope: which directories are readable, writable, or explicitly off limits.
Network scope: which domains and protocols are available, and whether output can leave the environment.
Credential scope: which tokens exist inside the agent environment, how narrow they are, and when they expire.
Tool scope: which MCP servers, CLIs, browser sessions, databases, and hosted APIs are callable.
Identity scope: whether the agent acts as the user, as a session identity, or as a service account.
Review scope: which outputs can move from the sandbox into the real repo, CI, package registry, or customer-facing system.
Memory scope: which facts persist after the run, who can edit them, and how deletion works.
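
As a rough illustration, the seven scopes above could be captured in one small, serializable record. This is a minimal sketch, not a vendor schema: the class name `CapabilityLedger`, every field name, and the example values are assumptions made up for this example.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CapabilityLedger:
    """Hypothetical per-run ledger; fields mirror the scopes listed above."""
    run_id: str
    filesystem: dict = field(default_factory=lambda: {"read": [], "write": [], "deny": []})
    network: dict = field(default_factory=lambda: {"allow_hosts": [], "egress": False})
    credentials: list = field(default_factory=list)  # e.g. {"name", "scope", "expires_at"}
    tools: list = field(default_factory=list)        # MCP servers, CLIs, hosted APIs
    identity: str = "session"                        # "user", "session", or "service-account"
    review: list = field(default_factory=list)       # destinations that require review
    memory: dict = field(default_factory=lambda: {"persist": False, "editable_by": []})

    def to_json(self) -> str:
        # Sorted keys keep the serialized ledger stable, so it diffs cleanly in review.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only.
ledger = CapabilityLedger(
    run_id="run-001",
    filesystem={"read": ["./src"], "write": ["./src"], "deny": ["./.env"]},
    network={"allow_hosts": ["registry.npmjs.org"], "egress": True},
    tools=["github-mcp"],
)
print(ledger.to_json())
```

The defaults are deliberately closed: no egress, no persistence, a session identity. Anything broader has to be written down to exist.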

That ledger should live alongside the run, not buried in a settings UI. When an agent opens a PR, ships a migration, comments on an issue, or drafts a release, the review should include the ledger.

What did it read? What did it write? Which external systems did it touch? Which private values were present? Which untrusted sources were mixed into the same context? Which approvals were granted? Which approvals were denied? Which policy widened during the run?

This is why agent receipts matter. A diff tells you what changed. A capability receipt tells you what the agent could have done.
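
One way to make that difference concrete is to compare three sets: what was granted at the start, what was granted by the end, and what was actually used. The sketch below assumes capabilities are plain strings such as `fs:write:src`; that naming scheme and the function `ledger_delta` are hypothetical.

```python
def ledger_delta(initial: set, final: set, used: set) -> dict:
    """Summarize how a run's capabilities relate to what it actually did."""
    return {
        # Granted mid-run: answers "which policy widened during the run?"
        "widened": sorted(final - initial),
        # Held but never exercised: what the agent could have done but did not.
        "unused": sorted(final - used),
        # Used without a grant: should always be empty if enforcement works.
        "out_of_policy": sorted(used - final),
    }


initial = {"fs:read:src", "fs:write:src"}
final = initial | {"net:registry.npmjs.org"}
used = {"fs:read:src", "net:registry.npmjs.org"}
print(ledger_delta(initial, final, used))
```

The `unused` list is the part a diff can never show, and it is a natural input for tightening the profile on the next run.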

Approval Prompts Are Not Enough

Most local agent tools still lean heavily on interactive approval prompts. That makes sense for early power users. It is also not a long-term security model.

Approval prompts fail in predictable ways:

They show individual commands, not the full capability graph.
They arrive at the moment the user is trying to keep flow.
They make common safe actions and rare dangerous actions feel visually similar.
They do not explain what private context is currently in the model's working set.
They rarely survive as an audit artifact after the run.

If the agent asks to run npm install, what is the user actually approving? Package downloads? Lifecycle scripts? Network calls? Native compilation? Access to the current directory? Reading .npmrc? A future test command that imports the new package?

The right answer is not "never run package installs." The right answer is that package installation should be a named capability with a scoped environment, no unnecessary secrets, a dependency diff, and a clear path back to review. This is the same reason the OpenAI Codex cloud security playbook is more useful than a generic "be careful with agents" warning: the product boundary matters.
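
A minimal sketch of what "package install as a named capability" might look like: the install runs with an allowlisted environment and with lifecycle scripts disabled. The allowlist contents and helper names are assumptions for illustration; `--ignore-scripts` is npm's own flag for skipping lifecycle scripts.

```python
import os

# Hypothetical allowlist: only the variables an install plausibly needs.
INSTALL_ENV_ALLOWLIST = frozenset({"PATH", "HOME", "LANG", "TMPDIR"})


def scoped_env(env: dict, allowlist=INSTALL_ENV_ALLOWLIST) -> dict:
    """Return a copy of env containing only allowlisted variables."""
    return {key: value for key, value in env.items() if key in allowlist}


def install_command(ignore_scripts: bool = True) -> list:
    """Build the npm command; lifecycle scripts stay off unless explicitly enabled."""
    cmd = ["npm", "install"]
    if ignore_scripts:
        cmd.append("--ignore-scripts")
    return cmd


env = scoped_env(dict(os.environ))
print(install_command(), sorted(env))
# To execute inside a sandbox directory (not run here):
# subprocess.run(install_command(), env=env, cwd=sandbox_dir, check=True)
```

Scrubbing the environment does not cover everything: an `.npmrc` under `HOME` can still hold a registry token, so the filesystem scope has to account for it separately.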

Egress Is the Hard Part

The most interesting part of the HN discussion was not whether Anthropic's exact implementation is perfect. It was the repeated point that exfiltration is the hard part.

If an agent can see private data and can also communicate externally, then prompt injection becomes more than a content-quality problem. It becomes a data-flow problem.

You can block obvious bad domains. You can proxy network calls. You can strip secrets from logs. You can require approval before posting to Slack or opening a browser. Those controls help, but the counterarguments are real:

A trusted domain can be used as a carrier.
Public repo commits can encode data.
Timing and ordering can leak bits.
Generated artifacts can carry a second-stage instruction into a more privileged workflow.
A malicious dependency can turn "just run the tests" into a broader execution path.

This is where "allowlist this domain" becomes too vague. A domain allowlist is not just a connectivity rule. It is an output capability. If the agent can shape a request to a domain, the agent has some ability to transmit information through that channel.

That does not mean agents are unusable. It means egress should be explicit and boring. A coding agent with repo write access does not automatically need access to your email. A research agent with browser access does not automatically need your filesystem. A local file analysis agent does not automatically need internet access. A deployment agent does not automatically need package-publishing credentials.

Separate the roles. Separate the identities. Separate the network.
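
Treating egress as an explicit capability can start with something as plain as a per-host, per-method table that denies by default. The policy below is a sketch under assumptions: the hosts, the `EGRESS_POLICY` structure, and the function name are illustrative, not taken from any product.

```python
from urllib.parse import urlsplit

# Hypothetical policy: host -> HTTP methods the agent may use against it.
EGRESS_POLICY = {
    "docs.python.org": {"GET"},
    "api.github.com": {"GET", "POST"},
}


def egress_allowed(method: str, url: str, policy=EGRESS_POLICY) -> bool:
    """Deny by default; allow only HTTPS requests to an exact allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact host match, so "docs.python.org.evil.example" is rejected.
    return method.upper() in policy.get(host, set())


print(egress_allowed("GET", "https://docs.python.org/3/"))
print(egress_allowed("POST", "https://docs.python.org/3/"))
```

Even an allowed GET is still an output channel, because the path and query string are attacker-shapeable. A check like this narrows the surface; logging each allowed request is what makes the remaining channel reviewable.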

MCP Makes This More Urgent

The Model Context Protocol made agent tools easier to connect. That is good. It also made it easier to accidentally turn a chat session into a dense graph of real capabilities.

An MCP server can expose a database query, a CRM action, a GitHub mutation, a local filesystem tool, a Slack sender, a browser, or a custom internal workflow. Each one sounds small in isolation. Together, they become the agent's operating surface.

That surface needs a ledger.

For MCP, the ledger should answer:

Which server provided the tool?
Which version or commit of the server was loaded?
Which tool schema did the model see?
Was the tool read-only, write-capable, or side-effecting?
Which credentials backed the call?
Was the returned content trusted or untrusted?
Did returned content enter a later write-capable step?
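
Those questions map fairly directly onto a per-call record. The sketch below is not part of the MCP specification: the `ToolCall` fields, the schema fingerprint, and the deliberately conservative taint rule are all assumptions for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


def schema_fingerprint(schema: dict) -> str:
    """Hash the tool schema the model saw, using a canonical JSON encoding."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolCall:
    server: str
    server_version: str
    tool: str
    schema_sha256: str
    effect: str           # "read", "write", or "side-effect"
    credential: str       # name of the credential that backed the call
    output_trusted: bool  # was the returned content from a trusted source?


def tainted_writes(calls: list) -> list:
    """Flag non-read calls made after any untrusted output entered the context."""
    tainted = False
    flagged = []
    for call in calls:
        if tainted and call.effect != "read":
            flagged.append(call)
        if not call.output_trusted:
            tainted = True
    return flagged


fp = schema_fingerprint({"name": "get_issue", "input": {"id": "integer"}})
calls = [
    ToolCall("github-mcp", "1.2.0", "get_issue", fp, "read", "gh-readonly", False),
    ToolCall("github-mcp", "1.2.0", "create_comment", fp, "write", "gh-write", True),
]
print([call.tool for call in tainted_writes(calls)])
```

The taint rule over-approximates on purpose: once untrusted content is in context, every later write is flagged for review rather than trying to prove the content had no influence.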

This is especially important because tool descriptions are part of the model context. A compromised or sloppy tool can lie about what it does. A server can advertise a harmless description and still return content that changes the next step. That is why MCP debugging needs traces, as covered in MCP debugging with MCP Lens, but security needs a policy layer above traces.

Tracing shows what happened. A capability ledger shows what was possible.

What Teams Should Build Now

If you are adopting coding agents inside a real team, you do not need to wait for every vendor to standardize this. You can start with a practical containment baseline.

First, split agent profiles by job.

Create separate profiles for research, local code editing, dependency work, production debugging, and deployment. Give each profile the minimum useful capabilities. The research profile can browse and summarize but cannot see secrets. The local editing profile can read and write the repo but cannot push or access broad cloud credentials. The deployment profile can operate only from CI with protected environment rules.
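
A first pass at profiles can be a small table checked in next to the repo. Everything here is a hypothetical sketch: the profile names follow the paragraph above, but the capability strings and the `ci_only` flag are invented for the example.

```python
# Hypothetical profiles; capability names are illustrative.
PROFILES = {
    "research": {"caps": frozenset({"browse", "summarize"}), "ci_only": False},
    "local-edit": {"caps": frozenset({"repo_read", "repo_write"}), "ci_only": False},
    "deploy": {"caps": frozenset({"repo_read", "deploy"}), "ci_only": True},
}


def allowed(profile: str, capability: str, in_ci: bool = False) -> bool:
    """Deny unknown profiles, enforce CI-only profiles, then check the capability."""
    spec = PROFILES.get(profile)
    if spec is None:
        return False
    if spec["ci_only"] and not in_ci:
        return False
    return capability in spec["caps"]


print(allowed("research", "browse"))
print(allowed("local-edit", "push"))
print(allowed("deploy", "deploy", in_ci=True))
```

Keeping the table as data rather than prose means a change to a profile shows up in code review like any other change.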

Second, move credentials out of the default environment.

Do not let every agent inherit the same shell session your human user has. Use short-lived tokens. Use repository-scoped tokens. Use service accounts. Use protected CI environments. Make the credential radius visible in review.
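
To make the credential radius visible, each credential handed to a run can carry its scope and expiry as data. This sketch only models the record that would appear in review; it does not mint real tokens, and `SessionCredential`, `issue`, and the scope string are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SessionCredential:
    name: str
    scope: str            # illustrative scope label, e.g. one repo, read-only
    expires_at: datetime

    def is_valid(self, now=None) -> bool:
        if now is None:
            now = datetime.now(timezone.utc)
        return now < self.expires_at


def issue(name: str, scope: str, ttl_minutes: int = 30, now=None) -> SessionCredential:
    """Record a short-lived credential; the TTL is fixed at issue time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return SessionCredential(name, scope, now + timedelta(minutes=ttl_minutes))


cred = issue("github", "repo:example/app:contents:read", ttl_minutes=30)
print(cred.scope, cred.is_valid())
```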

Third, treat network access as a write permission.

Outbound network is not just "internet." It is a channel. For some tasks, no network is the correct default. For other tasks, read-only docs access is enough. For still others, a small allowlist with request logging is the right compromise.

Fourth, gate artifact movement.

The dangerous moment is often not inside the sandbox. It is when the artifact leaves it: a patch, a generated config, a dependency lockfile, a browser-exported file, a migration, a release note, or a copied prompt. Make that movement reviewable.
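
A gate at the sandbox boundary can begin as a manifest: hash every outgoing file and mark the ones that deserve a human look. The patterns below are an assumed starting point, not a complete list, and `artifact_manifest` is a hypothetical helper.

```python
import hashlib
from fnmatch import fnmatchcase

# Hypothetical patterns for artifacts that should never cross unreviewed.
SENSITIVE_PATTERNS = [
    "*.lock",
    "package-lock.json",
    "migrations/*",
    ".github/workflows/*",
]


def artifact_manifest(files: dict) -> list:
    """Map {path: bytes} to a sorted manifest with content hashes and review flags."""
    manifest = []
    for path in sorted(files):
        manifest.append({
            "path": path,
            "sha256": hashlib.sha256(files[path]).hexdigest(),
            "needs_review": any(fnmatchcase(path, pat) for pat in SENSITIVE_PATTERNS),
        })
    return manifest


outgoing = {
    "src/app.py": b"print('hi')\n",
    "migrations/001_init.sql": b"CREATE TABLE t (id INTEGER);\n",
    "package-lock.json": b"{}\n",
}
for entry in artifact_manifest(outgoing):
    print(entry["path"], entry["needs_review"])
```

The hashes matter as much as the flags: they let a reviewer confirm that what was approved is byte-for-byte what landed in the real repo.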

Fifth, store receipts with the work.

For every agent run that touches production code, store a small receipt: policy profile, file scope, network scope, tools used, credentials available, tests run, and human review point. This can be a markdown artifact at first. It does not need to be fancy. It does need to survive the chat.
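
The markdown version of such a receipt fits in a few lines. The field keys below follow the list in the paragraph above; the function name, key spellings, and sample values are assumptions for illustration.

```python
RECEIPT_FIELDS = (
    "policy_profile",
    "file_scope",
    "network_scope",
    "tools_used",
    "credentials_available",
    "tests_run",
    "human_review",
)


def render_receipt(receipt: dict) -> str:
    """Render a run receipt as markdown; missing fields are made visible, not hidden."""
    lines = [f"# Agent run receipt: {receipt['run_id']}", ""]
    for key in RECEIPT_FIELDS:
        value = receipt.get(key, "not recorded")
        if isinstance(value, (list, tuple)):
            value = ", ".join(value) or "none"
        lines.append(f"- {key.replace('_', ' ')}: {value}")
    return "\n".join(lines) + "\n"


text = render_receipt({
    "run_id": "run-001",
    "policy_profile": "local-edit",
    "file_scope": ["./src"],
    "network_scope": "none",
    "tools_used": ["git", "pytest"],
    "credentials_available": [],
    "tests_run": "pytest -q",
})
print(text)
```

Writing "not recorded" for a missing field is a deliberate choice: a gap in the receipt should be as visible to the reviewer as the entries that are present.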

The Take

The next competition between AI coding tools will not just be model quality. It will be runtime trust.

Claude Code, Codex, Cursor, Copilot, Devin-style cloud agents, MCP-heavy workflows, and internal agent platforms are all moving toward the same place: agents that can work for longer, touch more systems, and need less babysitting. That only works if the surrounding runtime gets more deterministic as the model gets more capable.

Prompting the model to be careful is table stakes. Asking the user to approve every shell command is a temporary bridge. The durable layer is a capability ledger that makes every run inspectable:

what the agent could read;
what it could write;
where it could send data;
which identity it used;
which untrusted inputs entered context;
which artifacts crossed the boundary;
and who accepted the residual risk.

That is the lesson developers should take from Anthropic's containment post: stop treating agent security as a personality trait. Treat it as runtime accounting.

FAQ

What is AI agent containment?

AI agent containment is the practice of limiting what an agent can read, write, execute, and communicate with while it works. Strong containment uses environment boundaries, scoped credentials, network controls, and review gates instead of relying only on model instructions.

Why are approval prompts not enough for coding agents?

Approval prompts show individual actions, but they rarely show the full capability graph. A user may approve a command without seeing which secrets, files, tools, or outbound channels are also available in the same session.

What is a capability ledger?

A capability ledger is a durable record of an agent run's permissions: filesystem access, network access, credentials, tool calls, identity, memory, and review boundaries. It helps reviewers understand not just what changed, but what the agent was capable of doing.

Does sandboxing solve prompt injection?

No. Sandboxing reduces blast radius, but prompt injection can still matter when untrusted content, private data, and external communication share a workflow. Sandboxes need egress controls, scoped credentials, artifact review, and clear identity boundaries.

How should teams start securing AI coding agents?

Start by separating agent profiles by job, removing broad credentials from default shells, treating network access as a write permission, gating artifact movement out of sandboxes, and storing receipts for every meaningful agent run.

Official Sources

How We Contain Claude - Anthropic Engineering: Anthropic's engineering post on containment strategies across Claude products.
The Lethal Trifecta - Simon Willison: Analysis of the dangerous combination of private data, untrusted content, and external communication.
Claude Code Overview: Official documentation for Anthropic's terminal-based coding agent.
Claude Code Security: Security model and permission controls in Claude Code.
Model Context Protocol Specification: Official MCP spec for tool and resource integration.
MCP Security Best Practices: Attack vectors, trust boundaries, and mitigations for MCP implementations.