
TL;DR
A long-running coding agent is only useful if the environment around it can queue tasks, capture logs, checkpoint state, verify behavior, limit cost, and recover from failure.
The dream version of agents is simple: give the task, close the laptop, wake up to a clean pull request.
The real version is messier. The agent gets stuck on a missing environment variable. A test hangs. A package install fails. The browser never opens. The database seed is stale. The model keeps retrying the same command. The diff is technically correct but unreviewable.
That is not a model problem alone. It is a harness problem.
Long-running agents need infrastructure around them. The model is only one piece. The harness is what gives the run shape: task queue, workspace, tools, logs, checkpoints, budget, verification, and final review.
For the reliability math, read the agent reliability cliff. For debugging runs after they fail, read how to debug AI agent workflows.
An agent harness is the system that wraps the model and the tools.
It answers practical questions: what the agent may edit, which commands it can run, how progress gets recorded, when the run should stop, and how anyone knows the task is actually done.
Without a harness, a long-running agent is just a chat session with a lot of rope.
For coding work, the minimum useful harness has seven parts.
1. A task contract. The task should include the goal, constraints, acceptance criteria, file boundaries, and verification commands. Vague tasks produce vague diffs.
2. A scoped workspace. The agent should work in a repo, branch, sandbox, or worktree with clear boundaries. It should know what it can edit and what it should leave alone.
3. Tool policy. The harness should define safe reads, safe writes, risky commands, denied commands, network access, and approval gates.
4. Persistent logs. Every command, tool call, browser action, and test result should be captured. If the run fails, you need the transcript.
5. Checkpoints. Long tasks should save state after meaningful milestones: plan accepted, implementation done, tests passing, review complete.
6. Verification. The harness should run the actual checks that prove the task is done: tests, lint, typecheck, browser smoke, API probe, screenshot, or deploy health route.
7. Final receipt. The output should say what changed, what passed, what failed, what remains risky, and where to inspect the diff.
That is the baseline. Anything less is a demo.
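Here is a minimal sketch of the task contract, tool policy, and checkpoint pieces expressed as data. The field names are illustrative, not from any particular framework; adapt them to whatever runner you use.

```typescript
// Illustrative types for a harness task contract, tool policy, and checkpoints.
// Names are hypothetical -- adjust to your own harness.

interface TaskContract {
  goal: string;                      // one sentence, e.g. "fix the checkout bug"
  constraints: string[];             // "no schema changes", "keep bundle size flat"
  acceptanceCriteria: string[];      // observable outcomes, not vibes
  fileBoundaries: {
    editable: string[];              // globs the agent may modify
    readOnly: string[];              // context it may read but must leave alone
  };
  verification: string[];            // commands that must pass, e.g. "pnpm test checkout"
}

interface ToolPolicy {
  safeReads: string[];               // always allowed
  safeWrites: string[];              // allowed inside fileBoundaries.editable
  riskyCommands: string[];           // require an approval gate
  deniedCommands: string[];          // never run, e.g. force pushes
  networkAccess: "none" | "allowlist" | "full";
}

interface Checkpoint {
  name: "plan-accepted" | "implementation-done" | "tests-passing" | "review-complete";
  commitSha: string;                 // workspace state to roll back to
  timestamp: string;
}
```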
Long-running agents fail economically before they fail technically.
A stuck loop can burn tokens for an hour. A cloud agent can keep a sandbox alive while making no progress. A browser session can collect screenshots and logs until the context window is useless.
The harness should track tokens and dollars spent, wall-clock time, tool calls made, and whether the run is still reaching checkpoints.
Then it should stop the run when the budget is exhausted or progress stalls.
This is the practical side of agent FinOps. You do not need perfect accounting. You need enough telemetry to catch runaway work before the invoice does.
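A rough sketch of that guard, assuming the harness exposes simple counters for spend, elapsed time, and checkpoint progress. The names are invented for illustration.

```typescript
// Hypothetical budget guard: stop the run when spend, time,
// or progress crosses a limit. Counters are assumed to be
// supplied by the harness runtime.

interface RunBudget {
  maxTokens: number;
  maxWallClockMs: number;
  maxStepsWithoutCheckpoint: number; // stall detection
}

interface RunTelemetry {
  tokensUsed: number;
  startedAt: number;                 // epoch ms when the run began
  stepsSinceLastCheckpoint: number;
}

function shouldStop(budget: RunBudget, t: RunTelemetry): string | null {
  if (t.tokensUsed >= budget.maxTokens) return "token budget exhausted";
  if (Date.now() - t.startedAt >= budget.maxWallClockMs) return "wall-clock budget exhausted";
  if (t.stepsSinceLastCheckpoint >= budget.maxStepsWithoutCheckpoint) {
    return "no checkpoint reached recently -- run appears stalled";
  }
  return null; // keep going
}
```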
Agents are very good at declaring victory.
That is why the harness should decide what done means. If the task says "fix the checkout bug," the final answer is not enough. The harness should require the checkout test, the API route probe, or a browser flow through the checkout UI.
For frontend work, that might mean:
pnpm typecheck
pnpm test checkout
open browser
complete checkout flow
capture screenshot
check console errors
For backend work:
run focused unit tests
run migration dry-run
hit health endpoint
inspect logs
verify no unexpected schema drift
The exact checks vary. The principle does not: long-running agents need external proof.
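One way to encode that proof is to run the verification commands from the task contract and record pass/fail for each. This sketch uses Node's built-in child_process; the command list here is just the frontend checks from above.

```typescript
// Sketch of a verification step: run each check from the task
// contract and record pass/fail with captured output.
import { execSync } from "node:child_process";

interface CheckResult {
  command: string;
  passed: boolean;
  output: string;
}

function runVerification(commands: string[]): CheckResult[] {
  return commands.map((command) => {
    try {
      const output = execSync(command, { encoding: "utf8", stdio: "pipe" });
      return { command, passed: true, output };
    } catch (err: any) {
      // A non-zero exit code throws; keep whatever output exists.
      return { command, passed: false, output: String(err.stdout ?? err.message) };
    }
  });
}

// "Done" means every external check passed, not that the model said so.
const results = runVerification(["pnpm typecheck", "pnpm test checkout"]);
const done = results.every((r) => r.passed);
```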
The harness should not remove the human. It should move the human to the right point.
Humans should review the task contract, the final diff, the verification results, and any change the tool policy flagged as risky.
Humans should not babysit retries, package installs, individual tool calls, or the log scroll of a run that is still inside its budget.
That is the division of labor that makes agents useful.
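The artifact the human actually reviews can be a structured receipt rather than the raw transcript. This shape is an assumption, not a standard; it just mirrors the fields named above.

```typescript
// Hypothetical shape for a run's final receipt -- the one
// artifact a human reviews instead of babysitting the run.
interface RunReceipt {
  task: string;                 // the original goal
  branch: string;               // where to inspect the diff
  changedFiles: string[];
  checksPassed: string[];       // e.g. "pnpm typecheck", "pnpm test checkout"
  checksFailed: string[];       // anything still failing
  openRisks: string[];          // what remains risky or unverified
  costUsd: number;              // from the budget telemetry
}
```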
Long-running agents do not become reliable because the model got smarter. They become reliable because the system around the model got more disciplined.
The harness is the product. It is what turns an impressive demo into a repeatable workflow.
If your agent cannot show the task contract, logs, checkpoints, verification, cost, and final receipt, it is not ready to run while you sleep.