
TL;DR
A long-running coding agent is only useful if the environment around it can queue tasks, capture logs, checkpoint state, verify behavior, limit cost, and recover from failure.
The dream version of agents is simple: give the task, close the laptop, wake up to a clean pull request.
The real version is messier. The agent gets stuck on a missing environment variable. A test hangs. A package install fails. The browser never opens. The database seed is stale. The model keeps retrying the same command. The diff is technically correct but unreviewable.
That is not a model problem alone. It is a harness problem.
Long-running agents need infrastructure around them. The model is only one piece. The harness is what gives the run shape: task queue, workspace, tools, logs, checkpoints, budget, verification, and final review.
For the reliability math, read the agent reliability cliff. For debugging runs after they fail, read how to debug AI agent workflows.
An agent harness is the system that wraps the model and the tools.
It answers practical questions: what the agent may edit, which commands it can run, how progress gets recorded, when the run should stop, and how anyone knows the task is actually done.
Without a harness, a long-running agent is just a chat session with a lot of rope.
For coding work, the minimum useful harness has seven parts.
1. A task contract. The task should include the goal, constraints, acceptance criteria, file boundaries, and verification commands. Vague tasks produce vague diffs.
2. A scoped workspace. The agent should work in a repo, branch, sandbox, or worktree with clear boundaries. It should know what it can edit and what it should leave alone.
3. Tool policy. The harness should define safe reads, safe writes, risky commands, denied commands, network access, and approval gates.
4. Persistent logs. Every command, tool call, browser action, and test result should be captured. If the run fails, you need the transcript.
5. Checkpoints. Long tasks should save state after meaningful milestones: plan accepted, implementation done, tests passing, review complete.
6. Verification. The harness should run the actual checks that prove the task is done: tests, lint, typecheck, browser smoke, API probe, screenshot, or deploy health route.
7. Final receipt. The output should say what changed, what passed, what failed, what remains risky, and where to inspect the diff.
That is the baseline. Anything less is a demo.
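Here is a minimal sketch of the task contract, tool policy, and checkpoint pieces expressed as data. The field names are illustrative, not from any particular framework; adapt them to whatever runner you use.

```typescript
// Illustrative types for a harness task contract, tool policy, and checkpoints.
// Names are hypothetical -- adjust to your own harness.

interface TaskContract {
  goal: string;                      // one sentence, e.g. "fix the checkout bug"
  constraints: string[];             // "no schema changes", "keep bundle size flat"
  acceptanceCriteria: string[];      // observable outcomes, not vibes
  fileBoundaries: {
    editable: string[];              // globs the agent may modify
    readOnly: string[];              // context it may read but must leave alone
  };
  verification: string[];            // commands that must pass, e.g. "pnpm test checkout"
}

interface ToolPolicy {
  safeReads: string[];               // always allowed
  safeWrites: string[];              // allowed inside fileBoundaries.editable
  riskyCommands: string[];           // require an approval gate
  deniedCommands: string[];          // never run, e.g. force pushes
  networkAccess: "none" | "allowlist" | "full";
}

interface Checkpoint {
  name: "plan-accepted" | "implementation-done" | "tests-passing" | "review-complete";
  commitSha: string;                 // workspace state to roll back to
  timestamp: string;
}
```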
Long-running agents fail economically before they fail technically.
A stuck loop can burn tokens for an hour. A cloud agent can keep a sandbox alive while making no progress. A browser session can collect screenshots and logs until the context window is useless.
The harness should track tokens and dollars spent, wall-clock time, tool calls made, and whether the run is still reaching checkpoints.
Then it should stop the run when the budget is exhausted or progress stalls.
This is the practical side of agent FinOps. You do not need perfect accounting. You need enough telemetry to catch runaway work before the invoice does.
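A rough sketch of that guard, assuming the harness exposes simple counters for spend, elapsed time, and checkpoint progress. The names are invented for illustration.

```typescript
// Hypothetical budget guard: stop the run when spend, time,
// or progress crosses a limit. Counters are assumed to be
// supplied by the harness runtime.

interface RunBudget {
  maxTokens: number;
  maxWallClockMs: number;
  maxStepsWithoutCheckpoint: number; // stall detection
}

interface RunTelemetry {
  tokensUsed: number;
  startedAt: number;                 // epoch ms when the run began
  stepsSinceLastCheckpoint: number;
}

function shouldStop(budget: RunBudget, t: RunTelemetry): string | null {
  if (t.tokensUsed >= budget.maxTokens) return "token budget exhausted";
  if (Date.now() - t.startedAt >= budget.maxWallClockMs) return "wall-clock budget exhausted";
  if (t.stepsSinceLastCheckpoint >= budget.maxStepsWithoutCheckpoint) {
    return "no checkpoint reached recently -- run appears stalled";
  }
  return null; // keep going
}
```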
Agents are very good at declaring victory.
That is why the harness should decide what done means. If the task says "fix the checkout bug," the final answer is not enough. The harness should require the checkout test, the API route probe, or a browser flow through the checkout UI.
For frontend work, that might mean:
pnpm typecheck
pnpm test checkout
open browser
complete checkout flow
capture screenshot
check console errors
For backend work:
run focused unit tests
run migration dry-run
hit health endpoint
inspect logs
verify no unexpected schema drift
The exact checks vary. The principle does not: long-running agents need external proof.
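One way to encode that proof is to run the verification commands from the task contract and record pass/fail for each. This sketch uses Node's built-in child_process; the command list here is just the frontend checks from above.

```typescript
// Sketch of a verification step: run each check from the task
// contract and record pass/fail with captured output.
import { execSync } from "node:child_process";

interface CheckResult {
  command: string;
  passed: boolean;
  output: string;
}

function runVerification(commands: string[]): CheckResult[] {
  return commands.map((command) => {
    try {
      const output = execSync(command, { encoding: "utf8", stdio: "pipe" });
      return { command, passed: true, output };
    } catch (err: any) {
      // A non-zero exit code throws; keep whatever output exists.
      return { command, passed: false, output: String(err.stdout ?? err.message) };
    }
  });
}

// "Done" means every external check passed, not that the model said so.
const results = runVerification(["pnpm typecheck", "pnpm test checkout"]);
const done = results.every((r) => r.passed);
```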
The harness should not remove the human. It should move the human to the right point.
Humans should review the task contract, the final diff, the verification results, and any change the tool policy flagged as risky.
Humans should not babysit retries, package installs, individual tool calls, or the log scroll of a run that is still inside its budget.
That is the division of labor that makes agents useful.
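The artifact the human actually reviews can be a structured receipt rather than the raw transcript. This shape is an assumption, not a standard; it just mirrors the fields named above.

```typescript
// Hypothetical shape for a run's final receipt -- the one
// artifact a human reviews instead of babysitting the run.
interface RunReceipt {
  task: string;                 // the original goal
  branch: string;               // where to inspect the diff
  changedFiles: string[];
  checksPassed: string[];       // e.g. "pnpm typecheck", "pnpm test checkout"
  checksFailed: string[];       // anything still failing
  openRisks: string[];          // what remains risky or unverified
  costUsd: number;              // from the budget telemetry
}
```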
Long-running agents do not become reliable because the model got smarter. They become reliable because the system around the model got more disciplined.
The harness is the product. It is what turns an impressive demo into a repeatable workflow.
If your agent cannot show the task contract, logs, checkpoints, verification, cost, and final receipt, it is not ready to run while you sleep.