
AI Agents Deep Dive
11 partsTL;DR
The math of agent pipelines is brutal. 85% reliability per step compounds to about 20% at 10 steps. Here is why long chains collapse in production, and the six patterns the field has converged on to fight the decay.
Read next
From single-agent baselines to multi-level hierarchies, these are the seven patterns for wiring AI agents together in production. Each with a decision rule, an implementation sketch, and the tradeoffs that actually matter.
10 min readA new study from nrehiew quantifies a problem every Claude Code, Cursor, and Codex user has felt: models making huge diffs for tiny fixes. Here is why it happens, why tests do not catch it, and what to do about it.
8 min readHow to use Claude Code's Task tool, custom sub-agents, and worktrees to run parallel development workflows. Real prompt examples, agent configurations, and workflow patterns from daily use.
11 min readYour agent pipeline is a compound probability problem, and the compound destroys you faster than you think.
Assume a single agent step succeeds 85 percent of the time. That is a generous number. Real-world per-step reliability for non-trivial tasks is often closer to 80 percent. Now chain ten of those steps together. The end-to-end success rate is 0.85 to the tenth, which is roughly 20 percent.
That is the agent reliability cliff. Five steps holds up decently at 44 percent end-to-end. Ten steps collapses to 20 percent. Twenty steps is effectively zero. This is not a tooling problem or a model problem. It is the arithmetic.
Everyone who has shipped a multi-agent pipeline has hit this wall. You prototype a five-step flow and the demo video looks amazing. You add five more steps for the production edge cases and suddenly four out of five runs fail partway through. You blame the model, or the prompt, or Anthropic having a bad day, and you cannot figure out why it worked yesterday and it does not work today. The answer is usually that it was never really working. Your demo was sampling the 20 percent of runs that happened to succeed end-to-end.
This post is a rough map of what the field has converged on to fight this. The patterns are not new, but they are finally common vocabulary in production agent work.
Per-step reliability is a moving target. Here is what affects it.
Task complexity. An agent that summarizes a paragraph succeeds above 95 percent. An agent that writes a bug fix to a real codebase succeeds in the 60-80 percent range depending on the specificity of the bug. An agent that has to decide whether to refactor or patch is rarely above 70 percent.
Context pressure. Agents that operate near their context window limit start truncating, hallucinating, and dropping instructions. A prompt that works at 10k tokens of context may fail at 100k tokens on the same task.
Tool use surface area. Every tool the agent can call is a place where the agent can call the wrong tool, pass the wrong arguments, or interpret the response incorrectly. More tools means more error modes.
Environment variance. The exact same prompt produces different outputs run to run. Temperature is usually non-zero. Even with temperature zero, provider-side fluctuations matter.
The practical implication: optimistic planners assume 90 percent per step and build 15-step pipelines. Realistic planners assume 80 percent per step and cap chains at five steps, with verification gates between them. The realistic planners are the ones whose agents ship.
After two years of production agent work, six reliability patterns keep showing up. None of them are exotic. All of them are boring infrastructure-engineering instincts applied to agent pipelines.
The baseline. If a step fails, retry it. Exponential backoff between retries to avoid hammering a flaky upstream. Cap at three or four retries, then escalate.
Retry only works for transient failures. A hallucination in agent output is not a transient failure. It is often a deterministic failure that will reproduce on retry. So retry is necessary but insufficient.
The step produces output and also a verification pass. If verification fails, the step retries itself with the failure as context. This is the evaluator-optimizer loop in production.
The key is that the verification has to be executable. A test suite. A schema validator. A linter. A rubric evaluated by another LLM. If "verification" is the same LLM eye-balling its own output, you have not actually added verification. You have added self-congratulation.
This pattern works extraordinarily well for code generation. Generate, run the tests, loop if they fail, cap at three iterations. Empirically, a single retry on a test failure captures 60-70 percent of the cases where the first attempt was wrong.
Underused. If the error rate from a specific sub-agent exceeds a threshold over a rolling window, stop routing work to it and alert. Prevents cascade failures where one broken agent drags the whole system down.
Circuit breakers are the pattern that lets you run the pipeline overnight without waking up to a $400 API bill from an infinite retry loop.
Long-running agent work should save state after every major step. When a failure happens in step seven, you resume from the checkpoint before step seven, not from step one.
The state has to be external. Postgres, Redis, Convex, a flat JSON file. Anything but "in the agent's memory" because the agent's memory is what just failed.
Checkpointing changes the math. A 10-step chain with checkpoints and per-step retry is functionally a 10 x 2-step chain, which at 85 percent reliability is 72 percent end-to-end. Ten times more reliable than the same chain without checkpoints.
Sometimes the right move is to stop and escalate rather than retry. If the agent has tried three approaches and all failed, the fourth will probably fail too. Stop. Surface the failure to a human or a different specialist agent.
Early stopping is a cost control as much as a reliability control. The failure mode without early stopping is $50 of tokens burned on an agent stuck in a confusion loop.
The best agents know when to ask. The worst agents never ask. Every production pipeline needs a clear definition of the conditions under which the agent stops and requests human input rather than continuing to guess.
Common escalation triggers: the agent has made the same kind of mistake twice. The agent's confidence (if you can measure it) drops below a threshold. The next action crosses a blast-radius line the agent is not authorized to cross alone.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Apr 22, 2026 • 7 min read
Apr 22, 2026 • 7 min read
Apr 22, 2026 • 8 min read
Apr 22, 2026 • 7 min read
The reliability patterns above assume agents are stateless and state lives somewhere durable. This is the production winner and the debate is mostly over.
Agents load their state at the start of a run and save it at the end. Benefits: resumability, full auditability, parallel execution without race conditions, the ability to run the same agent on the same state from two different machines without corruption.
Message queue architectures (Redis Pub/Sub, SQS, Kafka) decouple agents for high-volume workloads. They are overkill for most agent pipelines, but if your workload spikes or needs durable semantics, Temporal or Inngest are the two orchestrators most agent teams pick.
Cost optimization and reliability are tied. A pipeline that costs $0.10 per run can tolerate 20 percent success rates if retries are free. A pipeline that costs $5 per run cannot.
The three cost levers that matter in production:
Model tier routing. Use the cheapest model that can handle each sub-task. Tiny models for routing and classification. Medium models for worker tasks. Large models only for planning and synthesis. A Claude Opus orchestrator with Haiku workers can cut costs by an order of magnitude versus using Opus throughout, with minimal quality loss on the worker side.
Prompt caching. Anthropic and OpenAI both support prompt caching. System prompts and tool definitions repeated across thousands of agent calls should hit 70 percent-plus cache rates. At current pricing that is a 60-90 percent cost reduction on cached tokens. If you are not using prompt caching in production, you are leaving money on the table.
Batch API. Both Anthropic and OpenAI offer 50 percent discounts for async batch jobs with 24-hour windows. Most agent pipelines have at least some batchable work: nightly report generation, bulk document analysis, overnight data enrichment. Run the non-urgent parts of your pipeline through the batch API and cut that cost line in half.
Put the whole picture together and a reliable production agent pipeline looks like this.
Five steps or fewer in any single chain. Each step has executable verification. Each step is checkpointed to durable state. Retries are bounded with exponential backoff. Circuit breakers guard expensive sub-agents. Unrecoverable failures escalate to a human via a defined trigger. The cheapest sufficient model runs each step. Prompt caching hits 70 percent-plus. Batch API absorbs anything that does not need to run now.
That is not exciting. It is plumbing. But it is plumbing that ships, and plumbing that does not ship is why 80 percent of demo agent pipelines never make it to production.
The field is converging on two answers to the reliability cliff.
The first is "smaller, better-scoped steps." If you cannot make a single step 95 percent reliable, break it into two steps that are each 97 percent reliable and run them with verification between.
The second is "specialized sub-models fine-tuned for narrow tasks." A sub-model trained specifically to produce JSON that passes a schema check is more reliable than a general model asked to produce JSON and hope. This is where agent-specialized fine-tuning is headed over the next eighteen months.
In the meantime, the boring answer is the right answer. Shorter chains. Executable verification. Durable state. Bounded retries. Circuit breakers. Escalation paths. Cheapest sufficient model. Caching. Batching.
Agent reliability is an infrastructure problem. Most of what looks like AI is just good plumbing.
The agent reliability cliff describes how per-step success rates compound exponentially across multi-step agent pipelines. If each step succeeds 85% of the time, a 10-step chain only succeeds about 20% end-to-end (0.85^10). The math is brutal: 5 steps holds at 44% success, 10 steps drops to 20%, and 20 steps is effectively zero. Most failed agent projects hit this wall without understanding the compound probability math.
Demo videos sample the runs that happened to succeed. A 10-step pipeline with 20% end-to-end success means four out of five runs fail partway through. In a demo, you re-record until you get a good run. In production, users see every failure. The pipeline was never really working - the demo was just statistical cherry-picking.
The six patterns that work: (1) Retry with exponential backoff for transient failures, (2) Self-healing loops with executable verification (tests, schema validators, linters), (3) Circuit breakers to stop cascade failures, (4) Checkpoint and resume to avoid restarting from scratch, (5) Early stopping after repeated failures, and (6) Human escalation when the agent should ask rather than guess. None are exotic - they are boring infrastructure engineering applied to agent pipelines.
Verification that runs code, not vibes. A test suite that passes or fails. A JSON schema validator. A linter. A rubric evaluated by a separate LLM. If verification is the same LLM reviewing its own output, that is not verification - it is self-congratulation. Executable verification is what makes self-healing loops actually heal.
Circuit breakers stop routing work to a sub-agent when its error rate exceeds a threshold over a rolling window. They prevent cascade failures where one broken agent drags the whole system down. More importantly, they prevent the $400 overnight API bill from an infinite retry loop. If you run agent pipelines unattended, circuit breakers are mandatory.
Checkpointing changes the math. A 10-step chain with checkpoints and per-step retry is functionally a series of 2-step chains (step + retry). At 85% reliability, that is 72% end-to-end instead of 20%. The key: state must be external (Postgres, Redis, Convex, flat file), not in the agent's memory, because the agent's memory is what just failed.
Realistic planners cap chains at 5 steps with verification gates between them. At 80% per-step reliability, 5 steps gives 33% end-to-end success. 10 steps gives 11%. Optimistic planners assume 90% and build 15-step pipelines that never ship. If you need more than 5 steps, add checkpoints and treat it as multiple 5-step chains with durable state handoffs.
Three levers: (1) Model tier routing - use tiny models for classification, medium models for workers, and large models only for planning. Claude Opus orchestrator + Haiku workers cuts costs 10x. (2) Prompt caching - system prompts and tool definitions should hit 70%+ cache rates for 60-90% cost reduction. (3) Batch API - 50% discount for async jobs. Cost and reliability are tied: a $0.10 pipeline can tolerate 20% success; a $5 pipeline cannot.
Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Multi-agent orchestration framework built on the OpenAI Agents SDK. Define agent roles, typed tools, and directional com...
View ToolAnthropic's Python SDK for building production agent systems. Tool use, guardrails, agent handoffs, and orchestration. R...
View ToolGives AI agents access to 250+ external tools (GitHub, Slack, Gmail, databases) with managed OAuth. Handles the auth and...
View ToolOpen-source AI orchestration framework by deepset. Modular pipelines for RAG, agents, semantic search, and multimodal ap...
View ToolWhat MCP servers are, how they work, and how to build your own in 5 minutes.
AI AgentsStep-by-step guide to building an MCP server in TypeScript - from project setup to tool definitions, resource handling, testing, and deployment.
AI AgentsA practical walk-through of how to design, write, and ship a Claude Code skill - from choosing when to trigger, through allowed-tools, to the steps the agent will actually follow.
Getting Started
From single-agent baselines to multi-level hierarchies, these are the seven patterns for wiring AI agents together in pr...

A new study from nrehiew quantifies a problem every Claude Code, Cursor, and Codex user has felt: models making huge dif...

How to use Claude Code's Task tool, custom sub-agents, and worktrees to run parallel development workflows. Real prompt...

AI agents use LLMs to complete multi-step tasks autonomously. Here is how they work and how to build them in TypeScript.

A practical guide to building AI agents with TypeScript using the Vercel AI SDK. Tool use, multi-step reasoning, and rea...

From swarms to pipelines - here are the patterns for coordinating multiple AI agents in TypeScript applications.

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.