
TL;DR
Five managed-agent providers, five pricing models, zero unified cost attribution. If you're running agents overnight, you need FinOps you don't have yet.
I woke up to a $437 bill from an agent I asked to write a TypeScript refactor.
For cost context, read AI Agents Explained: A TypeScript Developer's Guide alongside How to Build AI Agents in TypeScript; together they separate sticker price from the operational habits that make agent work expensive.
Last updated: May 25, 2026. Managed-agent features and pricing change quickly. Treat this as an operations guide, and confirm current billing behavior in the official sources below before setting policy.
Start with these primary references so you can validate pricing units and operational controls:
| Topic | Official source |
|---|---|
| Claude Managed Agents (overview) | Anthropic launch post |
| Claude Managed Agents (docs) | Managed agents docs |
| OpenAI Agents SDK (TypeScript docs) | openai-agents-js docs |
| OpenAI Agents SDK (TypeScript repo) | openai/openai-agents-js |
| Devin pricing | devin.ai/pricing |
| Cursor pricing | cursor.com/pricing |
| GitHub Copilot pricing | GitHub Copilot pricing |
| OpenTelemetry (tracing foundation) | OpenTelemetry |
The task was not small but it was not $437 big. Port a mid-sized service from a bespoke event emitter to a typed pub/sub primitive, update the tests, open a PR. I spun it up around 11pm, watched the first few steps, saw it start editing the right files, and went to bed. When I opened my laptop at 7am there was no PR. There was a still-running session, a loop counter past 200, and a billing dashboard that made me close the tab and open it again to make sure I was reading it right.
What actually happened was mundane. The agent hit a failing test, tried to fix it, broke a neighboring test, tried to fix that, went back to the first one, and kept oscillating. Each pass retried a web search. Each pass re-read the file tree. There was no per-session cap. There was no cross-provider ceiling. There was no kill switch attached to dollars. There was only the assumption, inherited from a decade of SaaS, that the bill at the end of the month would be roughly the bill I expected at the start of it. Managed agents broke that assumption and nobody has rebuilt it yet.
The managed-agent category just crystallized. Anthropic launched Claude Managed Agents with a pricing surface that combines model tokens with session-style runtime billing and tool surcharges. OpenAI's Agents SDK sits on standard API tokens plus hosted runtime costs in supported sandbox environments. That is two frontier providers, two pricing models, already incompatible before you leave the first tier. Confirm the current units and rates in the official sources above, and treat any blog-post number you see (including mine) as stale by default.
Past that tier it fractures more. Devin prices in ACUs, where one ACU is roughly 15 agent-minutes and plans scale from $20 pay-as-you-go to $500 for 250 ACUs. Cursor Background Agents ride on a Pro subscription with metered Max-mode overflow, and field reports are landing around $4.63 per "easy" PR. GitHub Copilot charges per-seat with 300 premium requests bundled and $0.04 for every overflow request. Replit has effort-based credits with Economy, Power, and Turbo tiers. Jules is free up to 15 tasks a day, then tiered. Factory is token-based with no per-seat floor.
Five pricing models (tokens, session-hour, per-task/ACU, per-seat with overflow, outcome-based) layered with infinite hybrids. No unified attribution. Every provider shows you their own dashboard, their own units, their own refresh cadence. When an overnight run goes sideways, piecing together what actually happened requires opening five browser tabs, correlating timestamps by hand, and trusting that each dashboard updated before you looked at it. The category grew up faster than its observability did, and FinOps is the thing that is missing.
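To make the incompatibility concrete, here is a minimal sketch of what normalizing those five billing shapes into dollars might look like. The `Usage` type and `toDollars` function are hypothetical names of mine, not any provider's API. Rates are passed in by the caller because published prices go stale; the only figures used are the Devin and Copilot numbers quoted above.

```typescript
// Hypothetical normalized usage record: one variant per billing shape.
// Rates are supplied by the caller; nothing here is a provider API.
type Usage =
  | { kind: "tokens"; inputTokens: number; outputTokens: number; inputPerM: number; outputPerM: number }
  | { kind: "session"; hours: number; dollarsPerHour: number }
  | { kind: "acu"; acus: number; dollarsPerAcu: number }
  | { kind: "seatOverflow"; requests: number; bundled: number; dollarsPerOverflow: number }
  | { kind: "outcome"; tasks: number; dollarsPerTask: number };

// Convert one usage record to dollars. Seat and subscription fees are
// deliberately left out: they are fixed costs, not per-task costs.
function toDollars(u: Usage): number {
  switch (u.kind) {
    case "tokens":
      return (u.inputTokens / 1e6) * u.inputPerM + (u.outputTokens / 1e6) * u.outputPerM;
    case "session":
      return u.hours * u.dollarsPerHour;
    case "acu":
      return u.acus * u.dollarsPerAcu;
    case "seatOverflow":
      // Only requests beyond the bundled allowance are billed.
      return Math.max(0, u.requests - u.bundled) * u.dollarsPerOverflow;
    case "outcome":
      return u.tasks * u.dollarsPerTask;
  }
}

// Figures from the text: $500 buys 250 ACUs, and Copilot bundles 300
// premium requests with $0.04 per overflow request. Usage counts are made up.
const overnight: Usage[] = [
  { kind: "acu", acus: 3, dollarsPerAcu: 500 / 250 },
  { kind: "seatOverflow", requests: 350, bundled: 300, dollarsPerOverflow: 0.04 },
];
const overnightDollars = overnight.map(toDollars).reduce((a, b) => a + b, 0);
console.log(overnightDollars.toFixed(2)); // prints "8.00"
```

The point of the discriminated union is that every provider's unit has to pass through one function before it can be summed, which is exactly the step no dashboard does for you today.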
In my own logs and in enough war stories from other teams to call it a pattern, three specific failure modes account for almost every overnight blowup.
1. Pathological loops. The agent retries the same broken test two hundred times. This is what happened to me. It usually starts with a test that fails for a reason the agent cannot see (an environment variable, a flaky external call, a test depending on order), and the agent interprets the red bar as a code problem. It edits, reruns, edits, reruns. With session-billed managed agents, an eight-hour loop can get expensive fast even before you add tool surcharges. Prevention: every agent run gets a max-iterations cap in the harness, every test failure that recurs more than three times triggers an escalation to a human, and every session gets a hard dollar ceiling that kills the process rather than "warns" the user.
2. Tool call explosion. The agent decides it needs context and calls web search in a hot loop. Claude Managed Agents bills web search at $10 per 1,000 queries. It takes one broken retry pattern to rack up 3,000 searches across a single task, which is $30 in tool calls on top of whatever the model tokens cost. I have seen a Cursor Background Agent run up 400 MCP calls in a session that should have needed six. Prevention: set per-tool quotas at the harness level (max 50 web searches per session, max 100 file reads, max 10 shell executions), log every tool call with its cost, and treat tool-call count as a first-class alert metric.
3. Context swapping. The agent keeps re-reading the same file tree. This one is quiet. Each pass through a task, the agent reloads the project structure, rereads five or ten files it has already seen, and pushes them into a fresh context window. On a 1M-context model like GPT-5.4, this is cheap in wall time but expensive in tokens because you are sending 300K input tokens per iteration and compaction does not always kick in the way you want. Ten iterations at 300K input tokens each is 3M tokens per session, and on GPT-5.3-Codex at $1.75 input per million that is about $5.25 per run, per agent, before you add output. Run five agents in parallel overnight and you have spent roughly $26 on file rereads. Prevention: force the harness to use compaction aggressively, cache file contents with hash keys across iterations, and instrument input-token growth per step so that a reread spike is visible before the bill is.
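The prevention steps for the first two failure modes can live in one small guard object inside the harness. This is a sketch under my own assumptions: `RunGuard` and its methods are hypothetical names, not part of any agent SDK, and the per-call dollar figures are estimates you supply yourself.

```typescript
// Hypothetical harness-side guard; none of these names come from a real SDK.
interface GuardLimits {
  maxIterations: number;               // hard cap on agent loop iterations
  maxRepeatFailures: number;           // same test may fail this many times before escalation
  maxDollars: number;                  // hard dollar ceiling for the session
  toolQuotas: Record<string, number>;  // per-session call limits keyed by tool name
}

class RunGuard {
  private readonly limits: GuardLimits;
  private iterations = 0;
  private dollars = 0;
  private readonly failures = new Map<string, number>();
  private readonly toolCalls = new Map<string, number>();

  constructor(limits: GuardLimits) {
    this.limits = limits;
  }

  // Call once at the top of every agent loop iteration.
  step(): void {
    this.iterations += 1;
    if (this.iterations > this.limits.maxIterations) {
      throw new Error(`iteration cap of ${this.limits.maxIterations} exceeded`);
    }
  }

  // Call before each tool invocation with your own cost estimate for it.
  toolCall(tool: string, estimatedDollars: number): void {
    const count = (this.toolCalls.get(tool) ?? 0) + 1;
    this.toolCalls.set(tool, count);
    const quota = this.limits.toolQuotas[tool];
    if (quota !== undefined && count > quota) {
      throw new Error(`${tool} quota of ${quota} calls exceeded`);
    }
    this.spend(estimatedDollars);
  }

  // Record any spend (tokens, session time, tools) and enforce the ceiling.
  spend(dollars: number): void {
    this.dollars += dollars;
    if (this.dollars > this.limits.maxDollars) {
      throw new Error(`dollar ceiling of $${this.limits.maxDollars} exceeded`);
    }
  }

  // Call whenever a test fails; a recurring failure escalates instead of retrying.
  testFailed(testId: string): void {
    const count = (this.failures.get(testId) ?? 0) + 1;
    this.failures.set(testId, count);
    if (count > this.limits.maxRepeatFailures) {
      throw new Error(`${testId} failed ${count} times; escalate to a human`);
    }
  }
}

// Limits taken from the text; maxIterations and maxDollars are my own picks.
const guard = new RunGuard({
  maxIterations: 40,
  maxRepeatFailures: 3,
  maxDollars: 10,
  toolQuotas: { web_search: 50, file_read: 100, shell: 10 },
});
guard.step();
guard.toolCall("web_search", 0.01); // $10 per 1,000 queries is $0.01 per call
```

Every limit throws rather than logs, which is deliberate: the harness should catch the error, end the session, and page someone, because a warning nobody reads at 3am is not a cap.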
No one has cross-provider cost attribution. Say that again with the weight it deserves. You can go to Anthropic's dashboard and see tokens and session hours. You can go to OpenAI's dashboard and see tokens and sandbox storage. You can go to Devin and see ACUs. You can go to Cursor and see Max-mode overflow. You cannot go to any dashboard today and see a single task called "refactor the event emitter" with the total cost across the two frontier providers, the one sandbox vendor, and the one web-search tool it touched. That span does not exist.
What you need is straightforward to describe and has been refused by the market for eighteen months. You need unified spans tagged per-agent, per-user, per-task, with model tokens, session time, tool calls, and dollar cost attached to each span. You need parent-child span relationships so that a task like "run the test suite" groups its twenty tool calls under a single parent, and a larger task like "ship this PR" groups the test-suite span under itself. You need the ability to filter by provider, by model, by user, and by task and get the exact dollar number out.
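As a rough illustration of that shape, here is a sketch of cost-tagged spans with parent-child rollup and per-dimension totals. The `AgentSpan` fields, the function names, and the sample provider names and dollar amounts are all placeholders of mine; no provider emits this record today.

```typescript
// Hypothetical span record carrying the attribution tags described above.
interface AgentSpan {
  id: string;
  parentId?: string;   // undefined for a root span
  name: string;
  provider: string;
  model: string;
  user: string;
  task: string;
  dollars: number;     // direct cost of this span only, excluding children
}

// Total cost of a span plus all of its descendants.
// Assumes the spans form a tree (no cycles).
function rollup(spans: AgentSpan[], rootId: string): number {
  const children = new Map<string, AgentSpan[]>();
  for (const s of spans) {
    if (s.parentId === undefined) continue;
    const siblings = children.get(s.parentId) ?? [];
    siblings.push(s);
    children.set(s.parentId, siblings);
  }
  const root = spans.find((s) => s.id === rootId);
  if (!root) return 0;
  let total = 0;
  const stack: AgentSpan[] = [root];
  while (stack.length > 0) {
    const span = stack.pop() as AgentSpan;
    total += span.dollars;
    stack.push(...(children.get(span.id) ?? []));
  }
  return total;
}

// Sum direct span cost grouped by one attribution dimension.
function totalBy(spans: AgentSpan[], key: "provider" | "model" | "user" | "task"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const s of spans) {
    totals.set(s[key], (totals.get(s[key]) ?? 0) + s.dollars);
  }
  return totals;
}

// Placeholder data: provider names, model names, and costs are invented.
const shared = { user: "dev", task: "refactor the event emitter" };
const spans: AgentSpan[] = [
  { id: "pr", name: "ship this PR", provider: "provider-a", model: "model-x", dollars: 1.5, ...shared },
  { id: "tests", parentId: "pr", name: "run the test suite", provider: "provider-a", model: "model-x", dollars: 0.25, ...shared },
  { id: "search", parentId: "tests", name: "web search", provider: "search-tool", model: "none", dollars: 0.01, ...shared },
  { id: "sandbox", parentId: "tests", name: "sandbox exec", provider: "sandbox-vendor", model: "none", dollars: 0.24, ...shared },
];
console.log(rollup(spans, "pr"), totalBy(spans, "provider"));
```

Keeping `dollars` as the span's direct cost, and computing rollups on read, avoids double counting when the same spans are sliced by provider in one view and by task in another.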
This is the OpenTelemetry trace model. OTel already has the vocabulary: traces, spans, resource attributes, semantic conventions. The GenAI semantic conventions already define gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. What is missing is adoption by managed-agent providers, a coherent cost enrichment layer sitting on top of those spans, and a receiver that holds the data close enough to you that you can ask questions across providers without waiting for each vendor's BI team to ship a feature.
DD Traces was built for exactly this gap. Today it does local OTel trace collection. The minimal OTLP receiver shipped last week and already ingests spans from Claude Code, Cursor, Codex, Augment, Gemini, and Kimi sessions running on your machine. You point an OTLP exporter at localhost:4318, you run your agents, and you get a unified view of what ran where.
Cost attribution is the next step and I am not going to pretend it is done. The piece that exists is the trace plumbing and the cross-tool schema. The piece that is TODO is the enrichment layer that reads a span's provider and model, looks up the current pricing, multiplies the tokens and tool calls, and attaches a dollar figure as a span attribute. That layer is in the roadmap, not in main, and the honest framing is: we have the pipes, we have the schema, we do not yet have the pricing table plugged in or the budget-ceiling kill switch wired up.
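To be clear about what that TODO amounts to, here is a minimal sketch of an enrichment step, not the DD Traces implementation. It reads the GenAI attributes named earlier, looks up a price, and attaches a `dollars_spent` attribute. The `enrich` function, the `PriceTable` shape, and the `agent.tool_calls.<tool>` attribute are my assumptions for illustration; the last one is not part of the OTel GenAI conventions. The sample prices reuse the $1.75 input rate and $10 per 1,000 searches quoted above, and the output rate is a placeholder.

```typescript
type Attributes = Record<string, string | number>;

interface Price {
  inputPerM: number;                     // dollars per million input tokens
  outputPerM: number;                    // dollars per million output tokens
  perToolCall: Record<string, number>;   // dollars per call, keyed by tool name
}

// Keyed by "<gen_ai.system>/<gen_ai.request.model>".
type PriceTable = Record<string, Price>;

// Return a copy of the span attributes with dollars_spent attached.
// Spans with unknown pricing are returned untouched rather than guessed at.
function enrich(attrs: Attributes, table: PriceTable): Attributes {
  const key = `${attrs["gen_ai.system"]}/${attrs["gen_ai.request.model"]}`;
  const price = table[key];
  if (price === undefined) return attrs;

  // Read a numeric attribute, treating missing or non-numeric values as 0.
  const num = (name: string): number => {
    const value = attrs[name];
    return typeof value === "number" ? value : 0;
  };

  let dollars =
    (num("gen_ai.usage.input_tokens") / 1e6) * price.inputPerM +
    (num("gen_ai.usage.output_tokens") / 1e6) * price.outputPerM;

  // Hypothetical attribute: a per-tool call count recorded on the span.
  for (const [tool, perCall] of Object.entries(price.perToolCall)) {
    dollars += num(`agent.tool_calls.${tool}`) * perCall;
  }
  return { ...attrs, dollars_spent: dollars };
}

// Placeholder table: provider and model names are invented.
const prices: PriceTable = {
  "example-provider/example-model": {
    inputPerM: 1.75,
    outputPerM: 10, // placeholder, not a quoted rate
    perToolCall: { web_search: 0.01 },
  },
};

const enriched = enrich(
  {
    "gen_ai.system": "example-provider",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 3_000_000,
    "gen_ai.usage.output_tokens": 0,
    "agent.tool_calls.web_search": 3000,
  },
  prices,
);
console.log(enriched["dollars_spent"]);
```

The hard part is not this function. It is keeping the price table current and deciding what to do with units like session hours and ACUs that never show up as token counts on a span.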
Where we are thinking next: a pricing table for Claude Managed Agents, OpenAI Agents SDK, Devin, Cursor, and Copilot that updates weekly. A dollars_spent attribute enriched onto every span at ingest time. A dashboard view that filters by provider, model, user, and task. A budget alert that fires when a single trace crosses a ceiling, with a follow-on kill hook that can signal the harness to stop. That is what FinOps for managed agents looks like and it is what DD Traces is building toward.
If you want to watch that happen, the project is at traces.developersdigest.tech and the receiver is MIT licensed. Opinions on semantic conventions, pricing sources, and kill-hook design are welcome and frankly necessary.
You do not need DD Traces to protect yourself this week. Three concrete things, in order.
1. Set hard dollar caps in each provider's UI. Anthropic, OpenAI, Cursor, Devin, Copilot, and Replit all have usage limits under billing settings. Set them now, before the next overnight run. A cap that lives in the provider dashboard will not save you from bad code but it will save you from the worst version of "forgot to set a cap." Before you set the cap, run a quick estimate through our AI cost calculator so the number you pick is grounded in actual token math rather than a guess. Set it per-workspace, set it per-key, and set it aggressively (2x your expected monthly spend is fine, 10x is not).
2. Run agents with OTel tracing turned on. Every major agent framework now supports OTLP export: Claude Agent SDK, OpenAI Agents SDK, LangGraph, Mastra, Vercel AI SDK. Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and send spans to any collector (DD Traces, Langfuse local, a plain file sink). Even without cost enrichment, having the traces on disk means you can reconstruct what happened when the bill arrives.
3. Write a $10 sentinel loop. The top action, the one that matters most if you do nothing else tonight: a shell script or cron job that polls each provider's usage API every 5 minutes, sums the spend since session start, and kills the agent process if it crosses a threshold you set. Ten dollars is a good first number because it is high enough to let real work finish and low enough that a runaway loop gets caught in the first hour. It is ugly and it works and it has saved me from another $437 morning more than once.
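If you would rather keep the sentinel in the same language as your agents, the loop from step 3 might look like the sketch below. Everything here is an assumption on my part: `SpendFetcher` stands in for whatever wrapper you write around each provider's usage endpoint (I am not showing real usage-API calls because they differ per provider), and `kill` is whatever stops your agent.

```typescript
// A fetcher returns dollars spent since session start for one provider.
// Hypothetical: you write one of these per provider usage API.
type SpendFetcher = () => Promise<number>;

interface SentinelOptions {
  fetchers: SpendFetcher[];
  ceilingDollars: number;                  // e.g. 10
  intervalMs: number;                      // e.g. 5 * 60 * 1000
  kill: () => void;                        // stops the agent process
  sleep?: (ms: number) => Promise<void>;   // injectable for testing
}

// True once combined spend across providers reaches the ceiling.
function overCeiling(spends: number[], ceilingDollars: number): boolean {
  return spends.reduce((a, b) => a + b, 0) >= ceilingDollars;
}

// Poll every provider, sum the spend, and kill the agent at the ceiling.
// Resolves with the total spend observed when the kill fired.
// Error handling is omitted: a fetcher that rejects ends the loop.
async function runSentinel(opts: SentinelOptions): Promise<number> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (;;) {
    const spends = await Promise.all(opts.fetchers.map((f) => f()));
    if (overCeiling(spends, opts.ceilingDollars)) {
      opts.kill();
      return spends.reduce((a, b) => a + b, 0);
    }
    await sleep(opts.intervalMs);
  }
}
```

In Node, `kill` could be a call to `process.kill` with the agent's pid. Before trusting this overnight, decide what a failed usage fetch should mean; I would lean toward killing the agent after a few consecutive failures, since a sentinel that silently dies is worse than no sentinel.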
The managed-agent category is two weeks old in its current shape and the FinOps layer under it is weeks behind. Cost attribution across Claude, OpenAI, Devin, and Cursor is the next thing that has to exist for any serious team to run agents overnight without a finance conversation in the morning. DD Traces is one attempt at it. If you are running agents and want the trace layer today, grab it at traces.developersdigest.tech. If you are picking which managed agent to run in the first place, the comparison lives at agenthub.developersdigest.tech.
The $437 bill was a cheap lesson. The next one will not be.
AI agents operate in loops - reason, act, observe, repeat. When an agent encounters an error it cannot resolve (like a flaky test or missing environment variable), it may retry the same failing step hundreds of times. Each retry consumes tokens, triggers tool calls, and accumulates session time. Without hard caps, this loop runs until you notice it or until a provider limit kicks in. The problem is compounded by multiple cost dimensions: model tokens, session hours, tool calls, and sandbox compute all bill separately.
Three immediate actions: First, set hard dollar caps in each provider's billing settings (Anthropic, OpenAI, Cursor, Devin - all have usage limits). Second, implement max-iteration caps in your agent harness so that any agent stops after a defined number of steps. Third, set per-tool quotas (max 50 web searches per session, max 100 file reads) and treat tool-call counts as alert metrics. A simple shell script that polls usage APIs every 5 minutes and kills processes at a threshold ($10 is a reasonable starting point) catches runaway loops within the first hour.
There is no unified pricing model across managed-agent providers. Some providers combine token billing with session-hour style runtime costs and tool-call surcharges. Others use tokens plus sandbox runtime usage. Others price per seat, per task credits, or "agent compute units." Each provider has its own dashboard, its own units, and its own billing cycle - there is no cross-provider cost attribution yet. Use the official sources above for the current unit definitions, then build your own normalized cost model on top of traces.
FinOps (Financial Operations) for AI agents is the practice of monitoring, attributing, and controlling the costs of running autonomous AI systems. It includes unified cost tracking across providers, per-task and per-user attribution, budget alerts, and automated kill switches when spending thresholds are exceeded. The OpenTelemetry trace model provides the foundation - traces, spans, and resource attributes can carry cost data if providers adopt the semantic conventions and teams build enrichment layers that convert tokens and tool calls into dollar figures.
Costs vary dramatically based on task complexity and failure modes. Field reports show Cursor Background Agents averaging $4-5 per "easy" PR. Claude Managed Agents with Opus 4.6 can run $5-25 per session depending on token usage and tool calls. A pathological loop that retries 200 times can easily reach $200-400 in a single overnight session. The unpredictability is the problem - the same task can cost $5 or $50 depending on whether the agent hits an error it cannot escape.
OpenTelemetry (OTel) is a vendor-neutral observability standard for traces, metrics, and logs. For AI agents, it provides a way to track every model call, tool invocation, and session as spans within a trace. The GenAI semantic conventions already define attributes like gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. With OTel tracing enabled (supported by Claude Agent SDK, OpenAI Agents SDK, LangGraph, Vercel AI SDK), you can reconstruct what happened in any session. Cost enrichment layers can then multiply tokens by pricing to attach dollar figures to each span.
Which pricing model is cheapest depends on your usage pattern. For high-volume, short tasks, per-seat models like GitHub Copilot may be cheapest. For complex, long-running tasks, token-based pricing with aggressive caching (like OpenAI Agents SDK) can work well. For predictable workloads, ACU-based models like Devin offer cost certainty. The absence of cross-provider attribution makes comparison difficult - the only way to know is to run parallel experiments with cost tracking enabled and measure actual spend per task type.
Today, you cannot see total agent spend across providers from any single dashboard. You must open each provider's billing page, correlate timestamps manually, and sum the costs yourself. Tools like DD Traces are building cross-provider cost attribution using OpenTelemetry spans, but the enrichment layer that converts tokens and tool calls to dollars is still in development. The interim solution is to enable OTel tracing on all agents, send spans to a local collector, and build your own cost lookup table until the tooling matures.