
TL;DR
Five managed-agent providers, five pricing models, zero unified cost attribution. If you're running agents overnight, you need FinOps you don't have yet.
I woke up to a $437 bill from an agent I asked to write a TypeScript refactor.
The task was not small but it was not $437 big. Port a mid-sized service from a bespoke event emitter to a typed pub/sub primitive, update the tests, open a PR. I spun it up around 11pm, watched the first few steps, saw it start editing the right files, and went to bed. When I opened my laptop at 7am there was no PR. There was a still-running session, a loop counter past 200, and a billing dashboard that made me close the tab and open it again to make sure I was reading it right.
What actually happened was mundane. The agent hit a failing test, tried to fix it, broke a neighboring test, tried to fix that, went back to the first one, and kept oscillating. Each pass retried a web search. Each pass re-read the file tree. There was no per-session cap. There was no cross-provider ceiling. There was no kill switch attached to dollars. There was only the assumption, inherited from a decade of SaaS, that the bill at the end of the month would be roughly the bill I expected at the start of it. Managed agents broke that assumption and nobody has rebuilt it yet.
The managed-agent category just crystallized. Anthropic launched Claude Managed Agents on April 9, 2026 with a pricing surface that combines model tokens, a session-hour charge of $0.08/hour, and tool surcharges on top (web search is billed at $10 per 1,000 calls). Six days later, OpenAI shipped its Agents SDK update with a different shape: standard API tokens plus $0.03/GB for hosted sandbox sessions, with sandbox providers ranging across Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, and Vercel. That is two frontier providers, two pricing models, already incompatible before you leave the first tier.
Past that tier it fractures more. Devin prices in ACUs, where one ACU is roughly 15 agent-minutes and plans scale from $20 pay-as-you-go to $500 for 250 ACUs. Cursor Background Agents ride on a Pro subscription with metered Max-mode overflow, and field reports are landing around $4.63 per "easy" PR. GitHub Copilot charges per-seat with 300 premium requests bundled and $0.04 for every overflow request. Replit has effort-based credits with Economy, Power, and Turbo tiers. Jules is free up to 15 tasks a day, then tiered. Factory is token-based with no per-seat floor.
Five pricing models (tokens, session-hour, per-task/ACU, per-seat with overflow, outcome-based) layered with infinite hybrids. No unified attribution. Every provider shows you their own dashboard, their own units, their own refresh cadence. When an overnight run goes sideways, piecing together what actually happened requires opening five browser tabs, correlating timestamps by hand, and trusting that each dashboard updated before you looked at it. The category grew up faster than its observability did, and FinOps is the thing that is missing.
In my own logs and in enough war stories from other teams to call it a pattern, three specific failure modes account for almost every overnight blowup.
1. Pathological loops. The agent retries the same broken test two hundred times. This is what happened to me. It usually starts with a test that fails for a reason the agent cannot see (an environment variable, a flaky external call, a test depending on order), and the agent interprets the red bar as a code problem. It edits, reruns, edits, reruns. On Claude Managed Agents at $0.08/session-hour plus Opus 4.6 tokens at $5 input and $25 output per million, eight hours of this can clear $200 on tokens alone before you count the session fee. Prevention: every agent run gets a max-iterations cap in the harness, every test failure that recurs more than three times triggers an escalation to a human, and every session gets a hard dollar ceiling that kills the process rather than "warns" the user.
2. Tool call explosion. The agent decides it needs context and calls web search in a hot loop. Claude Managed Agents bills web search at $10 per 1,000 queries. It takes one broken retry pattern to rack up 3,000 searches across a single task, which is $30 in tool calls on top of whatever the model tokens cost. I have seen a Cursor Background Agent run up 400 MCP calls in a session that should have needed six. Prevention: set per-tool quotas at the harness level (max 50 web searches per session, max 100 file reads, max 10 shell executions), log every tool call with its cost, and treat tool-call count as a first-class alert metric.
3. Context swapping. The agent keeps re-reading the same file tree. This one is quiet. Each pass through a task, the agent reloads the project structure, rereads five or ten files it has already seen, and pushes them into a fresh context window. On a 1M-context model like GPT-5.4, this is cheap in wall time but expensive in tokens because you are sending 300K input tokens per iteration and compaction does not always kick in the way you want. Ten iterations at 300K input tokens each is 3M tokens per session, and on GPT-5.3-Codex at $1.75 input per million that is $5 per run, per agent, before you add output. Run five agents in parallel overnight and you spent $25 on file rereads. Prevention: force the harness to use compaction aggressively, cache file contents with hash keys across iterations, and instrument input-token growth per step so that a reread spike is visible before the bill is.
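The three prevention lists above share one shape: counters that the harness checks before the agent's next step, and a hard failure instead of a warning when one trips. A minimal sketch in Python — the class name, the quota numbers, and the method surface are illustrative, not a real library:

```python
import hashlib

class BudgetExceeded(Exception):
    """Raised when a run crosses its iteration, quota, or dollar ceiling."""

class HarnessGuard:
    def __init__(self, max_iterations=50, dollar_ceiling=10.0, tool_quotas=None):
        self.max_iterations = max_iterations
        self.dollar_ceiling = dollar_ceiling
        # Per-session tool quotas; the defaults mirror the numbers above.
        self.tool_quotas = tool_quotas or {
            "web_search": 50, "file_read": 100, "shell_exec": 10,
        }
        self.iterations = 0
        self.spend = 0.0
        self.tool_counts = {}
        self.failure_counts = {}  # test name -> recurring-failure count
        self.file_hashes = {}     # path -> last content hash seen

    def start_iteration(self):
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"iteration cap {self.max_iterations} hit")

    def record_tool_call(self, tool, cost=0.0):
        n = self.tool_counts.get(tool, 0) + 1
        self.tool_counts[tool] = n
        quota = self.tool_quotas.get(tool)
        if quota is not None and n > quota:
            raise BudgetExceeded(f"{tool} quota {quota} exceeded")
        self.add_spend(cost)

    def add_spend(self, dollars):
        self.spend += dollars
        if self.spend > self.dollar_ceiling:
            raise BudgetExceeded(f"${self.spend:.2f} over ceiling")

    def record_test_failure(self, test_name, escalate_after=3):
        """Returns True when a recurring failure should go to a human."""
        n = self.failure_counts.get(test_name, 0) + 1
        self.failure_counts[test_name] = n
        return n >= escalate_after

    def file_changed(self, path, text):
        """Track content hashes so unchanged files are not re-sent
        into the context window on the next iteration."""
        key = hashlib.sha256(text.encode()).hexdigest()
        changed = self.file_hashes.get(path) != key
        self.file_hashes[path] = key
        return changed
```

The harness calls start_iteration() at the top of each loop and record_tool_call() before dispatching a tool. Everything raises rather than warns, which is the whole point: a warning at 3am protects nobody.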
No one has cross-provider cost attribution. Say that again with the weight it deserves. You can go to Anthropic's dashboard and see tokens and session hours. You can go to OpenAI's dashboard and see tokens and sandbox storage. You can go to Devin and see ACUs. You can go to Cursor and see Max-mode overflow. You cannot go to any dashboard today and see a single task called "refactor the event emitter" with the total cost across the two frontier providers, the one sandbox vendor, and the one web-search tool it touched. That span does not exist.
What you need is straightforward to describe, and for eighteen months the market has declined to build it. You need unified spans tagged per-agent, per-user, per-task, with model tokens, session time, tool calls, and dollar cost attached to each span. You need parent-child span relationships so that a task like "run the test suite" groups its twenty tool calls under a single parent, and a larger task like "ship this PR" groups the test-suite span under itself. You need the ability to filter by provider, by model, by user, and by task and get the exact dollar number out.
This is the OpenTelemetry trace model. OTel already has the vocabulary: traces, spans, resource attributes, semantic conventions. The GenAI semantic conventions already define gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. What is missing is adoption by managed-agent providers, a coherent cost enrichment layer sitting on top of those spans, and a receiver that holds the data close enough to you that you can ask questions across providers without waiting for each vendor's BI team to ship a feature.
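Concretely, a task trace under those conventions could look like the sketch below — plain dicts in roughly the shape spans take on the wire. The gen_ai.* attribute names are from the published GenAI conventions; the task.id and user.id names are assumptions, since per-task attribution conventions are exactly the part that does not exist yet:

```python
import uuid

def make_span(name, parent=None, **attributes):
    """Build a minimal span record: same trace_id as the parent,
    fresh span_id, parent link, and a flat attribute map."""
    return {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": attributes,
    }

# "ship this PR" is the parent task; the test-suite run nests under it,
# and the model call nests under that, carrying the GenAI conventions.
pr_task = make_span("ship-this-pr",
                    **{"task.id": "refactor-event-emitter", "user.id": "dev-1"})
test_run = make_span("run-test-suite", parent=pr_task)
llm_call = make_span("llm-call", parent=test_run, **{
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-opus-4",
    "gen_ai.usage.input_tokens": 300_000,
    "gen_ai.usage.output_tokens": 1_200,
})

# One trace id ties the whole task together across every nested call.
assert llm_call["trace_id"] == pr_task["trace_id"]
```

Filtering by provider, model, user, or task is then just a predicate over attribute maps — the hard part is getting every provider to emit spans in the first place.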
DD Traces was built for exactly this gap. Today it does local OTel trace collection. The minimal OTLP receiver shipped last week and already ingests spans from Claude Code, Cursor, Codex, Augment, Gemini, and Kimi sessions running on your machine. You point an OTLP exporter at localhost:4318, you run your agents, and you get a unified view of what ran where.
Cost attribution is the next step and I am not going to pretend it is done. The piece that exists is the trace plumbing and the cross-tool schema. The piece that is TODO is the enrichment layer that reads a span's provider and model, looks up the current pricing, multiplies the tokens and tool calls, and attaches a dollar figure as a span attribute. That layer is in the roadmap, not in main, and the honest framing is: we have the pipes, we have the schema, we do not yet have the pricing table plugged in or the budget-ceiling kill switch wired up.
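For illustration, here is roughly what that enrichment layer would do. The dollar figures are the ones quoted earlier in this post; the table keys and span shape are assumptions, and none of this is shipped code:

```python
# Illustrative pricing table; these are the rates quoted in this post
# and a real version would need to refresh as providers change them.
PRICING = {
    ("anthropic", "claude-opus-4.6"): {
        "input_per_m": 5.00,        # $ per million input tokens
        "output_per_m": 25.00,      # $ per million output tokens
        "session_per_hour": 0.08,   # managed-agent session fee
        "web_search_per_k": 10.00,  # $ per 1,000 web searches
    },
}

def enrich_span(span):
    """Look up the span's provider and model, multiply out tokens,
    session time, and tool calls, and attach dollars_spent."""
    a = span["attributes"]
    key = (a.get("gen_ai.system"), a.get("gen_ai.request.model"))
    prices = PRICING.get(key)
    if prices is None:
        return span  # unknown provider/model: leave the span unenriched
    dollars = (
        a.get("gen_ai.usage.input_tokens", 0) / 1e6 * prices["input_per_m"]
        + a.get("gen_ai.usage.output_tokens", 0) / 1e6 * prices["output_per_m"]
        + a.get("session.hours", 0) * prices["session_per_hour"]
        + a.get("tool.web_search.calls", 0) / 1000 * prices["web_search_per_k"]
    )
    a["dollars_spent"] = round(dollars, 4)
    return span
```

Run that at ingest time and every span carries its own price tag; roll the attribute up the parent-child tree and a task has a total.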
Where we are thinking next: a pricing table for Claude Managed Agents, OpenAI Agents SDK, Devin, Cursor, and Copilot that updates weekly. A dollars_spent attribute enriched onto every span at ingest time. A dashboard view that filters by provider, model, user, and task. A budget alert that fires when a single trace crosses a ceiling, with a follow-on kill hook that can signal the harness to stop. That is what FinOps for managed agents looks like and it is what DD Traces is building toward.
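The budget-alert piece is then a fold over those same spans: group by trace, sum dollars_spent, and call the kill hook for any trace over its ceiling. A sketch, again against an assumed dict-span shape rather than anything in main:

```python
def check_trace_budget(spans, ceiling, kill_hook):
    """Sum dollars_spent per trace and signal the harness to stop
    any trace whose total crosses the ceiling."""
    totals = {}
    for span in spans:
        trace_id = span["trace_id"]
        cost = span["attributes"].get("dollars_spent", 0.0)
        totals[trace_id] = totals.get(trace_id, 0.0) + cost
    tripped = []
    for trace_id, total in totals.items():
        if total > ceiling:
            kill_hook(trace_id, total)  # e.g. POST a stop signal to the harness
            tripped.append(trace_id)
    return tripped
```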
If you want to watch that happen, the project is at traces.developersdigest.tech and the receiver is MIT licensed. Opinions on semantic conventions, pricing sources, and kill-hook design are welcome and frankly necessary.
You do not need DD Traces to protect yourself this week. Three concrete things, in order.
1. Set hard dollar caps in each provider's UI. Anthropic, OpenAI, Cursor, Devin, Copilot, and Replit all have usage limits under billing settings. Set them now, before the next overnight run. A cap that lives in the provider dashboard will not save you from bad code but it will save you from the worst version of "forgot to set a cap." Set it per-workspace, set it per-key, set it aggressive (2x your expected monthly spend is fine; 10x is not).
2. Run agents with OTel tracing turned on. Every major agent framework now supports OTLP export: Claude Agent SDK, OpenAI Agents SDK, LangGraph, Mastra, Vercel AI SDK. Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and send spans to any collector (DD Traces, Langfuse local, a plain file sink). Even without cost enrichment, having the traces on disk means you can reconstruct what happened when the bill arrives.
3. Write a $10 sentinel loop. The top action, the one that matters most if you do nothing else tonight: a shell script or cron job that polls each provider's usage API every 5 minutes, sums the spend since session start, and kills the agent process if it crosses a threshold you set. Ten dollars is a good first number because it is high enough to let real work finish and low enough that a runaway loop gets caught in the first hour. It is ugly and it works and it has saved me from another $437 morning more than once.
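A sketch of that sentinel in Python — each provider's usage API has its own shape, so the fetch functions here are stand-ins you would write per provider, and the kill action is injected so the same loop works whether you signal a local process or call a provider's stop endpoint:

```python
import os
import signal
import time

def sentinel(fetch_spend_fns, kill, ceiling=10.0, interval=300,
             sleep=time.sleep):
    """Poll total spend every `interval` seconds; fire `kill` and stop
    once the sum crosses `ceiling` dollars.

    fetch_spend_fns: one callable per provider, each returning dollars
    spent since session start -- stand-ins for real usage APIs.
    """
    while True:
        total = sum(fn() for fn in fetch_spend_fns)
        if total >= ceiling:
            kill()
            return total
        sleep(interval)

# Wiring it to a local agent process might look like:
#   sentinel(provider_fns, kill=lambda: os.kill(agent_pid, signal.SIGTERM))
```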
The managed-agent category is two weeks old in its current shape and the FinOps layer under it is weeks behind. Cost attribution across Claude, OpenAI, Devin, and Cursor is the next thing that has to exist for any serious team to run agents overnight without a finance conversation in the morning. DD Traces is one attempt at it. If you are running agents and want the trace layer today, grab it at traces.developersdigest.tech. If you are picking which managed agent to run in the first place, the comparison lives at agenthub.developersdigest.tech.
The $437 bill was a cheap lesson. The next one will not be.