
TL;DR
A new arXiv paper shows coding agents can pass loose backend tasks, then fall apart when architecture, database, and ORM constraints pile up. The fix is not longer markdown. It is executable constraints.
Coding agents are getting good at building the first version of a backend. They are still much worse at obeying the boring rules that make the backend production software.
That is the useful lesson from the new arXiv paper Constraint Decay: The Fragility of LLM Agents in Backend Code Generation. It landed on Hacker News today and drew a long thread, because it names a failure pattern a lot of engineers have felt but not measured cleanly.
The paper's setup is simple: keep the API contract fixed, then layer on structural constraints. Same behavior, more rules. Framework choice. Architecture. Database. ORM. The agent still has to ship a working REST API, but now it has to respect the shape of the system too.
That distinction matters. Most AI coding demos reward "the app runs." Production code also asks "did it use the right layer, the right data boundary, the right database contract, and the right framework idiom?"
If you have been following the DevDigest reliability thread, this sits right next to our earlier posts on the agent reliability cliff, why long-running agents need harnesses, and agent architecture for multi-step workflows. Agent failure is rarely one thing. It is usually the compounding gap between intent, context, tools, tests, and review.
The paper was submitted on May 7, 2026 by Francesco Dente, Dario Satriani, and Paolo Papotti. It evaluates multi-file backend generation across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks.
The authors use a dual evaluation: end-to-end behavioral tests for the API, plus static verifiers for structural rules. That second part is the reason this paper is worth writing about. It does not stop at "did the endpoint return 200?" It asks whether the implementation honored the architecture, database, and ORM constraints the task required.
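To make the dual-evaluation idea concrete, here is a minimal sketch of how a behavioral result and a structural check can be combined into one verdict. It is not the paper's verifier: the names `DualResult`, `forbidden_imports`, and `evaluate` are hypothetical, and the structural rule shown (no `sqlite3` import when the task requires PostgreSQL) is just one assumed example of a constraint.

```python
import ast
from dataclasses import dataclass


@dataclass
class DualResult:
    behavior_ok: bool       # did the end-to-end API tests pass?
    structure_ok: bool      # did the static verifier find no violations?
    violations: list[str]

    @property
    def passed(self) -> bool:
        # A task only counts as solved when both views agree.
        return self.behavior_ok and self.structure_ok


def forbidden_imports(source: str, forbidden: set[str]) -> list[str]:
    """Return the forbidden top-level modules that `source` imports."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.append(root)
    return found


def evaluate(source: str, behavior_ok: bool, forbidden: set[str]) -> DualResult:
    violations = forbidden_imports(source, forbidden)
    return DualResult(behavior_ok, not violations, violations)


# Hypothetical agent output: the endpoint works, but it reaches for SQLite
# even though the task asked for PostgreSQL.
generated = "import sqlite3\n\ndef get_user(user_id):\n    return {'id': user_id}\n"
result = evaluate(generated, behavior_ok=True, forbidden={"sqlite3"})
print(result.passed, result.violations)  # False ['sqlite3']
```

The point of the sketch is the `passed` property: a green behavioral suite alone is not enough to mark the task as done.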
Their headline finding: capable configurations lose about 30 percentage points on average in assertion pass rate when moving from unconstrained baseline tasks to fully specified tasks. The paper also reports framework sensitivity. Agents do better in minimal, explicit frameworks like Flask and worse on average in convention-heavy environments like FastAPI and Django.
The root cause analysis is the practical part: data-layer defects dominate. Incorrect query composition and ORM runtime violations show up as leading failure modes.
That sounds narrow until you remember what production backend work usually is: mostly constraints around data.
The wrong takeaway is "coding agents are useless."
The right takeaway is more uncomfortable: natural-language constraints are too soft to carry production architecture by themselves.
A markdown rule like "use clean architecture" is not the same thing as an import-boundary check. "Use PostgreSQL" is not the same thing as a test container that fails if SQLite sneaks in. "Use the ORM layer" is not the same thing as a lint rule, integration test, or repository contract that catches raw query drift.
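As an illustration of what an import-boundary check can look like, the sketch below uses Python's standard `ast` module. The package names (`app.api`, `app.db`, `app.services`) and the rule table `FORBIDDEN_EDGES` are assumptions made up for this example, not taken from the paper or from any particular project.

```python
import ast

# Hypothetical layering rule: modules under `app.api` must not import
# `app.db` directly; they are expected to go through `app.services`.
FORBIDDEN_EDGES = {
    "app.api": ("app.db",),
}


def imported_modules(source: str) -> list[str]:
    """Collect absolute module names imported by `source`."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.append(node.module)
    return modules


def in_package(module: str, package: str) -> bool:
    return module == package or module.startswith(package + ".")


def boundary_violations(module_name: str, source: str) -> list[str]:
    """Return one message per import that crosses a forbidden boundary."""
    violations = []
    for layer, banned in FORBIDDEN_EDGES.items():
        if not in_package(module_name, layer):
            continue
        for imported in imported_modules(source):
            if any(in_package(imported, b) for b in banned):
                violations.append(f"{module_name} imports {imported}")
    return violations


bad = "from app.db.models import User\n"
good = "from app.services.users import get_user\n"
print(boundary_violations("app.api.users", bad))   # ['app.api.users imports app.db.models']
print(boundary_violations("app.api.users", good))  # []
```

Run as part of the test suite, a check like this turns "use clean architecture" from a sentence the agent may forget into a failure it has to fix.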
Agents follow the easiest path through the task. If the task is "make the tests pass," they will optimize for visible behavior. If architecture rules live only in prose, those rules compete with everything else in the context window: the user prompt, the generated files, error logs, package docs, previous failed attempts, and the model's prior assumptions about the framework.
That is constraint decay. The rules are present, but their force fades as the implementation gets longer and the agent has more local problems to solve.
This also explains why AI code review is becoming the bottleneck. Human reviewers are often catching structural drift after the agent already passed a narrow behavior test. That is expensive review work because the code looks complete.
The Hacker News thread had better pushback than the usual "LLMs do not think" loop.
One strong objection was that the paper did not fully test frontier coding-agent configurations across the whole benchmark because of cost. That matters. If you are using a coding-specialized model with a mature harness, you should not read this as a ranking of every production tool.
Another good point: statically typed codebases may be easier for agents because the compiler becomes a constant verifier. A Go or TypeScript project with strict types gives the agent fast, precise feedback. A loose Python or JavaScript backend can pass early checks while hiding structural drift until runtime.
The most useful comments were about harnesses. Engineers reported better results when they included constraints during planning, linked architecture docs from AGENTS.md, referenced exemplar files, and let the agent run tests and fix failures over multiple rounds.
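The multi-round pattern those commenters describe can be sketched as a small loop. This is an assumed shape, not any specific tool's API: `propose` stands in for a model call, `checks` is a list of plain functions that return failure messages, and `fake_agent` and `no_sqlite` are toy stand-ins so the example runs on its own.

```python
from typing import Callable

Check = Callable[[str], list[str]]  # takes code, returns failure messages


def run_with_feedback(
    propose: Callable[[str, list[str]], str],
    checks: list[Check],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, list[str], int]:
    """Ask for code, run every check, and feed failures back until clean."""
    failures: list[str] = []
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = propose(task, failures)
        failures = [msg for check in checks for msg in check(code)]
        if not failures:
            return code, failures, round_no
    # Out of rounds: return the last attempt with its remaining failures.
    return code, failures, max_rounds


def fake_agent(task: str, failures: list[str]) -> str:
    # Stand-in for a model call: it only fixes the driver after being told.
    return "import psycopg2" if failures else "import sqlite3"


def no_sqlite(code: str) -> list[str]:
    return ["sqlite3 is not allowed"] if "sqlite3" in code else []


code, failures, rounds = run_with_feedback(fake_agent, [no_sqlite], "build the API")
print(rounds, failures)  # 2 []
```

The loop is deliberately dumb. What matters is that structural checks sit in the same feedback path as behavioral tests, so a violation is reported to the agent instead of surfacing in human review.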
That matches what we have seen in Forge and local agent reliability: the model is only one part of the system. The harness around the model often decides whether a near-miss becomes a working patch or a silent architecture violation.
There is a fair counterargument: the benchmark may punish agents for not matching a particular implementation style even when the behavior works.
The authors anticipated some of that by using behavioral tests decoupled from internal code structure and conservative static verifiers. They also report that verifier artifacts do not explain away the decay signal. Still, any architecture benchmark has taste baked into it. "Clean architecture" is not a law of physics. Framework idioms differ. Some teams intentionally break layers for performance, simplicity, or migration reasons.
That means the paper should not be used as a cudgel against every agent-generated backend. Use it as a warning about where benchmarks, demos, and real production work diverge.
Loose product demos reward functional completion. Production backends reward constrained completion.
Those are different jobs.
The fix is not a longer instruction file. Longer prose helps only until it becomes more context the model can half-remember.
Use executable constraints.
This is also where parallel coding agents need merge discipline. The more agents you run, the more you need project-level constraints that do not depend on each worker remembering the same paragraph.
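As one example of a project-level executable constraint, a rule such as "use the ORM layer" can be approximated with a small static check. The sketch below is a heuristic under stated assumptions: it only flags `.execute()` calls whose first argument is a literal string that starts with a SQL keyword, and the function name `raw_sql_calls` is hypothetical. It will miss f-strings and queries built elsewhere, so treat it as a starting point rather than a complete lint rule.

```python
import ast

SQL_KEYWORDS = ("select ", "insert ", "update ", "delete ")


def raw_sql_calls(source: str) -> list[int]:
    """Return line numbers where .execute() receives a literal SQL string."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "execute"):
            continue
        if not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            if first.value.lstrip().lower().startswith(SQL_KEYWORDS):
                lines.append(node.lineno)
    return sorted(lines)


snippet = (
    "def load(session, user_id):\n"
    "    return session.execute('SELECT * FROM users WHERE id = 1')\n"
)
print(raw_sql_calls(snippet))  # [2]
```

Because the check lives in the repository, every parallel worker fails against the same rule, regardless of what each one kept in context.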
Agent frameworks should treat constraint survival as a first-class metric.
The OpenAI Agents SDK, LangGraph, Claude Code, Codex, Cursor, and other agent systems all improve the loop around the model. Tools, traces, permissions, sub-agents, planning modes, and evals matter because they can keep the task anchored when context gets noisy.
But the next useful step is not just better orchestration. It is constraint-aware orchestration.
An agent runtime should know:
That sounds less magical than "agent builds the whole backend." It is also closer to how real teams ship.
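Treating constraint survival as a metric could be as simple as recording which checks pass at each checkpoint of a run and tracking the fraction that survive. The sketch below assumes a hypothetical `history` format (one dict of check name to pass/fail per checkpoint), and the constraint names are invented for illustration.

```python
def constraint_survival(history: list[dict[str, bool]]) -> list[float]:
    """Fraction of constraints still passing at each checkpoint."""
    rates = []
    for snapshot in history:
        # An empty snapshot has nothing to violate, so count it as 1.0.
        rates.append(sum(snapshot.values()) / len(snapshot) if snapshot else 1.0)
    return rates


# Hypothetical run: all rules hold early, then two drift as the code grows.
history = [
    {"postgres_only": True, "orm_only": True, "layering": True},
    {"postgres_only": True, "orm_only": False, "layering": True},
    {"postgres_only": True, "orm_only": False, "layering": False},
]
print([round(rate, 2) for rate in constraint_survival(history)])  # [1.0, 0.67, 0.33]
```

A falling curve like this is the decay signal. A runtime that sees it can stop and repair the violated rule before adding more code on top.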
Constraint decay is a better phrase than "AI wrote bad code" because it points to a fixable system problem.
The future of coding agents is not raw model output poured into production. It is agents inside development systems that make constraints executable: types, tests, policy checks, architecture rules, generated contracts, review gates, and replayable traces.
Use the model for speed. Use the harness for memory. Use deterministic checks for authority.
That is the post: if your coding agent keeps ignoring architecture, stop trying to prompt harder. Make the architecture something the agent can fail against.
What is constraint decay?
Constraint decay is the pattern where an AI coding agent can satisfy a loose functional task, but loses reliability as architectural, database, ORM, and framework constraints accumulate. The code may look complete while drifting away from production rules.
Does this mean coding agents cannot build backends?
No. It means backend generation needs stronger harnesses than a prompt and a basic test suite. Agents can be useful when architecture rules, data contracts, tests, and review gates are executable.
How do you reduce constraint decay?
Turn important rules into checks. Use strict types, integration tests, import-boundary rules, schema validation, generated clients, architecture tests, and exemplar files. Put the agent in a loop where structural drift fails before human review.
Is any model immune to constraint decay?
No model should be assumed immune. Better models and stronger coding harnesses help, but production constraints still need verification. The more rules a task has, the more important deterministic checks become.
Should you still write markdown instructions for agents?
Yes, but markdown should route the agent to executable truth. Use AGENTS.md or similar files to point at architecture docs, commands, exemplars, and checks. Do not rely on prose alone for rules that must never drift.