
TL;DR
A new essay from nrehiew quantifies a problem every Claude Code, Cursor, and Codex user has felt: models making huge diffs for tiny fixes. Here is why it happens, why tests do not catch it, and what to do about it.
If you have spent any time with Claude Code, Cursor, Codex, or GitHub Copilot in the past year, you have lived this moment. You point the agent at a simple off-by-one. You ask for the minimal fix. The bug gets fixed. And then you scroll the diff and half the function is gone. A helper you did not request has appeared. A variable has been renamed because the model thought the new name was clearer. Input validation has been bolted onto a path that never needed it. The whole shape of the function has changed.
A new essay from nrehiew gives this failure mode a name and a measurement. The title is "Coding Models Are Doing Too Much." The subtitle is blunt: "Don't rewrite what isn't broken." It is sitting near the top of Hacker News as I write this, with 290 points and 172 comments in the first few hours. The discussion it has kicked off is one that every team shipping with AI agents needs to have.
The author defines over-editing precisely: a model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires.
The demonstration example is brutal. The bug is a single off-by-one: range(len(x) - 1) should be range(len(x)). The minimal fix is one character. GPT-5.4 with high reasoning effort responds by rewriting the entire function. It adds None checks that nobody asked for. It converts arrays with np.asarray and explicit dtype=float. It adds finite-value masking. It validates array sizes. It changes the signature of the curve_fit call. It replaces the plotting logic entirely. The output passes the tests. The output is a disaster to review.
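To make the contrast concrete, here is a minimal sketch of the same class of bug. This is an illustrative function, not the essay's actual curve-fitting code; the point is that the ground-truth fix is one character, while an over-editing model would also rewrite everything around it.

```python
# Buggy version: range(len(x) - 1) silently drops the last element.
def total_buggy(x):
    s = 0
    for i in range(len(x) - 1):  # bug: skips x[-1]
        s += x[i]
    return s

# Minimal fix: delete " - 1" and touch nothing else.
def total_fixed(x):
    s = 0
    for i in range(len(x)):
        s += x[i]
    return s
```

Both versions have the same shape, the same names, the same signature. Any diff larger than that one deletion is over-editing by the essay's definition.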
This is the kind of failure that tests cannot catch. If the code is functionally correct, every check turns green. Pass@1 stays at 100 percent. The reviewer is the only line of defense, and the reviewer is now staring at fifty changed lines trying to figure out which one fixed the bug and which forty-nine are new surface area to audit.
The default advice for working with AI coding tools is "just write more tests." The logic is that if your tests are good enough, the model cannot ship anything broken past them. That advice is correct but incomplete.
Over-editing is a brown-field failure. The existing code was already understood, was already written the way it was for reasons the team chose deliberately, and was already part of the codebase's shape. The model's job was to fix the bug and nothing else. Instead it made fifty decisions the team did not make and signed your name to them.
Tests verify correctness. They do not verify restraint. They cannot tell you whether the model respected the shape of your code. They cannot tell you whether a refactor snuck in under the cover of a bug fix. They cannot tell you whether the variable names drifted. That verification has to happen in code review, which is already the bottleneck on most teams, and which over-editing makes dramatically more expensive.
Why does this happen? There are three plausible explanations, and my read is that all three are happening at once.
First, training incentive. Most RLHF and preference data rewards thorough, helpful-looking answers. A diff that adds validation, handles edge cases, and improves the function feels more effortful than a one-character change. Annotators give it higher marks. The model learns that big diffs win.
Second, reasoning models are worse. The author calls this out directly. High-reasoning-effort settings make the over-editing problem worse, not better. The model reasons its way into additional changes that feel defensible in chain-of-thought but that were never requested. Reasoning without constraint is expansion without permission.
Third, context loss. Models do not always fully load the context of what the existing code is doing before editing. They regenerate the function from scratch, approximating it, and then merge their approximation back. What looks like a targeted edit is actually a regeneration with drift baked in.
None of those causes go away on their own. The open question is whether we can train for restraint.
The experimental setup is a clean piece of work. Rather than using another LLM to introduce bugs, the author programmatically corrupts 400 problems from BigCodeBench. Corruptions are tiny and mechanical: flipping < to <=, swapping + for -, changing True to False. Each corrupted sample remains syntactically valid and is verified to break the corresponding test cases. The ground-truth edit is therefore exactly the reversal of the corruption and nothing more. The minimal edit is defined by construction.
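The corruption step is simple enough to sketch. This is not the author's harness, just a minimal illustration of the approach under the same assumptions: pick one mechanical mutation from the essay's list, apply it exactly once so the ground-truth fix is a single reversal, and keep the result syntactically valid.

```python
import ast
import random
from typing import Optional

# Mutation types from the essay: comparison flip, operator swap, boolean flip.
MUTATIONS = [
    (" < ", " <= "),
    (" + ", " - "),
    ("True", "False"),
]

def corrupt(source: str, rng: random.Random) -> Optional[str]:
    """Apply exactly one mechanical corruption, or return None if none applies."""
    candidates = [(old, new) for old, new in MUTATIONS if old in source]
    if not candidates:
        return None
    old, new = rng.choice(candidates)
    # Replace only the first occurrence so the minimal edit is one reversal.
    mutated = source.replace(old, new, 1)
    ast.parse(mutated)  # corrupted sample must still be valid Python
    return mutated
```

A real harness would also run the test suite to verify the corruption actually breaks it, which is the step that makes the ground-truth edit well-defined.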
The metric is token-level Levenshtein distance on the Python tokenizer output, not raw character distance. This matters because a rename from add to someotherfunctionname is character-level huge but token-level tiny. Token-level Levenshtein captures the kind of structural change that actually matters for review.
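The metric itself fits in a few lines. A plausible sketch, using the standard library's `tokenize` module and a classic dynamic-programming edit distance (the essay does not publish its exact implementation, so details here are assumptions):

```python
import io
import tokenize

def py_tokens(src: str) -> list:
    """Token strings from Python source, ignoring layout-only tokens."""
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(src).readline)
            if tok.type not in skip]

def levenshtein(a: list, b: list) -> int:
    """Edit distance over token sequences, one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete ta
                           cur[j - 1] + 1,       # insert tb
                           prev[j - 1] + (ta != tb)))  # substitute
        prev = cur
    return prev[-1]
```

On this metric, renaming `add` to `someotherfunctionname` is one substitution per occurrence, even though it is huge at the character level, which is exactly the behavior you want from a review-cost proxy.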
Crucially, the author measures both the model's output against the ground truth and the model's output against the corrupted input. This gives you a clean signal on how much the model diverged from the minimal edit, independent of whether it got the bug right.
You cannot fix this at the model level today. You can work around it at the workflow level, and several of the mitigations are cheap.
First, prompt for restraint explicitly. The instruction "Make the minimal change required to fix this bug. Do not rename, refactor, or add validation unless I ask" is not a magic spell, but it measurably reduces over-editing on Claude Code and Cursor in informal testing. Put it in your system prompt or your CLAUDE.md. Do not assume the model defaults to restraint.
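One way to phrase this as a standing rule (a suggested CLAUDE.md fragment, not official Anthropic guidance; the wording is an assumption beyond the instruction quoted above):

```markdown
## Editing rules

- Make the minimal change required to fix the bug. Prefer one-line diffs.
- Do not rename variables, refactor, reorder code, or add validation
  unless explicitly asked.
- If you believe a refactor is warranted, propose it separately instead
  of bundling it into the fix.
```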
Second, review the diff before you review the result. Pass@1 success tells you nothing about how much noise the model produced getting there. Look at the diff size. If the diff is larger than the bug, read it line by line. If the diff is larger than ten lines for a single-character bug, revert and re-prompt.
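You can even make the size check mechanical before a human looks at anything. A minimal sketch using the standard library's `difflib` (the ten-line threshold comes from the heuristic above; everything else here is an assumption):

```python
import difflib

def diff_size(before: str, after: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

def should_rereview(before: str, after: str, threshold: int = 10) -> bool:
    """Flag a diff that is suspiciously large for a small bug fix."""
    return diff_size(before, after) > threshold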
Third, keep bug-fix commits and refactor commits separate. If the model wants to clean up the function, let it, but in a separate commit that you can review as a refactor. Do not let refactors tunnel under bug fixes into the history. Future you, reading git blame in six months, will thank present you.
Fourth, turn down the reasoning dial. Counterintuitively, lower-reasoning-effort settings often produce cleaner diffs than high-reasoning-effort settings on small bugs. If your agent has a reasoning slider, use it. A maximum-reasoning agent is not always the agent you want on a two-character fix.
Fifth, consider tools that scope the agent's edit surface. The new generation of agent harnesses is experimenting with edit-range constraints, where the agent literally cannot modify lines outside a specified span. That is a more durable solution than prompting and one worth tracking as the primitives mature.
Every coding agent we use today is an optimizer for passing tests. None of them are optimizers for minimal diffs. The benchmarks that the industry reports are pass rates, not edit distances. Until the scoreboards change, the model behavior will not change.
The nrehiew essay is useful because it argues for a second axis on the scoreboard. Correct and minimal, not correct or maximal. If that framing catches on in the next wave of benchmarks, the shape of the agents we build will follow. In the meantime, the restraint has to come from the prompt, the review, and the commit discipline.
If you are shipping with an AI coding agent in production, this is a problem worth understanding now. Not because it blocks the work, but because it quietly makes every code review, every merge conflict, and every history bisect more expensive than it needs to be.
Read the full essay. It is worth the twenty minutes.