
TL;DR
A new essay from nrehiew quantifies a problem every Claude Code, Cursor, and Codex user has felt: models making huge diffs for tiny fixes. Here is why it happens, why tests do not catch it, and what to do about it.
If you have spent any time with Claude Code, Cursor, Codex, or GitHub Copilot in the past year, you have lived this moment. You point the agent at a simple off-by-one. You ask for the minimal fix. The bug gets fixed. And then you scroll the diff and half the function is gone. A helper you did not request has appeared. A variable has been renamed because the model thought the new name was clearer. Input validation has been bolted onto a path that never needed it. The whole shape of the function has changed.
A new essay from nrehiew gives this failure mode a name and a measurement. The title is "Coding Models Are Doing Too Much." The subtitle is blunt: "Don't rewrite what isn't broken." It is sitting near the top of Hacker News as I write this, with 290 points and 172 comments in the first few hours. The discussion it has kicked off is one that every team shipping with AI agents needs to have.
The author defines over-editing precisely: a model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires.
The demonstration example is brutal. The bug is a single off-by-one: range(len(x) - 1) should be range(len(x)). The minimal fix is one character. GPT-5.4 with high reasoning effort responds by rewriting the entire function. It adds None checks that nobody asked for. It converts arrays with np.asarray and explicit dtype=float. It adds finite-value masking. It validates array sizes. It changes the signature of the curve_fit call. It replaces the plotting logic entirely. The output passes the tests. The output is a disaster to review.
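To make the contrast concrete, here is a minimal sketch of the same class of bug. This is an illustrative function, not the essay's actual curve-fitting code; the point is that the ground-truth fix is one character, while an over-editing model would also rewrite everything around it.

```python
# Buggy version: range(len(x) - 1) silently drops the last element.
def total_buggy(x):
    s = 0
    for i in range(len(x) - 1):  # bug: skips x[-1]
        s += x[i]
    return s

# Minimal fix: delete " - 1" and touch nothing else.
def total_fixed(x):
    s = 0
    for i in range(len(x)):
        s += x[i]
    return s
```

Both versions have the same shape, the same names, the same signature. Any diff larger than that one deletion is over-editing by the essay's definition.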
This is the kind of failure that tests cannot catch. If the code is functionally correct, every check turns green. Pass@1 stays at 100 percent. The reviewer is the only line of defense, and the reviewer is now staring at fifty changed lines trying to figure out which one fixed the bug and which forty-nine are new surface area to audit.
The default advice for working with AI coding tools is "just write more tests." The logic is that if your tests are good enough, the model cannot ship anything broken past them. That advice is correct but incomplete.
Over-editing is a brown-field failure. The existing code was already understood, was already written the way it was for reasons the team chose deliberately, and was already part of the codebase's shape. The model's job was to fix the bug and nothing else. Instead it made fifty decisions the team did not make and signed your name to them.
Tests verify correctness. They do not verify restraint. They cannot tell you whether the model respected the shape of your code. They cannot tell you whether a refactor snuck in under the cover of a bug fix. They cannot tell you whether the variable names drifted. That verification has to happen in code review, which is already the bottleneck on most teams, and which over-editing makes dramatically more expensive.
Why does this happen? There are three plausible explanations, and my read is that all three are happening at once.
First, training incentive. Most RLHF and preference data rewards thorough, helpful-looking answers. A diff that adds validation, handles edge cases, and improves the function feels more effortful than a one-character change. Annotators give it higher marks. The model learns that big diffs win.
Second, reasoning models are worse. The author calls this out directly. High-reasoning-effort settings make the over-editing problem worse, not better. The model reasons its way into additional changes that feel defensible in chain-of-thought but that were never requested. Reasoning without constraint is expansion without permission.
Third, context loss. Models do not always fully load the context of what the existing code is doing before editing. They regenerate the function from scratch, approximating it, and then merge their approximation back. What looks like a targeted edit is actually a regeneration with drift baked in.
None of those causes go away on their own. The open question is whether we can train for restraint.
The experimental setup is a clean piece of work. Rather than using another LLM to introduce bugs, the author programmatically corrupts 400 problems from BigCodeBench. Corruptions are tiny and mechanical: flipping < to <=, swapping + for -, changing True to False. Each corrupted sample remains syntactically valid and is verified to break the corresponding test cases. The ground-truth edit is therefore exactly the reversal of the corruption and nothing more. The minimal edit is defined by construction.
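The corruption step is simple enough to sketch. This is not the author's harness, just a minimal illustration of the approach under the same assumptions: pick one mechanical mutation from the essay's list, apply it exactly once so the ground-truth fix is a single reversal, and keep the result syntactically valid.

```python
import ast
import random
from typing import Optional

# Mutation types from the essay: comparison flip, operator swap, boolean flip.
MUTATIONS = [
    (" < ", " <= "),
    (" + ", " - "),
    ("True", "False"),
]

def corrupt(source: str, rng: random.Random) -> Optional[str]:
    """Apply exactly one mechanical corruption, or return None if none applies."""
    candidates = [(old, new) for old, new in MUTATIONS if old in source]
    if not candidates:
        return None
    old, new = rng.choice(candidates)
    # Replace only the first occurrence so the minimal edit is one reversal.
    mutated = source.replace(old, new, 1)
    ast.parse(mutated)  # corrupted sample must still be valid Python
    return mutated
```

A real harness would also run the test suite to verify the corruption actually breaks it, which is the step that makes the ground-truth edit well-defined.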
The metric is token-level Levenshtein distance on the Python tokenizer output, not raw character distance. This matters because a rename from add to someotherfunctionname is character-level huge but token-level tiny. Token-level Levenshtein captures the kind of structural change that actually matters for review.
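The metric itself fits in a few lines. A plausible sketch, using the standard library's `tokenize` module and a classic dynamic-programming edit distance (the essay does not publish its exact implementation, so details here are assumptions):

```python
import io
import tokenize

def py_tokens(src: str) -> list:
    """Token strings from Python source, ignoring layout-only tokens."""
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(src).readline)
            if tok.type not in skip]

def levenshtein(a: list, b: list) -> int:
    """Edit distance over token sequences, one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete ta
                           cur[j - 1] + 1,       # insert tb
                           prev[j - 1] + (ta != tb)))  # substitute
        prev = cur
    return prev[-1]
```

On this metric, renaming `add` to `someotherfunctionname` is one substitution per occurrence, even though it is huge at the character level, which is exactly the behavior you want from a review-cost proxy.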
Crucially, the author measures both the model's output against the ground truth and the model's output against the corrupted input. This gives you a clean signal on how much the model diverged from the minimal edit, independent of whether it got the bug right.
You cannot fix this at the model level today. You can work around it at the workflow level, and several of the mitigations are cheap.
First, prompt for restraint explicitly. The instruction "Make the minimal change required to fix this bug. Do not rename, refactor, or add validation unless I ask" is not a magic spell, but it measurably reduces over-editing on Claude Code and Cursor in informal testing. Put it in your system prompt or your CLAUDE.md. Do not assume the model defaults to restraint.
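One way to phrase this as a standing rule (a suggested CLAUDE.md fragment, not official Anthropic guidance; the wording is an assumption beyond the instruction quoted above):

```markdown
## Editing rules

- Make the minimal change required to fix the bug. Prefer one-line diffs.
- Do not rename variables, refactor, reorder code, or add validation
  unless explicitly asked.
- If you believe a refactor is warranted, propose it separately instead
  of bundling it into the fix.
```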
Second, review the diff before you review the result. Pass@1 success tells you nothing about how much noise the model produced getting there. Look at the diff size. If the diff is larger than the bug, read it line by line. If the diff is larger than ten lines for a single-character bug, revert and re-prompt.
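You can even make the size check mechanical before a human looks at anything. A minimal sketch using the standard library's `difflib` (the ten-line threshold comes from the heuristic above; everything else here is an assumption):

```python
import difflib

def diff_size(before: str, after: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

def should_rereview(before: str, after: str, threshold: int = 10) -> bool:
    """Flag a diff that is suspiciously large for a small bug fix."""
    return diff_size(before, after) > threshold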
Third, keep bug-fix commits and refactor commits separate. If the model wants to clean up the function, let it, but in a separate commit that you can review as a refactor. Do not let refactors tunnel under bug fixes into the history. Future you, reading git blame in six months, will thank present you.
Fourth, turn down the reasoning dial. Counterintuitively, lower-reasoning-effort settings often produce cleaner diffs than high-reasoning-effort settings on small bugs. If your agent has a reasoning slider, use it. A maximum-reasoning agent is not always the agent you want on a two-character fix.
Fifth, consider tools that scope the agent's edit surface. The new generation of agent harnesses is experimenting with edit-range constraints, where the agent literally cannot modify lines outside a specified span. That is a more durable solution than prompting and one worth tracking as the primitives mature.
Every coding agent we use today is an optimizer for passing tests. None of them are optimizers for minimal diffs. The benchmarks that the industry reports are pass rates, not edit distances. Until the scoreboards change, the model behavior will not change.
The nrehiew essay is useful because it argues for a second axis on the scoreboard. Correct and minimal, not correct or maximal. If that framing catches on in the next wave of benchmarks, the shape of the agents we build will follow. In the meantime, the restraint has to come from the prompt, the review, and the commit discipline.
If you are shipping with an AI coding agent in production, this is a problem worth understanding now. Not because it blocks the work, but because it quietly makes every code review, every merge conflict, and every history bisect more expensive than it needs to be.
Read the full essay. It is worth the twenty minutes.