Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Authors: Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang

arXiv ID: 2607.01211

Problem: Leaderboard scores on popular performance-optimization benchmarks for coding agents are confounded by runtime instability, scoring-rule artifacts, and saturation effects, which makes them unreliable indicators of true coding-agent progress.

Key Methodology:

Replayed 740 official reference patches across GSO, SWE-Perf, and SWE-fficiency on 4 Google Cloud machine types to measure cross-machine validity.
Analyzed ranking disagreements among the 8 public submissions that GSO and SWE-fficiency have in common, and quantified the weight distortion in SWE-fficiency's scoring rule.
Evaluated 10 public submissions per task to measure benchmark saturation (how many tasks are already solved by at least one submission).
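
The ranking-disagreement analysis above can be sketched as a count of submission pairs that two leaderboards order differently. This is a minimal illustration, not the authors' code: the function name `pairwise_disagreements`, the submission names, and every score below are made up. With 8 shared submissions there are 8 × 7 / 2 = 28 pairs to compare.

```python
from itertools import combinations


def pairwise_disagreements(scores_a, scores_b):
    """Count submission pairs ranked in opposite order by two leaderboards.

    scores_a, scores_b: dicts mapping submission name -> leaderboard score
    (higher is better). Only submissions present in both are compared.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(shared, 2))
    flipped = [
        (x, y)
        for x, y in pairs
        # Opposite signs mean the two leaderboards order x and y differently.
        if (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) < 0
    ]
    return flipped, len(pairs)


# Made-up scores for 8 hypothetical submissions (not data from the paper).
board_a = {f"sub{i}": float(i) for i in range(8)}
board_b = dict(board_a)
# Swap the scores of one adjacent pair on the second leaderboard.
board_b["sub0"], board_b["sub1"] = board_b["sub1"], board_b["sub0"]

flipped, total = pairwise_disagreements(board_a, board_b)
print(f"{len(flipped)} of {total} pairs disagree")  # 1 of 28 pairs disagree
```

Ties are treated as agreement in this sketch; a stricter analysis might count them separately.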

Key Results:

Only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks had reference patches that satisfied validity rules across all machine types; SWE-Perf is especially fragile due to near-zero runtime changes.
The official GSO and SWE-fficiency rankings disagreed on 9 of the 28 pairwise comparisons among their 8 shared submissions; SWE-fficiency's leaderboard scoring assigned the worst 10 tasks 58.5%–82.8% of the total weight.
At least one public submission matched or beat the reference patch on 85.3% (384/450) of replay-valid tasks and beat the unoptimized baseline on 99.8% (449/450).
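
The validity and saturation figures above can be sanity-checked with a short sketch. Everything here is an assumption made for illustration: the validity rule (the reference patch must beat the baseline on every machine), the machine names, the field names, and the toy speedups. The benchmarks' actual validity rules may differ. The denominator 450 equals 39 + 411, which matches the GSO and SWE-fficiency replay-valid counts, although the summary does not state that link explicitly.

```python
def replay_valid(speedups_by_machine, threshold=1.0):
    """Hypothetical validity rule: the reference patch must beat the
    baseline (speedup > threshold) on every machine type."""
    return all(s > threshold for s in speedups_by_machine.values())


def saturation(tasks):
    """tasks: list of dicts with 'reference' (reference-patch speedup) and
    'submissions' (speedups achieved by public submissions).

    Returns (share of tasks where the best submission matched or beat the
    reference, share of tasks where the best submission beat the baseline).
    """
    n = len(tasks)
    beat_ref = sum(max(t["submissions"]) >= t["reference"] for t in tasks)
    beat_base = sum(max(t["submissions"]) > 1.0 for t in tasks)
    return beat_ref / n, beat_base / n


# Toy inputs with placeholder machine names (not data from the paper).
stable = replay_valid({"machine_a": 1.4, "machine_b": 1.3})
fragile = replay_valid({"machine_a": 1.02, "machine_b": 0.98})
toy_tasks = [
    {"reference": 2.0, "submissions": [1.5, 2.5]},
    {"reference": 3.0, "submissions": [1.2, 1.1]},
]
matched_share, baseline_share = saturation(toy_tasks)

# Check the reported percentages against their raw counts.
print(round(100 * 384 / 450, 1))  # 85.3
print(round(100 * 449 / 450, 1))  # 99.8
```

The `fragile` case shows why near-zero runtime changes are risky: a 2% gain on one machine can become a 2% loss on another, which flips the validity verdict.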

Applied Context: Builders should not treat leaderboard scores from these benchmarks as definitive measures of agent capability. Instead, they should examine per-task performance signals and focus on settings where the benchmarks still leave meaningful headroom rather than hitting a ceiling.
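
To illustrate why per-task signals can say more than a single aggregate, the sketch below shows how a handful of bad tasks can dominate a score. It assumes a harmonic-mean aggregate over per-task speedup ratios. This summary does not state the actual scoring rule, so that choice, the function name `worst_k_weight_share`, and all the ratios are assumptions for illustration only.

```python
from statistics import harmonic_mean, mean


def worst_k_weight_share(ratios, k):
    """Share of the harmonic mean's denominator owed to the k lowest ratios.

    Assumption: the aggregate is a harmonic mean, n / sum(1 / r_i), so each
    task's influence is proportional to its reciprocal 1 / r_i.
    """
    reciprocals = sorted((1.0 / r for r in ratios), reverse=True)
    return sum(reciprocals[:k]) / sum(reciprocals)


# Made-up ratios: 90 tasks sped up 2.0x and 10 regressions at 0.2x.
ratios = [2.0] * 90 + [0.2] * 10

arithmetic = mean(ratios)
harmonic = harmonic_mean(ratios)
share = worst_k_weight_share(ratios, 10)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"harmonic mean:   {harmonic:.3f}")
print(f"weight held by worst 10 tasks: {share:.1%}")
```

In this toy setup 10% of the tasks carry more than half of the aggregate's weight, so two agents with similar per-task behavior elsewhere could be ranked almost entirely by those few tasks. Reporting the per-task distribution alongside the aggregate makes that concentration visible.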