Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Authors: Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang
arXiv ID: 2607.01211
Problem: Popular performance-optimization benchmarks for coding agents suffer from runtime instability, scoring-rule artifacts, and saturation, so their leaderboard scores are unreliable indicators of true coding-agent progress.
Key Methodology:
- Replayed 740 official reference patches across GSO, SWE-Perf, and SWE-fficiency on 4 Google Cloud machine types to measure cross-machine validity.
- Analyzed ranking disagreements among 8 public submissions shared by GSO and SWE-fficiency and quantified the weight distortion in SWE-fficiency's scoring rule.
- Evaluated 10 public submissions per task to measure benchmark saturation (how many tasks are already solved by at least one submission).
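The ranking-disagreement analysis in the second bullet can be illustrated with a minimal sketch. It assumes each leaderboard is a mapping from submission name to a higher-is-better score; the function name `pairwise_disagreements` and the scores are hypothetical placeholders, not the paper's code or data.

```python
from itertools import combinations


def pairwise_disagreements(scores_a, scores_b):
    """Count submission pairs that two leaderboards order differently.

    scores_a and scores_b map submission name -> score (higher is better).
    Only submissions present on both leaderboards are compared.
    Returns (number of disagreeing pairs, total number of pairs).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    total = 0
    disagree = 0
    for x, y in combinations(shared, 2):
        total += 1
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        # Opposite signs mean the leaderboards rank this pair in opposite order.
        if diff_a * diff_b < 0:
            disagree += 1
    return disagree, total


# Hypothetical scores for 8 placeholder submissions (not real leaderboard data).
bench_a = {f"s{i}": float(i) for i in range(8)}
bench_b = dict(bench_a)
bench_b["s0"], bench_b["s1"] = bench_b["s1"], bench_b["s0"]  # flip one adjacent pair

disagree, total = pairwise_disagreements(bench_a, bench_b)
print(disagree, total)  # prints "1 28"
```

Eight shared submissions always produce 28 unordered pairs, which is the denominator used in the results below. Ties are counted as agreement in this sketch; a real analysis would need an explicit tie policy.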
Key Results:
- Only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks had reference patches that satisfied validity rules across all machine types; SWE-Perf is especially fragile due to near-zero runtime changes.
- The official GSO and SWE-fficiency rankings disagreed on 9 of the 28 pairwise comparisons among the 8 shared submissions; SWE-fficiency's leaderboard scoring assigned the worst 10 tasks 58.5%–82.8% of the total weight.
- At least one public submission matched or beat the reference patch on 85.3% (384/450) of replay-valid tasks and beat the unoptimized baseline on 99.8% (449/450).
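The saturation figures in the last bullet follow from a simple per-task check, sketched below. The sketch assumes each task records a baseline runtime, a reference-patch runtime, and one runtime per submission (seconds, lower is better); the field names, the exact comparison rule, and the sample tasks are hypothetical and may differ from the paper's criteria.

```python
def saturation_rates(tasks):
    """Return (fraction matched-or-beat reference, fraction beat baseline).

    Each task is a dict with "baseline" and "reference" runtimes and a
    list of "submissions" runtimes. A task counts as saturated when its
    best (lowest) submission runtime is no slower than the reference.
    """
    matched_reference = 0
    beat_baseline = 0
    for task in tasks:
        best = min(task["submissions"])
        if best <= task["reference"]:
            matched_reference += 1
        if best < task["baseline"]:
            beat_baseline += 1
    n = len(tasks)
    return matched_reference / n, beat_baseline / n


# Hypothetical tasks, for illustration only.
tasks = [
    {"baseline": 10.0, "reference": 5.0, "submissions": [6.0, 4.5]},  # beats reference
    {"baseline": 8.0, "reference": 2.0, "submissions": [3.0, 7.0]},   # beats baseline only
    {"baseline": 4.0, "reference": 3.0, "submissions": [4.0, 4.5]},   # beats neither
    {"baseline": 6.0, "reference": 3.0, "submissions": [3.0]},        # matches reference
]
print(saturation_rates(tasks))  # prints "(0.5, 0.75)"

# The reported percentages are consistent with the reported counts.
print(round(100 * 384 / 450, 1), round(100 * 449 / 450, 1))  # prints "85.3 99.8"
```

A production check would likely add a noise tolerance to both comparisons, since the cross-machine replay results above show that small runtime differences are unstable.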
Applied Context: Builders should not treat leaderboard scores from these benchmarks as definitive measures of agent capability. Prefer per-task performance signals, and focus on settings where the benchmarks still leave meaningful headroom rather than hitting ceiling effects.
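To see why per-task signals can be safer than a single aggregate, the sketch below shows how one kind of aggregate concentrates weight on the worst tasks. This summary does not state SWE-fficiency's scoring rule, so the harmonic mean used here is an assumed stand-in chosen for illustration, and the function name `worst_k_weight_share` and the ratios are hypothetical.

```python
def worst_k_weight_share(ratios, k=10):
    """Share of a harmonic mean's reciprocal sum owned by the k lowest ratios.

    A harmonic mean is n / sum(1 / r_i), so each task's influence on the
    aggregate is proportional to 1 / r_i. Low ratios therefore dominate.
    """
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    reciprocals = sorted((1.0 / r for r in ratios), reverse=True)
    return sum(reciprocals[:k]) / sum(reciprocals)


# Hypothetical per-task ratios: 10 near-failures and 90 tasks at parity.
ratios = [0.05] * 10 + [1.0] * 90
share = worst_k_weight_share(ratios, k=10)
print(f"{share:.3f}")
```

In this made-up example, 10% of the tasks hold roughly 69% of the weight, which is why a table of per-task results alongside any aggregate gives a clearer picture of where an agent actually improved.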