TL;DR
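SWE-Bench-style grading misclassifies too many patches: Cognition reports that FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro. FrontierCode replaces test passing with mergeability as the metric - and the scores are sobering for every AI coding tool on the market.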
| Source | Link |
|---|---|
| FrontierCode benchmark announcement | cognition.ai/blog/frontier-code |
| METR SWE-Bench false positive analysis | metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main |
| O'Reilly: AI Code Review Only Catches Half of Your Bugs | oreilly.com/radar/ai-code-review-only-catches-half-of-your-bugs |
| Vellum: Claude Fable 5 and Mythos 5 Benchmarks Explained | vellum.ai/blog/claude-fable-5-and-mythos-5-benchmarks-explained |
| Snyk: AI Hallucinations in Code | snyk.io/blog/ai-hallucinations |
Last updated: June 10, 2026
When a model scores 90% on a coding benchmark, does that mean 90% of its code would actually ship? According to Cognition's new FrontierCode evaluation, the answer is closer to: no, and the gap is large enough to matter for your review process today.
FrontierCode flips the benchmark question from "does this pass tests?" to "would a maintainer actually merge this?" The results are the most important benchmark story in months, and they carry real implications for how teams should think about AI code review.
SWE-Bench Verified and SWE-Bench Pro became the default yardsticks for AI coding capability. A model's SWE-Bench score became a proxy for production readiness. The problem is that the yardstick is broken in a specific and measurable way.
METR's analysis of SWE-Bench passing submissions found that high-scoring models routinely produce patches that would not be accepted by the actual project maintainers. Cognition quantifies this in FrontierCode's methodology: their benchmark produces 81% fewer misclassification errors than SWE-Bench Pro.
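The misclassification errors come from two directions: false positives, where a patch passes the tests but a maintainer would not merge it, and false negatives, where an acceptable patch fails tests that only recognize one implementation.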
Both types corrupt the signal. A model optimized for SWE-Bench can game both failure modes: exploit test coverage gaps to pass with minimal changes, or pattern-match to the expected output format without actually solving the underlying problem.
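To make "fewer misclassification errors" concrete, here is a minimal sketch of how a misclassification rate could be computed when a benchmark's verdicts are compared against maintainer merge decisions. The `Verdict` type, the function names, and the sample verdicts are all invented for illustration; none of this is FrontierCode or METR data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    benchmark_pass: bool  # did the benchmark accept the patch?
    would_merge: bool     # would the maintainer merge it? (treated as ground truth)


def misclassification_rate(verdicts: list[Verdict]) -> float:
    """Fraction of patches where the benchmark disagrees with the maintainer."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    false_pos = sum(v.benchmark_pass and not v.would_merge for v in verdicts)
    false_neg = sum(v.would_merge and not v.benchmark_pass for v in verdicts)
    return (false_pos + false_neg) / len(verdicts)


def relative_reduction(baseline: float, improved: float) -> float:
    """How much lower `improved` is than `baseline`, as a fraction of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical verdicts: one false positive and one false negative out of four.
sample = [
    Verdict(benchmark_pass=True, would_merge=True),
    Verdict(benchmark_pass=True, would_merge=False),   # false positive
    Verdict(benchmark_pass=False, would_merge=True),   # false negative
    Verdict(benchmark_pass=False, would_merge=False),
]
```

Note that a figure like "81% fewer" is a relative reduction: it says the improved rate is 19% of the baseline rate, and says nothing on its own about how large either absolute rate is.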
SWE-Bench was designed for less capable models. Those models needed tightly specified tasks with deterministic grading. Today's frontier models can handle ambiguous, realistic prompts - which means the benchmark is no longer the limiting factor. The models have outpaced the measurement.
Cognition built FrontierCode around a single question: would the project maintainer merge this PR?
That required going to the source. FrontierCode recruited 20+ world-class open-source developers to build tasks directly from the repositories they maintain. Represented projects include repos with 28,000 to 37,000 GitHub stars. Each maintainer spent more than 40 hours per task - across multiple rounds of review with Cognition researchers - to distill their judgment into concrete evaluation criteria.
The benchmark covers 150 tasks total, organized into three nested difficulty subsets: Diamond (the hardest 50 tasks), Main (100 tasks), and Extended (all 150).
That 40-hour-per-task investment means the criteria are durable. The rubric for each task includes blocker criteria (hard stops that must pass for the solution to count) and non-blocker quality signals (style, type safety, readability). A solution that passes all blockers receives a weighted aggregate score across all rubric items. A solution that fails any blocker gets a zero.
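The blocker rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not Cognition's implementation: the `RubricItem` type, the weights, and the choice of a weighted pass fraction as the "weighted aggregate" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    name: str
    weight: float          # relative importance; values here are hypothetical
    passed: bool
    blocker: bool = False  # a failed blocker zeroes the whole solution


def rubric_score(items: list[RubricItem]) -> float:
    """Return 0.0 if any blocker fails, else the weighted pass fraction in [0, 1]."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


# A solution that clears its blocker but misses one quality signal.
partial = [
    RubricItem("fixes the reported bug", weight=3.0, passed=True, blocker=True),
    RubricItem("keeps type annotations", weight=1.0, passed=False),
]

# The same solution with the blocker failed: quality signals no longer matter.
blocked = [
    RubricItem("fixes the reported bug", weight=3.0, passed=False, blocker=True),
    RubricItem("keeps type annotations", weight=1.0, passed=True),
]
```

Here `rubric_score(partial)` gives 0.75 and `rubric_score(blocked)` gives 0.0, which is the asymmetry the rubric is designed to enforce: polish cannot compensate for a hard stop.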
The evaluation methodology is also meaningfully different from prior benchmarks:
| Method | What it checks |
|---|---|
| Classical unit tests | Behavioral correctness |
| Reverse-classical tests | Agent's tests must fail on the broken base commit - confirming the agent understood the problem |
| Adaptive classical grading | LLM-patched tests that handle valid alternative implementations |
| Scope checks | Diff size, file boundaries, semantic locality |
| Prompt-based LLM grading | Code quality, codebase conventions, design patterns |
The reverse-classical method is particularly interesting. It runs the agent's own test suite against the original, unfixed codebase. If the tests do not fail on the broken code, the agent did not understand the problem well enough to test for it. This catches a common failure mode: agents that write tests designed to pass rather than tests designed to verify.
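A toy version of the reverse-classical check, with in-memory functions standing in for the base and patched commits. The helper names are hypothetical, and the rule used here (tests must pass on the patch and at least one must fail on the base) is an assumption about how "fail on the broken base commit" would be applied; a real harness would check out each commit and run the project's test runner instead.

```python
def passes(test, impl) -> bool:
    """Run one test against an implementation; AssertionError means failure."""
    try:
        test(impl)
    except AssertionError:
        return False
    return True


def reverse_classical_check(tests, base_impl, patched_impl) -> bool:
    """True only if the tests pass on the patch and at least one fails on the base."""
    passes_on_patch = all(passes(t, patched_impl) for t in tests)
    fails_on_base = any(not passes(t, base_impl) for t in tests)
    return passes_on_patch and fails_on_base


def buggy_abs(x):
    """Stand-in for the broken base commit: wrong for negative inputs."""
    return x


def weak_test(impl):
    # Passes on both versions, so it proves nothing about the fix.
    assert impl(3) == 3


def real_test(impl):
    # Exercises the actual bug, so it fails on the base commit.
    assert impl(-3) == 3
```

With these definitions, `reverse_classical_check([weak_test], buggy_abs, abs)` is `False` and `reverse_classical_check([real_test], buggy_abs, abs)` is `True`. An empty test list is also rejected, since nothing fails on the base.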
FrontierCode Diamond scores show the current state of the field clearly:
| Model | Diamond Score |
|---|---|
| Claude Opus 4.8 | 13.4% |
| GPT-5.5 | 6.3% |
| Gemini 3.1 Pro | 4.7% |
| GPT-5.4-mini | 4.6% |
| Claude Sonnet 4.6 | 3.5% |
| Kimi K2.6 (best open-source) | 3.8% |
| MiniMax M2.7 | 2.4% |
The best-performing model - Claude Opus 4.8 - scores 13.4% on Diamond, the hardest subset. On FrontierCode Main (100 tasks), Opus 4.8 reaches 34.3%. On Extended (all 150), it reaches 51.8%.
One cost-efficiency note from Cognition's analysis: GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, though it scores 6.3% on Diamond against Opus 4.8's 13.4%. For teams optimizing cost-per-task rather than raw ceiling, that tradeoff matters.
The open-source gap is notable. Kimi K2.6, the best-performing open-source model, scores 3.8% on Diamond and 16% on Main - well behind the frontier. For teams considering self-hosted models for cost or privacy reasons, these numbers are a realistic baseline for what they are giving up.
The gap between "passes tests" and "gets merged" is not random. FrontierCode Diamond tasks surface a specific category of difficulty that automated tests cannot reliably measure: decisions that require inferring the maintainer's intent.
Cognition's example task illustrates this well. The task asks a model to encapsulate all warning logs behind a new LOG_WARNING() function and use it consistently across the codebase. Claude Opus 4.8 consistently writes code that is behaviorally equivalent but architecturally wrong:
```cpp
// Opus 4.8 approach - mixes LOG_WARNING() and std::cerr
LOG_WARNING() << "You are opting in to remove schema identifiers...\n";
std::cerr << "The only legit use case...\n";

// Maintainer-preferred approach - chains through LOG_WARNING()
LOG_WARNING() << "You are opting in to remove schema identifiers...\n"
              << "The only legit use case...\n";
```
Both compile and produce the same output today. But the first implementation bakes in the assumption that LOG_WARNING() and std::cerr are the same stream - which might not hold if LOG_WARNING() is modified later. The maintainer knows this. The model does not.
This is a trust boundary the model cannot infer from the code alone. Mergeability requires understanding:
These are design decisions, not implementation details. And design decisions are exactly where AI coding tools reach their limits.
The FrontierCode scores are not an accident of benchmark design. They reflect a fundamental constraint on what automated code analysis can find.
Andrew Stellman's O'Reilly Radar analysis frames this as the "intent ceiling." Structural analysis tools - linters, static analyzers, AI code reviewers that work without requirements context - can detect implementation bugs: buffer overflows, SQL injection patterns, null pointer dereferences, race conditions. These are pattern-matchable.
But roughly 50% of security defects are not implementation bugs. They are design flaws: missing authorization checks, unspecified trust boundaries, security properties that were never written down. NIST SATE evaluations found that the best static analysis tools plateaued at 50-60% detection rates for security vulnerabilities. A 2024 study by Charoenwet et al. (ISSTA 2024) tested five static analysis tools against 815 real vulnerability-contributing commits and found that 22% of vulnerable commits went entirely undetected.
The pattern is consistent across two decades of research: there is a ceiling on what you can find by analyzing code structure alone, and it is around 50%.
AI code review, as it exists today, works on the same side of that ceiling that static analysis always has. It can ask "does this look right?" - but it cannot ask "does this do what it was supposed to do?" without knowing what it was supposed to do.
FrontierCode Diamond scores in the single digits for most models are a direct expression of this constraint. The hardest 50 tasks are exactly the tasks where intent matters most. You can read more about the category of failures that escape current review workflows in our post on constraint decay in AI coding agents.
These benchmark results do not mean AI coding tools are not useful - they are. They mean the failure modes are more specific than "the AI made a mistake."
For code review workflows today, the FrontierCode results suggest a practical split:
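Where current AI review adds reliable value: pattern-matchable implementation bugs, such as buffer overflows, SQL injection patterns, null pointer dereferences, and race conditions.
Where human review remains essential: design decisions that depend on intent, such as missing authorization checks, unspecified trust boundaries, and architectural choices a maintainer would make differently.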
The models scoring at 13-34% on FrontierCode are not randomly failing. They are failing at the second category. If your review process relies on AI to catch things in that second category, the benchmark data suggests you are taking on more risk than the tooling can currently absorb.
One workflow adjustment that the O'Reilly research suggests: feed the AI your intent, not just your code. Design rationale sitting in commit messages, architecture decision records, and issue threads gives the model the context it needs to evaluate whether a PR actually fulfills the requirement - not just whether it compiles and passes tests. For more on how this pattern applies to agent memory, see our post on why agent memory benchmarks are not enough.
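One way to act on that advice is to assemble the intent material ahead of the diff before anything is sent to a reviewer model. The sketch below only builds the prompt text and calls no model API. The function name, the section labels, and the instruction wording are hypothetical choices made for this example.

```python
def build_review_prompt(diff: str, intent_sources: dict[str, list[str]]) -> str:
    """Assemble a review prompt that puts stated intent ahead of the diff."""
    sections = []
    for label, entries in intent_sources.items():
        kept = [entry.strip() for entry in entries if entry.strip()]
        if kept:  # skip sources that contributed nothing
            bullets = "\n".join(f"- {entry}" for entry in kept)
            sections.append(f"## {label}\n{bullets}")
    if not sections:
        raise ValueError("no intent context supplied; the reviewer would only see code")
    intent = "\n\n".join(sections)
    return (
        "Review the diff against the stated intent below. "
        "Flag anything that compiles but contradicts it.\n\n"
        f"{intent}\n\n## Diff\n{diff}"
    )


prompt = build_review_prompt(
    diff="- std::cerr << msg;\n+ LOG_WARNING() << msg;",
    intent_sources={
        "Commit messages": ["Route all warnings through LOG_WARNING()"],
        "Architecture decision records": ["   "],  # empty entry, dropped
        "Issue threads": ["Warnings must stay redirectable in one place"],
    },
)
```

Refusing to build a prompt without any intent context is a deliberate choice in this sketch: it makes the "code only" review an explicit decision rather than a silent default.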
Given the current ceiling, the practical approach is not "use AI less" but "use AI for the right half of the problem."
Effective AI-assisted review in 2026 looks like:
You can find more on building review workflows that scale in our post on the AI code review bottleneck.
FrontierCode Diamond is unsaturated. The highest score is 13.4%. That is useful as a signal precisely because there is so much room to improve.
The path from 13% to production-trustworthy requires progress on three fronts:
Better context utilization. Models need to reason about what a codebase is designed to do - not just what it currently does. This means ingesting architecture docs, commit history, design rationale, and issue threads as first-class inputs to code generation and review. The models that will push Diamond scores toward 50% are the ones that can faithfully represent a maintainer's intent from surrounding context.
Reliable scope discipline. The hardest FrontierCode tasks require surgical changes: modify exactly what needs to be modified, nothing else. Current models over-generalize. They refactor while fixing. They clean up adjacent code. Those are good instincts in isolation, but they violate the trust model of a code review. A maintainer reviewing a small bug fix does not want to evaluate an unexpected refactor at the same time.
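A rough scope check of this kind can be run on a patch before a human looks at it. The sketch below assumes git-style unified diffs (each file section starts with a `diff --git` line) and covers only file boundaries and diff size; it makes no attempt at semantic locality. The function name, the report keys, and the sample diff are hypothetical.

```python
def scope_report(diff_text: str, allowed_files: set[str], max_changed_lines: int) -> dict:
    """Summarize which files a git-style diff touches and how many lines it changes."""
    files: set[str] = set()
    changed = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_hunk = False              # new file section: headers follow
        elif line.startswith("@@"):
            in_hunk = True               # hunk header: content lines follow
        elif not in_hunk and line.startswith(("--- ", "+++ ")):
            path = line[4:].strip()
            if path != "/dev/null":      # added or deleted files use /dev/null
                for prefix in ("a/", "b/"):
                    if path.startswith(prefix):
                        path = path[len(prefix):]
                        break
                files.add(path)
        elif in_hunk and line.startswith(("+", "-")):
            changed += 1                 # one added or removed line
    out_of_scope = sorted(files - allowed_files)
    return {
        "files": sorted(files),
        "changed_lines": changed,
        "out_of_scope": out_of_scope,
        "in_scope": not out_of_scope and changed <= max_changed_lines,
    }


SAMPLE_DIFF = """\
diff --git a/src/log.cc b/src/log.cc
--- a/src/log.cc
+++ b/src/log.cc
@@ -10 +10 @@
-std::cerr << msg;
+LOG_WARNING() << msg;
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-Old title
+New title
"""

report = scope_report(SAMPLE_DIFF, allowed_files={"src/log.cc"}, max_changed_lines=10)
```

In this sample the README edit is the kind of adjacent cleanup the paragraph describes: the report lists `README.md` as out of scope even though the change is small and harmless on its own.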
Verifiable requirements integration. The O'Reilly research points at a more fundamental gap: models need a way to check code against stated intent, not just code against code. That requires either better tooling for surfacing requirements context, or new evaluation methods that test whether a model can detect intent violations from natural-language specifications.
FrontierCode gives the industry a durable target to aim at. The current scores are an honest accounting of where the frontier is. Teams that build their review workflows around that honest accounting will catch more of the things that actually matter.
For related analysis, see constraint decay in AI coding agents, what Hacker News gets right about AI coding agents in 2026, and why agent memory benchmarks are not enough.