GLM 5.2 Outperforms Claude Code on Semgrep's IDOR Vulnerability Benchmarks

Semgrep's security research team published benchmark results that caught Hacker News's attention: the Chinese open-weight model GLM 5.2 beat Claude Code on IDOR (Insecure Direct Object Reference) vulnerability detection - and did it at roughly one-sixth the cost per finding.

The headline number: GLM 5.2 scored 39% F1 versus 28-37% for Claude Code, depending on the Opus version, with no scaffolding or multi-agent system. Just a prompt and a model.

The Benchmark Setup

Semgrep tested multiple models on a specific security task: finding IDOR vulnerabilities in real, open-source applications. IDOR is a common web vulnerability where an application exposes internal identifiers (like user IDs or order numbers) without proper authorization checks, letting attackers access other users' data by manipulating those identifiers.
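To make the bug class concrete, here is a minimal sketch of an IDOR and its fix. It is not taken from Semgrep's dataset: the `Order` type, the in-memory `ORDERS` store, and both function names are hypothetical, and a real application would do the same check against a database inside its web framework.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    owner_id: int
    total: float


# Hypothetical in-memory store standing in for a database table.
ORDERS = {
    1001: Order(order_id=1001, owner_id=1, total=25.0),
    1002: Order(order_id=1002, owner_id=2, total=99.0),
}


class Forbidden(Exception):
    """Raised when the caller does not own the requested object."""


def get_order_vulnerable(current_user_id: int, order_id: int) -> Order:
    # IDOR: the identifier from the request is trusted as-is.
    # current_user_id is never compared against the order's owner.
    return ORDERS[order_id]


def get_order_checked(current_user_id: int, order_id: int) -> Order:
    order = ORDERS[order_id]
    # Object-level authorization: only the owner may read the order.
    if order.owner_id != current_user_id:
        raise Forbidden(f"user {current_user_id} may not read order {order_id}")
    return order
```

With the vulnerable version, user 1 can read order 1002 simply by changing the ID in the request. The checked version raises `Forbidden` for the same call.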

The researchers held several things constant:

The same IDOR dataset (real applications from prior research)
The same evaluation method (F1 scoring)
The same system prompt

What varied was the model and the harness (the wrapper code that orchestrates the model).
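For readers who don't work with F1 every day, it is the harmonic mean of precision and recall. The sketch below uses the standard definition; the counts in the usage note are made up for illustration, and the article does not describe how Semgrep matched model findings to ground truth.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # No correct findings: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

For example, a hypothetical run with 30 correct findings, 20 false alarms, and 30 missed vulnerabilities has precision 0.60 and recall 0.50, so `f1_score(30, 20, 30)` comes out at roughly 0.55. A model has to both find real bugs and avoid noise to score well.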

The Results

Rank	Model	Harness	F1 Score
1	Semgrep Multimodal (GPT 5.5)	Custom Semgrep	61%
2	Semgrep Multimodal (Opus 4.8)	Custom Semgrep	53%
3	GLM 5.2	Pydantic AI	39%
4	Claude Code (Opus 4.6)	Claude SDK	37%
5	Claude Code (Opus 4.8/4.7)	Claude SDK	28%

The key insight: GLM 5.2 with minimal guidance (just a prompt via Pydantic AI) outperformed Claude Code by 2 to 11 points, depending on the Opus version. The cost? Approximately $0.17 per vulnerability found - about one-sixth what frontier models cost.

Semgrep's own multimodal pipeline with GPT 5.5 still wins overall at 61%, but that system includes endpoint discovery, code filtering, and other scaffolding. The comparison shows what raw model capability looks like versus engineered systems.

What HN Is Saying

The thread drew 59+ comments with strong opinions on both sides.

The skeptics called it marketing. Several commenters noted the narrow scope: "It reads like an ad. Secondly these are 'just' IDORs, arguably the easiest class of vulnerabilities. Thirdly it compares to GPT 5.5 and Opus 4.8. No, we don't have Mythos at home."

The critique is valid - Semgrep explicitly noted this evaluates a single task and may not generalize to other vulnerability types like SSRF.

The open-weight advocates pushed back. Multiple commenters argued that the benchmark's limitations don't diminish its value. One wrote: "GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over."

Another: "In my experience, GLM 5.2 is extremely good at finding vulnerabilities, and more importantly, unlike Opus, I've never seen it refuse a command."

The export control discussion emerged. One commenter predicted: "GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months."

This sparked a thread about the absurdity of the US trying to export-control a Chinese model. Others noted that any such restrictions would only affect American companies while attackers continue using whatever tools they want: "If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety... And meanwhile attackers use equivalent open source models to attack US companies."

The harness vs model distinction came up. A sharp commenter pointed out: "Claude Code is an agent harness, not an LLM. Claude is a brand (or group of models), not an LLM." The benchmark title conflates these - but the article author acknowledged this and argued Claude Code pricing is the best proxy for amortized inference costs.

Practical experiences surfaced. One developer shared weekend results: "I have taken another look on these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars... Two days later and 20 dollars poorer I have what I need: a multimodal agent written in rust that has access to my homelab."

The Technical Context

Several technical points emerged from both the article and discussion:

GLM 5.2 is massive. At 753 billion parameters, running it locally requires serious hardware. Commenters discussed 8x RTX 6000 setups costing $80-100k. For most developers, API access through providers like Fireworks or OpenRouter makes more sense than local deployment.
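A rough back-of-the-envelope calculation shows why. The sketch below estimates memory for the weights alone at a few common precisions. The precisions listed are assumptions for illustration (the article does not say which formats GLM 5.2 ships in), and real deployments also need room for the KV cache and activations.

```python
PARAMS = 753e9  # parameter count quoted in the article

# Bytes per parameter at common precisions (illustrative assumptions).
BYTES_PER_PARAM = {
    "16-bit": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: {weight_memory_gb(PARAMS, nbytes):,.1f} GB")
```

Even at 4 bits per parameter the weights alone come to several hundred gigabytes, which is why the discussion centered on multi-GPU rigs.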

The scaffolding gap is real. Semgrep's 61% result with GPT 5.5 includes endpoint discovery, code filtering, and multi-agent orchestration. GLM 5.2's 39% is with essentially zero scaffolding. The question is whether wrapping GLM 5.2 in similar tooling would close that gap.

Safety guardrails may affect results. One commenter noted that Claude Code with Opus 4.8/4.7 actually performed worse (28%) than with Opus 4.6 (37%). This could be due to increased safety restrictions on newer models - a recurring theme where safety training potentially reduces capability on security research tasks. The benchmark itself does not isolate the cause.

Self-training loops are emerging. A security researcher noted: "These numbers seem pretty low compared to what I was able to achieve specifically around windows kernel... GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic."

Why This Matters

The benchmark reveals a few important trends:

Open weights are catching up. Not across the board, not on every task, but on specific workloads - including security-relevant ones - open models now compete with frontier providers. At 39% F1 versus 28-37%, GLM 5.2 isn't just close; it's ahead of Claude Code on this task.

Cost matters for security. At $0.17 per vulnerability versus $1+ for frontier models, the math changes for automated security scanning. For the same budget you can run roughly six times as many scans, or cover roughly six times as much code.
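The arithmetic behind that ratio is simple enough to check. This sketch uses the two per-finding figures quoted above; the $1.00 frontier cost is a lower bound taken from "$1+", and the $100 budget is a made-up number for illustration.

```python
GLM_COST_PER_FINDING = 0.17       # USD, from the article
FRONTIER_COST_PER_FINDING = 1.00  # USD, lower bound: the article says "$1+"


def findings_per_budget(budget_usd: float, cost_per_finding_usd: float) -> float:
    """Expected number of findings a budget buys at a given cost per finding."""
    return budget_usd / cost_per_finding_usd


cost_ratio = FRONTIER_COST_PER_FINDING / GLM_COST_PER_FINDING

budget = 100.0  # hypothetical monthly scanning budget in USD
glm_findings = findings_per_budget(budget, GLM_COST_PER_FINDING)
frontier_findings = findings_per_budget(budget, FRONTIER_COST_PER_FINDING)
```

Here `cost_ratio` works out to about 5.9, which is where "about one-sixth" comes from. Because $1 is a floor, the real gap may be wider.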

The model vs system distinction is blurring. What beats what depends heavily on the harness. Semgrep's multimodal pipeline with GPT 5.5 destroys everything else at 61%, but that's a product, not a raw model capability. As agentic tooling improves, the "which model wins" question becomes less important than "which system architecture wins."

Regulatory risk is emerging. The thread's discussion of potential export controls on Chinese AI models reflects growing tension between open-source AI development and national security concerns. Whether such controls would be effective (or even enforceable) is debatable, but the fact that people are discussing it signals a shift.

The Bottom Line

Semgrep's benchmark is narrow - one vulnerability type, one evaluation method - but the signal is clear: open-weight models have reached competitive parity on at least some security tasks, at a fraction of frontier model costs.

For security teams doing automated vulnerability scanning, the implication is worth exploring. GLM 5.2 through providers like Fireworks offers a cost-effective alternative that - on this specific task - outperforms Claude Code.

For the broader AI development community, it's another data point in the ongoing debate about open versus closed models. The capability gap that justified frontier model pricing is narrowing faster than some expected.
