
TL;DR
Semgrep's security research team benchmarked LLMs on IDOR vulnerability detection. The open-weight GLM 5.2 beat Claude Code by 7 points at roughly one-sixth the cost.
Semgrep's security research team published benchmark results that caught Hacker News's attention: the Chinese open-weight model GLM 5.2 beat Claude Code on IDOR (Insecure Direct Object Reference) vulnerability detection - and did it at roughly one-sixth the cost per finding.
The headline number: GLM 5.2 scored 39% F1 versus Claude Code's 32%, with no scaffolding or multi-agent system. Just a prompt and a model.
Semgrep tested multiple models on a specific security task: finding IDOR vulnerabilities in real, open-source applications. IDOR is a common web vulnerability where an application exposes internal identifiers (like user IDs or order numbers) without proper authorization checks, letting attackers access other users' data by manipulating those identifiers.
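To make the vulnerability class concrete, here is a minimal sketch of an IDOR and its fix. The `Order` type, the `ORDERS` store, and both function names are hypothetical illustrations, not code from Semgrep's benchmark or from any of the applications tested.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    owner_id: int
    total: float


# Hypothetical in-memory store standing in for a real database.
ORDERS = {
    1: Order(order_id=1, owner_id=100, total=25.0),
    2: Order(order_id=2, owner_id=200, total=99.0),
}


def get_order_vulnerable(current_user_id: int, order_id: int) -> Order:
    """IDOR: looks up the order by ID and never checks who is asking."""
    return ORDERS[order_id]


def get_order_checked(current_user_id: int, order_id: int) -> Order:
    """Same lookup, but rejects callers who do not own the order."""
    order = ORDERS[order_id]
    if order.owner_id != current_user_id:
        raise PermissionError("order does not belong to the current user")
    return order
```

In the vulnerable version, user 100 can read user 200's order simply by requesting `order_id=2`. The fix is a single ownership check, which is part of why a missing one is easy to overlook in a large codebase.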
The researchers held several things constant. What varied was the model and the harness (the wrapper code that orchestrates the model).
| Rank | Model | Harness | F1 Score |
|---|---|---|---|
| 1 | Semgrep Multimodal (GPT 5.5) | Custom Semgrep | 61% |
| 2 | Semgrep Multimodal (Opus 4.8) | Custom Semgrep | 53% |
| 3 | GLM 5.2 | Pydantic AI | 39% |
| 4 | Claude Code (Opus 4.6) | Claude SDK | 37% |
| 5 | Claude Code (Opus 4.8/4.7) | Claude SDK | 28% |
The key insight: GLM 5.2 with minimal guidance (just a prompt via Pydantic AI) outperformed Claude Code by 7 points. The cost? Approximately $0.17 per vulnerability found - about one-sixth what frontier models cost.
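For readers unfamiliar with the metric, F1 is the harmonic mean of precision (how many reported findings were real) and recall (how many real vulnerabilities were found). The sketch below uses the standard definition; the counts in the example are made up for illustration and are not Semgrep's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    reported = true_positives + false_positives
    actual = true_positives + false_negatives
    if reported == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / reported
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to show how the score behaves.
balanced = f1_score(true_positives=39, false_positives=61, false_negatives=61)
cautious = f1_score(true_positives=10, false_positives=0, false_negatives=30)
print(f"balanced: {balanced:.2f}, cautious: {cautious:.2f}")
```

The second case shows why F1 is a demanding metric: a scanner that never raises a false alarm but misses three quarters of the real bugs scores only about 0.40.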
Semgrep's own multimodal pipeline with GPT 5.5 still wins overall at 61%, but that system includes endpoint discovery, code filtering, and other scaffolding. The comparison shows what raw model capability looks like versus engineered systems.
The thread drew 59+ comments with strong opinions on both sides.
The skeptics called it marketing. Several commenters noted the narrow scope: "It reads like an ad. Secondly these are 'just' IDORs, arguably the easiest class of vulnerabilities. Thirdly it compares to GPT 5.5 and Opus 4.8. No, we don't have Mythos at home."
The critique is valid - Semgrep explicitly noted this evaluates a single task and may not generalize to other vulnerability types like SSRF.
The open-weight advocates pushed back. Multiple commenters argued that the benchmark's limitations don't diminish its value. One wrote: "GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over."
Another: "In my experience, GLM 5.2 is extremely good at finding vulnerabilities, and more importantly, unlike Opus, I've never seen it refuse a command."
The export control discussion emerged. One commenter predicted: "GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months."
This sparked a thread about the absurdity of the US trying to export-control a Chinese model. Others noted that any such restrictions would only affect American companies while attackers continue using whatever tools they want: "If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety... And meanwhile attackers use equivalent open source models to attack US companies."
The harness vs model distinction came up. A sharp commenter pointed out: "Claude Code is an agent harness, not an LLM. Claude is a brand (or group of models), not an LLM." The benchmark title conflates these - but the article author acknowledged this and argued Claude Code pricing is the best proxy for amortized inference costs.
Practical experiences surfaced. One developer shared weekend results: "I have taken another look on these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars... Two days later and 20 dollars poorer I have what I need: a multimodal agent written in rust that has access to my homelab."
Several technical points emerged from both the article and discussion:
GLM 5.2 is massive. At 753 billion parameters, running it locally requires serious hardware. Commenters discussed 8x RTX 6000 setups costing $80-100k. For most developers, API access through providers like Fireworks or OpenRouter makes more sense than local deployment.
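A rough back-of-the-envelope estimate shows why local deployment is hard. The sketch below counts only the memory needed to hold the weights, assuming every parameter is stored at the same precision. It ignores the KV cache, activations, and runtime overhead, so real requirements would be higher. The precisions shown are illustrative assumptions, not published details of how GLM 5.2 is served.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_parameters * bits_per_parameter / 8
    return bytes_total / 1e9


GLM_PARAMETERS = 753e9  # parameter count quoted in the article

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(GLM_PARAMETERS, bits):,.1f} GB")
```

Even with aggressive 4-bit quantization the weights alone run to hundreds of gigabytes, which is consistent with commenters discussing multi-GPU rigs rather than single cards.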
The scaffolding gap is real. Semgrep's 61% result with GPT 5.5 includes endpoint discovery, code filtering, and multi-agent orchestration. GLM 5.2's 39% is with essentially zero scaffolding. The question is whether wrapping GLM 5.2 in similar tooling would close that gap.
Safety guardrails may affect results. One commenter noted that Claude Code with Opus 4.8/4.7 actually performed worse (28%) than with the older Opus 4.6 (37%). This could be due to increased safety restrictions on newer models - a recurring theme where safety training potentially reduces capability on security research tasks. The benchmark itself does not establish the cause.
Self-training loops are emerging. A security researcher noted: "These numbers seem pretty low compared to what I was able to achieve specifically around windows kernel... GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic."
The benchmark reveals a few important trends:
Open weights are catching up. Not across the board, not on every task, but on specific workloads - including security-relevant ones - open models now compete with frontier providers. At 39% F1 versus 28-37%, GLM 5.2 isn't just close; it's ahead of Claude Code on this task.
Cost matters for security. At $0.17 per vulnerability versus $1+ for frontier models, the math changes for automated security scanning. You can run six times as many scans for the same budget, or cover six times as much code.
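The "six times" figure can be checked with simple arithmetic. The sketch below takes the article's $0.17 per finding for GLM 5.2 and assumes exactly $1.00 per finding for a frontier model, which is the low end of the "$1+" quoted above. The $100 budget is an arbitrary example.

```python
GLM_COST_PER_FINDING = 0.17       # USD, figure quoted in the article
FRONTIER_COST_PER_FINDING = 1.00  # USD, assumed lower bound of "$1+"


def findings_for_budget(budget_usd: float, cost_per_finding_usd: float) -> int:
    """Whole number of findings a budget covers at a given per-finding cost."""
    return int(budget_usd / cost_per_finding_usd)


budget = 100.0  # arbitrary example budget in USD
glm_findings = findings_for_budget(budget, GLM_COST_PER_FINDING)
frontier_findings = findings_for_budget(budget, FRONTIER_COST_PER_FINDING)
ratio = FRONTIER_COST_PER_FINDING / GLM_COST_PER_FINDING

print(f"GLM 5.2: {glm_findings} findings, frontier: {frontier_findings} findings")
print(f"cost ratio: {ratio:.2f}x")
```

At these assumed prices the ratio comes out just under six, and it grows if the frontier cost is above $1.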
The model vs system distinction is blurring. What beats what depends heavily on the harness. Semgrep's multimodal pipeline with GPT 5.5 destroys everything else at 61%, but that's a product, not a raw model capability. As agentic tooling improves, the "which model wins" question becomes less important than "which system architecture wins."
Regulatory risk is emerging. The thread's discussion of potential export controls on Chinese AI models reflects growing tension between open-source AI development and national security concerns. Whether such controls would be effective (or even enforceable) is debatable, but the fact that people are discussing it signals a shift.
Semgrep's benchmark is narrow - one vulnerability type, one evaluation method - but the signal is clear: open-weight models have reached competitive parity on at least some security tasks, at a fraction of frontier model costs.
For security teams doing automated vulnerability scanning, the implication is worth exploring. GLM 5.2 through providers like Fireworks offers a cost-effective alternative that - on this specific task - outperforms Claude Code.
For the broader AI development community, it's another data point in the ongoing debate about open versus closed models. The capability gap that justified frontier model pricing is narrowing faster than some expected.