
TL;DR
New benchmark data shows GPT-5.5 hallucinates 86% of the time when it does not know the answer - versus 28% for the open-weights GLM-5.2. The numbers challenge the assumption that bigger models equal more reliable output.
A new benchmark analysis from Arrow TSX is making the rounds on Hacker News, and the headline number is stark: GPT-5.5 has an 86% hallucination rate on the AA-Omniscience benchmark, versus 28% for Z.ai's MIT-licensed GLM-5.2.
That is a 3x difference in how often these models make up answers when they do not actually know.
The original piece is titled "Bigger models are not the way," and it uses hallucination data as evidence for a broader argument: that scaling up parameter counts is hitting diminishing returns.
Here are the raw hallucination rates from the AA-Omniscience benchmark:
| Model | Hallucination Rate |
|---|---|
| GLM-5.2 | 28% |
| Claude Opus 4.8 | 36% |
| Fable 5 | 48% |
| GPT-5.5 | 86% |
| DeepSeek V4 Pro | 94% |
The methodology works like this: when a model does not know the answer to a question, it can either admit that or make something up. The hallucination rate measures how often it makes something up instead of saying "I don't know."
The researchers tested models with a specific Python asyncio question that contains an architectural impossibility - there is no correct answer because the premise is flawed. They used a "high" reasoning effort setting, temperature 1, and the same system prompt across all models. Models were served on OpenRouter with FP8 precision.
DeepSeek V4 Pro spent 3 minutes 52 seconds reasoning (7.7k tokens) and still produced a "confidently incorrect" answer. GLM-5.2 finished in 12 seconds (799 tokens) and correctly identified that the question described an impossible scenario.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 19, 2026 • 8 min read
Jun 19, 2026 • 8 min read
Jun 19, 2026 • 8 min read
Jun 19, 2026 • 8 min read
The discussion thread is active with 85+ comments, and the conversation is more nuanced than the headline.
Several commenters point out that hallucination rate is a conditional metric. It measures what happens when the model does not know - not the overall probability of encountering a hallucination in typical use. A model that knows more things would have fewer opportunities to hallucinate, even if its conditional hallucination rate is higher.
One commenter laid out the math: "If a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times."
Others argue this framing is too generous to the larger models. If you ask a question the model cannot answer, the only correct response is "I don't know." Making up something plausible is still wrong, regardless of whether the model "knew" the answer somewhere in its weights.
The "you're prompting it wrong" counterargument showed up: one commenter suggested that a well-crafted system prompt explicitly telling the model that "I don't know" is a valid answer could change these numbers significantly. But another pushed back: "You're prompting it wrong is quickly becoming the new 'you're holding it wrong.'"
The thread also picked up on the article's bigger claim about scaling. If larger models and larger training sets have stopped producing proportional improvements, the S-curve thesis has real implications for how these companies are valued. One commenter noted: "That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the absurd idea of ever increasing scaling from these models."
If you are building agents or workflows that rely on LLM outputs, these numbers translate directly into production reliability. An 86% hallucination rate when the model does not know means that roughly 9 out of 10 times you hit an edge case, you will get confident nonsense instead of a signal that you need to route elsewhere.
The practical takeaways:
Model selection is not just about raw capability. A model that is slightly less capable but much better calibrated - meaning it knows what it does not know - can be more useful in production than a model that is more capable but confidently wrong when it fails.
Open weights are winning on this metric. GLM-5.2's 28% rate is nearly half of Opus 4.8's 36% and a third of GPT-5.5's 86%. This is a meaningful differentiator when you can self-host or route to API providers serving open weights.
System prompt design matters more than we thought. If explicit instructions to admit uncertainty can move these numbers, then the agentic frameworks that bake in those instructions will have real reliability advantages.
The "bigger is better" assumption needs revisiting. The DeepSeek V4 Pro result is particularly striking: 1.6 trillion parameters, nearly 4 minutes of reasoning, and still a 94% hallucination rate. More compute did not help here.
For our model comparison content, see GLM-5.2 vs DeepSeek V4 vs Qwen3 and best AI coding tools for June 2026.
Read next
Auto-installing tree-sitter grammars, built-in markdown mode, window layout commands, and more - the upcoming Emacs release absorbs features that used to require external packages.
6 min readAlex Ellis shares real production experience running local LLMs: $12k hardware investment, 2-3 month ROI, and why treating local models as Opus substitutes misses the point entirely.
7 min readZ.ai's GLM-5.2 lands as a 753B open-weights coding model that beats GPT-5.5 on SWE-bench Pro for roughly one-sixth the per-token cost. Here is the real cost math, a worked cost-per-task example, and a when-to-use-which decision guide.
9 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Open-source reasoning models from China. DeepSeek-R1 rivals o1 on math and code benchmarks. V3 for general use. Fully op...
View ToolDeepSeek's reasoning-first model built for agents. First model to integrate thinking directly into tool use. Ships along...
View Tool
Dan Abramov's explainer on ATProto architecture is making the rounds. The core insight: Bluesky's protocol separates hos...

A deep dive into DuckDB's architecture - columnar storage, vectorized execution, and zero-copy design that lets it compe...

Most developers only know .gitignore, but Git offers two other ignore mechanisms for local workflows and machine-wide pa...

Java's most anticipated performance feature is finally landing. Value classes eliminate object identity overhead and ena...

MCP's new Enterprise-Managed Authorization removes per-user OAuth friction. Anthropic, Okta, Figma, and Linear ship cent...

A YC W25 startup open-sources CADAM, a browser-based tool that converts natural language to parametric OpenSCAD models....

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.