GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

Developers Digest•June 20, 2026•6 min read

News Hacker News LLMs GPT Benchmarks Open Weights

TL;DR

New benchmark data shows GPT-5.5 hallucinates 86% of the time when it does not know the answer - versus 28% for the open-weights GLM-5.2. The numbers challenge the assumption that bigger models equal more reliable output.

A new benchmark analysis from Arrow TSX is making the rounds on Hacker News, and the headline number is stark: GPT-5.5 has an 86% hallucination rate on the AA-Omniscience benchmark, versus 28% for Z.ai's MIT-licensed GLM-5.2.

That is a 3x difference in how often these models make up answers when they do not actually know.

What the Article Actually Says

The original piece is titled "Bigger models are not the way," and it uses hallucination data as evidence for a broader argument: that scaling up parameter counts is hitting diminishing returns.

Here are the raw hallucination rates from the AA-Omniscience benchmark:

Model	Hallucination Rate
GLM-5.2	28%
Claude Opus 4.8	36%
Fable 5	48%
GPT-5.5	86%
DeepSeek V4 Pro	94%

The methodology works like this: when a model does not know the answer to a question, it can either admit that or make something up. The hallucination rate measures how often it makes something up instead of saying "I don't know."

The researchers tested models with a specific Python asyncio question that contains an architectural impossibility - there is no correct answer because the premise is flawed. They used a "high" reasoning effort setting, temperature 1, and the same system prompt across all models. Models were served on OpenRouter with FP8 precision.

DeepSeek V4 Pro spent 3 minutes 52 seconds reasoning (7.7k tokens) and still produced a "confidently incorrect" answer. GLM-5.2 finished in 12 seconds (799 tokens) and correctly identified that the question described an impossible scenario.

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.

From the archive

DuckDB Internals: What Makes It So Fast

Jun 19, 2026 • 8 min read

GitHub Copilot Agent Finder: What ARD Means for Third-Party AI Tools in 2026

Jun 19, 2026 • 8 min read

MCP Goes Stateless: The 2026-07-28 Migration Guide

Jun 19, 2026 • 8 min read

Zero-Touch OAuth Is the MCP Feature Enterprises Were Waiting For

Jun 19, 2026 • 8 min read

What Hacker News Is Saying

The discussion thread is active with 85+ comments, and the conversation is more nuanced than the headline.

Several commenters point out that hallucination rate is a conditional metric. It measures what happens when the model does not know - not the overall probability of encountering a hallucination in typical use. A model that knows more things would have fewer opportunities to hallucinate, even if its conditional hallucination rate is higher.

One commenter laid out the math: "If a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times."

Others argue this framing is too generous to the larger models. If you ask a question the model cannot answer, the only correct response is "I don't know." Making up something plausible is still wrong, regardless of whether the model "knew" the answer somewhere in its weights.

The "you're prompting it wrong" counterargument showed up: one commenter suggested that a well-crafted system prompt explicitly telling the model that "I don't know" is a valid answer could change these numbers significantly. But another pushed back: "You're prompting it wrong is quickly becoming the new 'you're holding it wrong.'"

The thread also picked up on the article's bigger claim about scaling. If larger models and larger training sets have stopped producing proportional improvements, the S-curve thesis has real implications for how these companies are valued. One commenter noted: "That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the absurd idea of ever increasing scaling from these models."

Why This Matters for Developers

If you are building agents or workflows that rely on LLM outputs, these numbers translate directly into production reliability. An 86% hallucination rate when the model does not know means that roughly 9 out of 10 times you hit an edge case, you will get confident nonsense instead of a signal that you need to route elsewhere.

The practical takeaways:

Model selection is not just about raw capability. A model that is slightly less capable but much better calibrated - meaning it knows what it does not know - can be more useful in production than a model that is more capable but confidently wrong when it fails.

Open weights are winning on this metric. GLM-5.2's 28% rate is nearly half of Opus 4.8's 36% and a third of GPT-5.5's 86%. This is a meaningful differentiator when you can self-host or route to API providers serving open weights.

System prompt design matters more than we thought. If explicit instructions to admit uncertainty can move these numbers, then the agentic frameworks that bake in those instructions will have real reliability advantages.

The "bigger is better" assumption needs revisiting. The DeepSeek V4 Pro result is particularly striking: 1.6 trillion parameters, nearly 4 minutes of reasoning, and still a 94% hallucination rate. More compute did not help here.

For our model comparison content, see GLM-5.2 vs DeepSeek V4 vs Qwen3 and best AI coding tools for June 2026.

Sources

Original article: "Bigger models are not the way"
Hacker News discussion
Artificial Analysis Intelligence Index (benchmark source referenced in article)

Emacs 31 is Around the Corner: The Features Worth Daily Driving

Auto-installing tree-sitter grammars, built-in markdown mode, window layout commands, and more - the upcoming Emacs release absorbs features that used to require external packages.

6 min read

Local Qwen Is a Different Tool, Not a Worse Opus

Alex Ellis shares real production experience running local LLMs: $12k hardware investment, 2-3 month ROI, and why treating local models as Opus substitutes misses the point entirely.

7 min read

GLM-5.2 Cost Math: When Open-Weights Coding Models Actually Save You Money

Z.ai's GLM-5.2 lands as a 753B open-weights coding model that beats GPT-5.5 on SWE-bench Pro for roughly one-sixth the per-token cost. Here is the real cost math, a worked cost-per-task example, and a when-to-use-which decision guide.

9 min read

Share

Suggest an editSave

Discuss this article on Twitter/X

Developers Digest

Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.

300+ videos30K+ GitHub stars50+ articles

Subscribe YouTube GitHub Twitter/X

Related Tools

AI Models

DeepSeek

Open-source reasoning models from China. DeepSeek-R1 rivals o1 on math and code benchmarks. V3 for general use. Fully op...

View Tool

AI Models

DeepSeek V3.2

DeepSeek's reasoning-first model built for agents. First model to integrate thinking directly into tool use. Ships along...

View Tool

Related Guides

Guide

Fast Mode - Claude Code

2.5x faster Opus at a higher token cost (research preview).

Claude Code