
TL;DR
A new paper shows a 3B parameter model hitting 94.3% on AIME 2026 and a 96.1% acceptance rate on LeetCode contests - matching or exceeding models 100x its size. The catch: it traded general knowledge for pure reasoning ability.
A paper titled "VibeThinker: Spectrum-to-Signal Reasoning in Compact Language Models" is circulating on Hacker News with a headline claim: a 3 billion parameter model achieves 94.3% on AIME 2026 (97.1% with test-time scaling), 80.2% Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on unseen LeetCode contests.
Those numbers match or exceed flagship models that are 100x larger. The arXiv paper explains how - and the tradeoffs involved.
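For readers unfamiliar with the metric: Pass@1 is commonly computed with the unbiased pass@k estimator used in code-generation evaluations. You draw n samples per problem, count the c samples that pass every test, and estimate the chance that at least one of k samples passes. The sketch below assumes that standard estimator; this article does not say how the paper computed its figure, and the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 samples for one problem, 8 of them correct.
# For k=1 the estimator reduces to c / n.
print(pass_at_k(n=10, c=8, k=1))
```

A benchmark score is then the mean of this value over all problems.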
VibeThinker-3B is a "compact dense model" designed specifically for verifiable reasoning tasks. The authors use what they call the "Spectrum-to-Signal post-training paradigm" with three components:
The key theoretical contribution is the "Parametric Compression-Coverage Hypothesis": verifiable reasoning can be compressed into compact models, while open-domain knowledge and general-purpose competence require broad parameter coverage.
Translation: you can make a small model very good at math and code if you are willing to sacrifice its ability to answer general knowledge questions.
The Hacker News discussion has 99 comments, and the community is split on what this result means.
The clarifiers: One highly-upvoted comment cuts through the confusion: "Lots of confusion about what this model is actually focused on. It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar. 'Closed-world' means the needed information is already in the context. It is not a tool-using agent that can discover missing context."
The skeptics: Several commenters tried standard LLM tests and found the model lacking. One attempted the "pelican on a bicycle SVG" test and got "a rectangle and a black circle." Others pointed out this is expected - the model was not trained on SVG generation or visual reasoning.
The practical testers: One commenter reported success using VibeThinker-3B as a replacement for GPT-5 nano in source code security review, running on an RTX 3090 via vLLM. "It's not great on structured output but I'm working around that in my harness."
The math crowd: Someone tested it on a nasty ODE problem from the Mathematica 15 release notes and "surprisingly it found a valid solution," with the model running at 25 tok/s on an RTX 2070 Super. Another commenter reproduced the result with the Q4_K_M quantized version at 110 tok/s.
The tool calling concern: Multiple comments noted the model does not support tool calling. "Was not part of its training. It's focused on Python and I think C++ competitive programming and mathematics tasks." This limits its usefulness as an autonomous agent.
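The thread does not describe the structured-output workaround the security-review commenter used. One plausible harness-side approach, sketched here purely as an assumption, is to let the model answer in free text and then pull the last well-formed JSON object out of the response. The function name and the sample response are hypothetical.

```python
import json

def extract_last_json_object(text: str):
    """Return the last parseable JSON object in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    found = None
    i = 0
    while True:
        i = text.find("{", i)
        if i == -1:
            return found
        try:
            # raw_decode parses one JSON value starting at index i
            # and returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1          # not valid JSON here, try the next brace
            continue
        found = obj
        i = end             # skip past the object we just parsed

# Hypothetical model response: reasoning text followed by a JSON verdict.
response = (
    "Reasoning: the query is built by string concatenation {not json}.\n"
    'Final: {"finding": "sql_injection", "line": 42}'
)
print(extract_last_json_object(response))
```

Taking the last object rather than the first is a deliberate choice here: a reasoning model tends to emit drafts before its final answer.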
VibeThinker-3B is small enough to run on consumer hardware. The quantized versions are available on Hugging Face.
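Hardware reports from the thread include an RTX 3090 serving the model via vLLM and an RTX 2070 Super generating 25 tok/s, with the Q4_K_M quantized version reported at 110 tok/s.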
The model uses extended thinking by default. One test took 3 minutes 22 seconds to answer a math problem, spending 22k tokens on reasoning before producing the final answer.
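The throughput figures quoted in the thread give a rough sense of how long such a reasoning trace takes. This is a back-of-the-envelope sketch that assumes latency is dominated by token generation at a steady rate; prompt processing and model load time are ignored, and the helper names are illustrative.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second

def fmt(seconds: float) -> str:
    """Format a duration in seconds as 'Xm YYs'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"

reasoning_tokens = 22_000   # reasoning budget from the reported test

for rate in (25, 110):      # tok/s figures reported in the thread
    print(f"{rate} tok/s -> {fmt(generation_seconds(reasoning_tokens, rate))}")
# 25 tok/s -> 14m 40s
# 110 tok/s -> 3m 20s
```

The 110 tok/s estimate is consistent with the 3 minutes 22 seconds reported, while the same trace at 25 tok/s would take close to 15 minutes. Throughput matters more than usual for a model that thinks this long.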
A recurring theme in the comments: is this a reasoning module or a standalone model?
One commenter frames it well: "These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves."
The lack of tool calling support reinforces this. VibeThinker-3B cannot search the web, cannot look up documentation, cannot call external APIs. It can only reason over what is already in its context window.
For a multi-agent architecture, that might be exactly what you want - a cheap, fast reasoning specialist that handles the math and code verification while a larger model handles orchestration, context gathering, and general knowledge tasks.
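A minimal sketch of that division of labor, assuming a simple rule-based router in front of two backends. The `Task` fields, the backend labels, and the routing rule are all hypothetical; a real system would call actual model endpoints where this sketch only returns a label.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    self_contained: bool       # everything needed is already in the prompt
    verifiable: bool           # the answer can be checked (tests, exact value)
    needs_tools: bool = False  # web search, file access, API calls

def route(task: Task) -> str:
    """Pick a backend label for a task (illustrative rule, not from the paper)."""
    if task.needs_tools or not task.self_contained:
        return "orchestrator"          # larger model: tools and context gathering
    if task.verifiable:
        return "reasoning-specialist"  # small local model: closed-world math/code
    return "orchestrator"              # open-ended work stays with the generalist

tasks = [
    Task("Solve this recurrence ...", self_contained=True, verifiable=True),
    Task("Find the upstream bug report for this error", self_contained=False,
         verifiable=False, needs_tools=True),
    Task("Summarize the history of the Hanseatic League", self_contained=True,
         verifiable=False),
]
for t in tasks:
    print(route(t), "<-", t.prompt)
```

The rule is deliberately conservative: anything that is not both self-contained and checkable falls back to the larger model.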
Based on the paper and HN discussion, VibeThinker-3B makes sense for:
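Competitive programming problems - The paper reports a 96.1% acceptance rate on unseen LeetCode contests, and commenters report throughput of up to 110 tok/s on consumer GPUs, though long reasoning traces can still take minutes per problem.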
Math verification - AIME-level competition math, equation solving, and formal verification tasks where the problem is self-contained.
Code security review - As one commenter demonstrated, it works for single-file security analysis where you do not need external context.
Local inference budgets - At 3B parameters, it runs on laptop GPUs. You can have a capable reasoning model without API costs.
It does not make sense for:
General assistant tasks - It traded general knowledge for reasoning capability. Do not ask it about history, current events, or "what is a pelican."
Agentic workflows - No tool calling means no web search, no file system access, no API calls from the model itself.
Multi-file code understanding - It is optimized for self-contained problems. Repository-scale reasoning is not its strength.
VibeThinker-3B is evidence that the "parameter count equals capability" assumption has caveats. A 3B model can outperform 300B+ models on specific benchmarks - if you accept dramatic tradeoffs in generality.
The Parametric Compression-Coverage Hypothesis suggests this is not a fluke. Verifiable reasoning - tasks where you can check the answer programmatically - may compress better than open-ended knowledge. If true, we should expect more specialized small models that excel in narrow domains while failing badly outside them.
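To make "check the answer programmatically" concrete, here is a small sketch of a verifier for a self-contained coding task: a candidate solution is accepted only if it returns the expected value for every test case. The task, the test cases, and both candidate functions are invented for illustration and are not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

def verify(candidate: Callable[[int], int],
           cases: Iterable[Tuple[int, int]]) -> bool:
    """True only if the candidate returns the expected value for every case."""
    for arg, expected in cases:
        try:
            if candidate(arg) != expected:
                return False
        except Exception:
            return False    # a crash counts as a failure
    return True

# Hypothetical task: sum of the integers 1..n.
CASES = [(1, 1), (4, 10), (100, 5050)]

def good(n: int) -> int:
    return n * (n + 1) // 2

def bad(n: int) -> int:
    return n * n // 2       # wrong on purpose

print(verify(good, CASES), verify(bad, CASES))  # True False
```

Open-ended knowledge questions rarely come with a check this cheap and unambiguous, which is the distinction the hypothesis leans on.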
The practical implication for developers: model selection is becoming task-specific. The best model for coding may differ from the best model for research, which may differ again from the best model for writing. VibeThinker-3B is a data point in that direction.
On AIME 2026 math competition problems, VibeThinker-3B does outperform much larger models: 94.3% against lower reported scores. On general tasks it does not, because it traded general knowledge for specialized reasoning capability.
For self-contained coding problems (competitive programming, algorithm implementation, code review), it is a good fit. For repository-scale development that depends on tool calling and context management, it is not, because it lacks tool calling support.
At 3B parameters, it competes with Phi-3, Qwen2.5 3B, and similar models. On reasoning benchmarks it dramatically outperforms them. On general knowledge and instruction following, expect comparable or worse performance.
One commenter reported the Q4_K_M version running at 110 tok/s and reproducing the valid solution to the ODE problem. On that evidence, quantization appears to preserve reasoning capability well, although one problem is a small sample.
For batch processing of self-contained math or code problems where you want to avoid API costs, running it locally is potentially worthwhile. For general development work with tool calling and broad context needs, API models remain more capable.