Mercury 2: The LLM That Doesn't Generate Like an LLM

Every LLM you use today is a typewriter. One token at a time, left to right, each keystroke permanent. If the reasoning drifts early, tough luck. It can only move forward.

Mercury 2 is an editor. It starts with a rough draft and sharpens the whole thing with each pass. And it does this at over 1,000 tokens per second.

Inception Labs just shipped the first reasoning model built on diffusion instead of autoregressive generation. The same fundamental approach that already won in image and video generation, now applied to language. And the results are real.

The Speed Problem Nobody Actually Solved#

Remember when Groq hit the scene? Raw inference speed got everyone excited. But the models that could run that fast were limited. They couldn't do tool calling well. They struggled with complex reasoning. Lower benchmark scores across the board. Speed at a real cost.

For model-selection context, compare this with Claude vs GPT for Coding: Which Model Writes Better TypeScript? and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; the useful question is not only benchmark quality, but where the model fits in a real developer workflow.

The entire industry has been racing to solve this since. OpenAI, NVIDIA, Fireworks, Baseten. Billions spent on better hardware, better kernels, quantization, distillation. Real gains, but all incremental. Everyone squeezing more out of the same autoregressive paradigm.

Mercury 2 took a different path. The speed comes from the model itself, not infrastructure optimization.

Diffusion vs autoregressive generation: typewriter versus editor

How Diffusion LLMs Actually Work#

Autoregressive generation: token one locks before token two begins. Sequential. Permanent. If you make a mistake early, it cascades through everything that follows.

Diffusion generation: start with noise, iteratively refine the entire output in parallel. Multiple tokens per forward pass. Built-in error correction because the model revisits and refines as it goes.

This is actually closer to how humans think. You don't reason word by word. You hold the whole idea, draft, revise, reconsider, then commit. CMU researchers found in September 2025 that diffusion models are "significantly more robust to data repetition" than autoregressive models, especially in data-constrained settings. The academic community is taking this architecture seriously: the LLaDA paper introduced diffusion as a viable alternative to autoregressive text generation and has been gaining traction.

The throughput numbers tell the story:

Model	Output Throughput
Mercury 2	1,009 tok/s
Claude Haiku 4.5	~88 tok/s
GPT-5 mini	~91 tok/s

That's over 10x throughput. On reasoning tasks specifically, 5x faster than speed-optimized autoregressive models. The Mercury 2 figure is from Inception's official announcement, measured on NVIDIA Blackwell GPUs. The comparison numbers are current Artificial Analysis measurements, where Mercury 2 ranks first for output speed across the models they track.

Quality Didn't Get Sacrificed#

Speed without quality is just fast garbage. Mercury 2 holds up:

Benchmark	Mercury 2	GPT-5 mini
AIME 2025	91.1	91.1
GPQA	73.6	Competitive
LiveCodeBench	67.3	Competitive
IFBench	71.3	--
SciCode	38.4	--

Important context: these comparisons are against speed-optimized models, not frontier models. Mercury 2 plays in the speed + reasoning lane. It's not trying to beat Opus on raw intelligence. It's trying to give you reasoning-grade quality at speeds that unlock entirely new application patterns.

Worth noting: Mercury v1 (early 2025) had real limitations. ACI.dev's beta review flagged hallucination issues and a 16K context ceiling. Mercury 2 is a significant leap: 128K context, native tool use, and tunable reasoning. The gap between v1 and v2 is large enough that early criticism doesn't map cleanly to the current model.

Mercury 2 benchmark comparison showing throughput advantage

Newsletter

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools, delivered free every week.

From the archive

Claude Code Worktrees: Parallel Development Without the Chaos

Feb 21, 2026 • 6 min read

Claude Sonnet 4.6: Approaching Opus at Half the Cost

Feb 19, 2026 • 6 min read

Claude Opus 4.6: Anthropic's Smartest Model Gets Agent Teams

Feb 9, 2026 • 8 min read

Why Claude Code Won: Unix Philosophy Meets AI Agents

Jan 19, 2026 • 10 min read

Where 1,000 tok/s Actually Matters#

Three use cases where this speed changes what you can build:

Agent Loops#

Latency compounds across multi-step workflows. Every tool call, every reasoning step adds wait time. In a demo app built for the video, Mercury 2 ran search, scrape, and summarize before most models would finish their first response. Code agents, browser automation, IT triage: more steps, tighter feedback cycles. Skyvern is already using it in production and reports Mercury 2 is "at least twice as fast as GPT-5.2."

Voice and Real-Time#

p95 latency determines if a voice interface feels natural or robotic. Support agents, voice bots, real-time translation. When you need reasoning inside tight SLAs, speed isn't a nice-to-have. Companies like Wispr Flow (real-time transcript cleanup), OpenCall (voice agents), and Happyverse AI (real-time voice/video avatars) are already shipping with Mercury under the hood.

Coding Workflows#

The prompt-review-tweak loop. Rapid succession iteration. The faster the model responds, the more you stay in flow. Zed, the code editor, integrated Mercury and described it as "suggestions land fast enough to feel like part of your own thinking." JetBrains published research arguing diffusion models better match how developers actually work because they edit and refine rather than writing left-to-right.

Drop-In Compatible#

Mercury 2 is OpenAI API compatible. Swap the base URL, model string, and API key. Works with any framework that supports OpenAI's format.

128K context window
Tool use, structured outputs, RAG
Reasoning effort dial: instant, low, medium, high
$0.25/M input tokens, $0.75/M output tokens

That pricing makes it one of the most cost-competitive reasoning models available. For high-volume agent workloads where you're making hundreds of calls per session, the economics are compelling.

Who Built This#

Inception Labs isn't a random startup. CEO Stefano Ermon is a Stanford CS associate professor who co-authored DDIM (the denoising method powering Stable Diffusion and Midjourney). His co-founders Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell) are both former students. The team includes veterans from DeepMind, Meta, OpenAI, Microsoft, and HashiCorp.

Backed by a $50M round led by Menlo Ventures, with M12 (Microsoft), NVentures (NVIDIA), Snowflake Ventures, and Databricks participating. Individual investors include Andrew Ng and Andrej Karpathy. Fortune 100 companies (unnamed) are already running Mercury in production. Available on Azure AI Foundry.

The people who proved diffusion works for pixels are now proving it works for tokens.

The Bigger Question#

Whether diffusion becomes the future of how all LLMs work is an open question. But the trajectory is clear. Autoregressive generation has a fundamental speed ceiling that no amount of hardware can fully overcome. Diffusion solves that at the model level.

Mercury 2 is the proof point. Fast enough to change what you can build. Cheap enough to actually use at scale. And backed by the people who literally wrote the math.

Try it yourself:

API Platform - start building
Playground - test it live

This article is based on a Developers Digest video sponsored by Inception Labs. All technical claims are sourced from third-party benchmarks and direct testing.

Official Sources#

Resource	Description
Inception Labs Homepage	Company overview and product information
Mercury 2 Announcement	Official launch blog post with technical details
Mercury API Platform	Developer API access and documentation
Mercury Chat Playground	Interactive testing interface
Azure AI Foundry	Enterprise deployment via Microsoft Azure
LLaDA Paper (arXiv)	Large Language Diffusion Models - foundational research
Mercury Technical Report (arXiv)	Mercury: Ultra-Fast Language Models Based on Diffusion
Artificial Analysis: Mercury 2	Independent throughput and price measurements
CMU Diffusion Research	Academic study on diffusion vs autoregressive robustness

FAQ#

What is Mercury 2 and how is it different from other LLMs?#

Mercury 2 is the first commercial reasoning model built on diffusion generation instead of autoregressive generation. While traditional LLMs generate text one token at a time (like a typewriter), Mercury 2 starts with noise and refines the entire output in parallel across multiple passes (like an editor revising a draft). This fundamental architectural difference enables over 1,000 tokens per second throughput - more than 10x faster than models like Claude Haiku or GPT-5 mini.

How fast is Mercury 2 compared to other models?#

Mercury 2 achieves approximately 1,009 tokens per second output throughput, per Inception's announcement on NVIDIA Blackwell GPUs. For comparison, Artificial Analysis currently measures Claude Haiku 4.5 around 88 tok/s and GPT-5 mini around 91 tok/s. On reasoning tasks specifically, Mercury 2 is 5x faster than speed-optimized autoregressive models. This speed comes from the model architecture itself, not just infrastructure optimization.

Does Mercury 2 sacrifice quality for speed?#

No. Mercury 2 maintains competitive benchmark scores: 91.1 on AIME 2025 (matching GPT-5 mini), 73.6 on GPQA, 67.3 on LiveCodeBench, and 71.3 on IFBench. It's designed for the speed + reasoning lane rather than competing with frontier models on raw intelligence, offering reasoning-grade quality at speeds that enable new application patterns.

What is the context window and pricing for Mercury 2?#

Mercury 2 supports a 128K context window (up from 16K in v1), native tool use, structured outputs, and RAG. Pricing is $0.25 per million input tokens and $0.75 per million output tokens, making it one of the most cost-competitive reasoning models available for high-volume agent workloads.

What use cases benefit most from Mercury 2's speed?#

Three primary use cases: (1) Agent loops where latency compounds across multi-step workflows - companies like Skyvern report Mercury 2 is "at least twice as fast as GPT-5.2" for browser automation; (2) Voice and real-time applications where p95 latency determines natural feel - used by Wispr Flow, OpenCall, and Happyverse AI; (3) Coding workflows where rapid prompt-review-tweak iteration keeps developers in flow - integrated by Zed editor.

Is Mercury 2 compatible with existing AI frameworks?#

Yes. Mercury 2 is OpenAI API compatible. You can swap the base URL, model string, and API key to use it with any framework that supports OpenAI's format. It includes a reasoning effort dial (instant, low, medium, high) for tuning speed vs depth tradeoffs.

Who built Mercury 2 and who is backing it?#

Inception Labs was founded by Stanford CS associate professor Stefano Ermon (co-author of DDIM, the denoising method powering Stable Diffusion and Midjourney) along with Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell). The team includes veterans from DeepMind, Meta, OpenAI, Microsoft, and HashiCorp. Backed by $50M from Menlo Ventures, M12 (Microsoft), NVentures (NVIDIA), Snowflake Ventures, and Databricks, with individual investors including Andrew Ng and Andrej Karpathy.

What improved from Mercury v1 to Mercury 2?#

Mercury v1 (early 2025) had limitations including hallucination issues and a 16K context ceiling. Mercury 2 represents a significant leap: 128K context window, native tool use, tunable reasoning effort, and improved benchmark scores. Early v1 criticism (such as ACI.dev's beta review) doesn't map cleanly to the current model's capabilities.

Watch the Video#