TL;DR
Inception Labs launched Mercury, the first commercial-grade diffusion large language model. It generates over 1,000 tokens per second on standard Nvidia hardware by replacing autoregressive generation with a coarse-to-fine diffusion process.
Every large language model you have used works the same way. GPT, Claude, Gemini, Llama, DeepSeek - they are all autoregressive. They generate text one token at a time, left to right, sequentially. Each token requires a full forward pass through billions of parameters, and the next token cannot be generated until the previous one exists. This is why even the fastest LLMs feel slow on long outputs.
Inception Labs built Mercury to challenge that assumption. Mercury is a diffusion large language model. Instead of generating tokens sequentially, it produces the entire response at once and refines it over multiple iterations, starting from noise and progressively sharpening the output until it reaches a coherent answer.
If you have seen how image generation works with Stable Diffusion or Midjourney, the concept is identical. Those models start with random noise and denoise it step by step until a clear image appears. Mercury applies the same principle to text. The first iteration is nearly unreadable. Each subsequent pass cleans it up, adjusts word choices, fixes structure, and tightens the response until it reads naturally.
The numbers tell the story. At launch, Mercury Coder Small ran at approximately 750 tokens per second. Mercury Coder Mini exceeded 1,000 tokens per second. Compare that to GPT-4o Mini at roughly 60 to 70 tokens per second, or Claude 3.5 Haiku at a similar speed.
That is not a small improvement. Mercury was generating text 10 to 15 times faster than the mainstream alternatives.
In a direct comparison shown during the announcement, a code generation task took ChatGPT 36 seconds to complete and Claude 28 seconds. Mercury finished the same task in 6 seconds. The speed difference is visible and dramatic.
The critical detail is that Mercury achieves these speeds on commodity Nvidia H100 hardware. You do not need specialized inference chips. Previously, the only way to get token generation speeds in this range was through purpose-built hardware from companies like Groq, Cerebras, or SambaNova. Mercury's approach is purely algorithmic. The speedup comes from the architecture, not the silicon.
Inception Labs also noted that their improvements are orthogonal to hardware acceleration, meaning the speedups would compound on faster chips. Running Mercury on Nvidia Blackwell GPUs, for instance, would push the numbers even higher.
The autoregressive approach has a fundamental constraint: each token depends on every token before it. This makes generation inherently sequential. You cannot parallelize the core generation loop because token N requires tokens 1 through N-1 to exist first.
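That sequential dependency can be sketched in a few lines of toy Python. The `model` here is a hypothetical stand-in for a full forward pass over the sequence so far; the point is the data dependency, not the model itself.

```python
def generate_autoregressive(model, prompt_tokens, max_new_tokens):
    """Toy sketch of the loop every autoregressive LLM runs."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Token N needs tokens 1..N-1 to exist first: one full forward
        # pass per token, and the loop cannot be parallelized across
        # output positions.
        next_token = model(tokens)
        tokens.append(next_token)
    return tokens

# Stand-in "model": predicts the length of the sequence so far.
print(generate_autoregressive(len, [1, 2, 3], 4))  # -> [1, 2, 3, 3, 4, 5, 6]
```

However cheap each call to `model` is, the four new tokens still cost four strictly ordered calls. That ordering is what diffusion removes.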
Diffusion models break this constraint. The process works in three phases:
Initialization - Start with a noisy representation of the entire output. Think of it as a garbled version of the final answer where every position has some text, but most of it is wrong.
Iterative Refinement - A transformer model evaluates the entire noisy output and suggests improvements. Because it looks at the whole sequence simultaneously, it can modify multiple tokens in parallel. Each denoising step makes the output cleaner and more coherent.
Convergence - After enough iterations, the output stabilizes into a clear, natural-language response.
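The three phases can be mimicked with a deliberately simplified toy. This is an illustration of the coarse-to-fine idea only, not Mercury's actual denoiser: the "refinement" here just reveals the target on a fixed schedule, but it shows the key structural property, that each pass edits positions scattered across the whole sequence in parallel rather than appending left to right.

```python
def diffusion_generate(target, steps):
    """Toy coarse-to-fine generation (illustration, not a real denoiser)."""
    seq = ["#"] * len(target)          # 1. Initialization: all noise
    for t in range(steps):             # 2. Iterative refinement
        # Each pass "denoises" positions spread across the WHOLE
        # sequence at once -- not left to right.
        for i in range(len(target)):
            if i % steps == t:
                seq[i] = target[i]
        print("".join(seq))
    return "".join(seq)                # 3. Convergence

diffusion_generate("hello world", 3)
```

Each printed intermediate is partially garbled everywhere at once, then sharpens globally, which is exactly what the diffusion animation in Mercury's web interface shows.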
Because the model is not restricted to only considering previous output, Inception Labs argues it has structural advantages for reasoning and response organization. It can see the full context of its own output at every step and adjust any part of it. Autoregressive models commit to each token permanently as they generate it. If token 50 makes token 10 look wrong in retrospect, there is no going back.
Diffusion models can also continually refine their output, which gives them a mechanism for self-correction. If an early iteration introduces a hallucination, later iterations can catch and fix it. This is not guaranteed, but the architecture at least makes it possible.
Mercury's first release was not targeting frontier model performance. The comparison set was mid-tier: GPT-4o Mini, Claude 3.5 Haiku, Gemini 2.0 Flash Lite, Qwen, and DeepSeek. Against this group, Mercury held its own.
On HumanEval, the standard code generation benchmark, Mercury scored 88 to 90 percent depending on the variant. These are strong results for a first-generation model using a fundamentally new architecture. It was not outperforming Claude 3.5 Sonnet or the full GPT-4o, but it was competitive with the lightweight models from every major lab.
In the Copilot Arena, where real developers evaluated code generation quality in blind tests, Mercury ranked number one for speed and number two for quality. Developers preferred its output over the alternatives when judged without knowing which model produced it.
The benchmark story is one of potential rather than dominance. If the first commercial diffusion LLM matches the quality of established lightweight models while running 10x faster, the trajectory for future versions becomes very interesting.
The practical implications for application development are significant:
Real-time applications become feasible. At 1,000+ tokens per second, you can generate substantial responses in real time without users noticing any lag. Chat interfaces, code completion, inline suggestions - these all benefit from lower latency.
Inference costs drop. Faster generation on the same hardware means lower cost per token. For high-volume applications where you are processing thousands of requests per minute, the economics shift substantially.
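The latency claim is easy to sanity-check with back-of-envelope math, using the tokens-per-second figures quoted earlier and an assumed 500-token response:

```python
response_tokens = 500  # assumed response length

# Launch-era throughput figures quoted above (approximate).
for name, tokens_per_sec in [("Mercury Coder Mini", 1000), ("GPT-4o Mini", 65)]:
    latency = response_tokens / tokens_per_sec
    print(f"{name}: {latency:.2f} s for {response_tokens} tokens")
```

Half a second versus roughly eight seconds for the same response is the difference between an interface that feels instant and one where the user watches text stream in.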
Standard hardware works. You do not need to negotiate access to specialized inference chips or lock into a single hardware vendor. H100s are widely available from every major cloud provider.
Tool use and agentic workflows are supported. Mercury was not limited to simple text generation. The launch materials confirmed support for RAG, tool calling, and agentic workflows, the building blocks of modern AI applications.
One important caveat from the launch: the speed benchmarks were measured on controlled hardware with controlled load. In production, maintaining those speeds under real traffic is a different challenge.
Claude 3.5 Haiku and GPT-4o Mini run at 60 to 70 tokens per second in production, but those are endpoints handling enormous concurrent demand. The speed is bottlenecked not just by the model but by the infrastructure serving thousands of simultaneous requests.
Whether Inception Labs could maintain 1,000+ tokens per second while scaling to enterprise-level demand was an open question at launch. The algorithmic speedup is real, but production inference involves load balancing, batching, queuing, and hardware utilization tradeoffs that do not show up in single-user benchmarks.
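A crude way to see why single-user benchmarks overstate production speed: if a fixed hardware budget is split naively across concurrent requests, per-user throughput collapses. This toy model deliberately ignores the batching efficiencies real serving stacks recover; closing that gap is precisely the engineering challenge described above.

```python
benchmark_tps = 1000  # assumed single-user throughput from the launch benchmarks

# Naive model: throughput divided evenly across concurrent requests.
# Real inference servers batch requests and recover much of this loss.
for concurrent in (1, 10, 100):
    per_user = benchmark_tps / concurrent
    print(f"{concurrent:>3} concurrent users -> {per_user:.0f} tok/s each")
```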
Inception Labs made a compelling case for why diffusion is the right paradigm shift for language models. Their core argument: frontier LLM companies are betting on test-time compute to increase reasoning capabilities, but generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. Diffusion offers an alternative path.
The precedent is clear. Diffusion already powers the most successful AI applications for images (Stable Diffusion, Midjourney), video (Sora), and audio (Riffusion). These are all domains where the coarse-to-fine refinement process produces better results than sequential generation. The question was always whether the same approach could work for discrete data like text and code. Mercury demonstrated that it can.
At launch, Inception Labs provided a web interface at chat.inceptionlabs.ai where you could interact with Mercury directly. The interface included an option to enable the diffusion animation, showing the coarse-to-fine text generation process in real time.
Watching the animation is genuinely striking. Text appears as garbled noise across the entire response, then sharpens with each iteration until it reads naturally. It is a visual demonstration of how fundamentally different the generation process is from the token-by-token output you see with autoregressive models.
For code generation tasks, the speed is immediately apparent. A JavaScript animation request that would take 30 seconds with ChatGPT appears in full within a few seconds. The output quality was competitive with the smaller models from OpenAI and Anthropic, making Mercury a viable option for applications where response time is the primary concern.
Mercury's launch raised questions that extend beyond one startup's product. If diffusion works for text generation, it suggests that the entire field of language modeling has been constrained by the autoregressive assumption. Every major LLM - GPT, Claude, Gemini, Llama, DeepSeek - generates text sequentially. Mercury demonstrated that this constraint is not fundamental. It is an architectural choice, and alternative choices exist.
The self-correction property of diffusion is particularly interesting for coding applications. Autoregressive models commit to each line of code as they generate it. If line 50 creates a bug that is only apparent in the context of line 100, the model cannot go back and fix it. A diffusion model can, because every iteration has access to the full output and can modify any part of it.
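A minimal sketch of the difference, using toy numbers rather than real code generation: suppose position 0 must equal the sum of the rest of the sequence. An autoregressive generator commits to position 0 before the rest exists; a refinement pass that sees the full output can go back and fix it.

```python
def refine(seq):
    """One whole-sequence refinement pass: every position can change."""
    fixed = list(seq)
    # Revisit an EARLY position using LATE context -- the move an
    # autoregressive decoder can never make.
    fixed[0] = sum(seq[1:])
    return fixed

draft = [0, 3, 4, 5]   # position 0 was committed before 3, 4, 5 existed
print(refine(draft))   # -> [12, 3, 4, 5]
```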
This does not mean diffusion models are automatically better at coding. The quality depends on training data, model size, and the denoising architecture. But the theoretical advantage of full-output visibility during generation is real, and future models may exploit it more effectively.
Mercury's launch in February 2025 proved the concept. A diffusion LLM could match the quality of established autoregressive models at dramatically higher speeds. The model was not frontier-class in terms of raw capability, but it did not need to be. The architecture was the breakthrough.
The implications extend beyond a single company. If diffusion-based text generation works at commercial scale, it opens a new dimension of competition in the LLM market. Speed, quality, and cost have always been the three axes. Autoregressive models optimize along quality and cost. Diffusion models add a massive speed advantage without proportional quality loss.
The follow-up, Mercury 2, would push the concept further by adding reasoning capabilities to the diffusion architecture. But the original Mercury launch was the moment that proved diffusion language models were not just a research curiosity. They were a viable, commercial-grade alternative to the autoregressive paradigm that had dominated the field since GPT-2.
For developers building real-time AI applications, this was one of the most important architectural developments of early 2025. The question shifted from "how do we make autoregressive models faster?" to "do we need autoregressive models at all?"