TL;DR
Inception Labs launched Mercury, the first commercial-grade diffusion large language model. It generates over 1,000 tokens per second on standard Nvidia hardware by replacing autoregressive generation with a coarse-to-fine diffusion process.
Every large language model you have used works the same way. GPT, Claude, Gemini, Llama, DeepSeek - they are all autoregressive. They generate text one token at a time, left to right, sequentially. Each token requires a full forward pass through billions of parameters, and the next token cannot be generated until the previous one exists. This is why even the fastest LLMs feel slow on long outputs.
Inception Labs built Mercury to challenge that assumption. Mercury is a diffusion large language model. Instead of generating tokens sequentially, it produces the entire response at once and refines it over multiple iterations, starting from noise and progressively sharpening the output until it reaches a coherent answer.
If you have seen how image generation works with Stable Diffusion or Midjourney, the concept is identical. Those models start with random noise and denoise it step by step until a clear image appears. Mercury applies the same principle to text. The first iteration is nearly unreadable. Each subsequent pass cleans it up, adjusts word choices, fixes structure, and tightens the response until it reads naturally.
The numbers tell the story. At launch, Mercury Coder Small ran at approximately 750 tokens per second. Mercury Coder Mini exceeded 1,000 tokens per second. Compare that to GPT-4o Mini at roughly 60 to 70 tokens per second, or Claude 3.5 Haiku at a similar speed.
That is not a small improvement. Mercury was generating text 10 to 15 times faster than the mainstream alternatives.
In a direct comparison shown during the announcement, a code generation task took ChatGPT 36 seconds to complete and Claude 28 seconds. Mercury finished the same task in 6 seconds. The speed difference is visible and dramatic.
The critical detail is that Mercury achieves these speeds on commodity Nvidia H100 hardware. You do not need specialized inference chips. Previously, the only way to get token generation speeds in this range was through purpose-built hardware from companies like Groq, Cerebras, or SambaNova. Mercury's approach is purely algorithmic. The speedup comes from the architecture, not the silicon.
Inception Labs also noted that their improvements are orthogonal to hardware acceleration, meaning the speedups would compound on faster chips. Running Mercury on Nvidia Blackwell GPUs, for instance, would push the numbers even higher.
The autoregressive approach has a fundamental constraint: each token depends on every token before it. This makes generation inherently sequential. You cannot parallelize the core generation loop because token N requires tokens 1 through N-1 to exist first.
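That sequential dependency can be sketched in a few lines of toy Python. The `model` here is a hypothetical stand-in for a full forward pass over the sequence so far; the point is the data dependency, not the model itself.

```python
def generate_autoregressive(model, prompt_tokens, max_new_tokens):
    """Toy sketch of the loop every autoregressive LLM runs."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Token N needs tokens 1..N-1 to exist first: one full forward
        # pass per token, and the loop cannot be parallelized across
        # output positions.
        next_token = model(tokens)
        tokens.append(next_token)
    return tokens

# Stand-in "model": predicts the length of the sequence so far.
print(generate_autoregressive(len, [1, 2, 3], 4))  # -> [1, 2, 3, 3, 4, 5, 6]
```

However cheap each call to `model` is, the four new tokens still cost four strictly ordered calls. That ordering is what diffusion removes.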
Diffusion models break this constraint. The process works in three phases:
Initialization - Start with a noisy representation of the entire output. Think of it as a garbled version of the final answer where every position has some text, but most of it is wrong.
Iterative Refinement - A transformer model evaluates the entire noisy output and suggests improvements. Because it looks at the whole sequence simultaneously, it can modify multiple tokens in parallel. Each denoising step makes the output cleaner and more coherent.
Convergence - After enough iterations, the output stabilizes into a clear, natural-language response.
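The three phases can be mimicked with a deliberately simplified toy. This is an illustration of the coarse-to-fine idea only, not Mercury's actual denoiser: the "refinement" here just reveals the target on a fixed schedule, but it shows the key structural property, that each pass edits positions scattered across the whole sequence in parallel rather than appending left to right.

```python
def diffusion_generate(target, steps):
    """Toy coarse-to-fine generation (illustration, not a real denoiser)."""
    seq = ["#"] * len(target)          # 1. Initialization: all noise
    for t in range(steps):             # 2. Iterative refinement
        # Each pass "denoises" positions spread across the WHOLE
        # sequence at once -- not left to right.
        for i in range(len(target)):
            if i % steps == t:
                seq[i] = target[i]
        print("".join(seq))
    return "".join(seq)                # 3. Convergence

diffusion_generate("hello world", 3)
```

Each printed intermediate is partially garbled everywhere at once, then sharpens globally, which is exactly what the diffusion animation in Mercury's web interface shows.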
Because the model is not restricted to only considering previous output, Inception Labs argues it has structural advantages for reasoning and response organization. It can see the full context of its own output at every step and adjust any part of it. Autoregressive models commit to each token permanently as they generate it. If token 50 makes token 10 look wrong in retrospect, there is no going back.
Diffusion models can also continually refine their output, which gives them a mechanism for self-correction. If an early iteration introduces a hallucination, later iterations can catch and fix it. This is not guaranteed, but the architecture at least makes it possible.
Mercury's first release was not targeting frontier model performance. The comparison set was mid-tier: GPT-4o Mini, Claude 3.5 Haiku, Gemini 2.0 Flash Lite, Qwen, and DeepSeek. Against this group, Mercury held its own.
On HumanEval, the standard code generation benchmark, Mercury scored 88 to 90 percent depending on the variant. These are strong results for a first-generation model using a fundamentally new architecture. It was not outperforming Claude 3.5 Sonnet or the full GPT-4o, but it was competitive with the lightweight models from every major lab.
In the Copilot Arena, where real developers evaluated code generation quality in blind tests, Mercury ranked number one for speed and number two for quality. Developers preferred its output over the alternatives when judged without knowing which model produced it.
The benchmark story is one of potential rather than dominance. If the first commercial diffusion LLM matches the quality of established lightweight models while running 10x faster, the trajectory for future versions becomes very interesting.
The practical implications for application development are significant:
Real-time applications become feasible. At 1,000+ tokens per second, you can generate substantial responses in real time without users noticing any lag. Chat interfaces, code completion, inline suggestions - these all benefit from lower latency.
Inference costs drop. Faster generation on the same hardware means lower cost per token. For high-volume applications where you are processing thousands of requests per minute, the economics shift substantially.
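The latency claim is easy to sanity-check with back-of-envelope math, using the tokens-per-second figures quoted earlier and an assumed 500-token response:

```python
response_tokens = 500  # assumed response length

# Launch-era throughput figures quoted above (approximate).
for name, tokens_per_sec in [("Mercury Coder Mini", 1000), ("GPT-4o Mini", 65)]:
    latency = response_tokens / tokens_per_sec
    print(f"{name}: {latency:.2f} s for {response_tokens} tokens")
```

Half a second versus roughly eight seconds for the same response is the difference between an interface that feels instant and one where the user watches text stream in.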
Standard hardware works. You do not need to negotiate access to specialized inference chips or lock into a single hardware vendor. H100s are widely available from every major cloud provider.
Tool use and agentic workflows are supported. Mercury was not limited to simple text generation. The launch materials confirmed support for RAG, tool calling, and agentic workflows, the building blocks of modern AI applications.
One important caveat from the launch: the speed benchmarks were measured on controlled hardware with controlled load. In production, maintaining those speeds under real traffic is a different challenge.
Claude 3.5 Haiku and GPT-4o Mini run at 60 to 70 tokens per second in production, but those are endpoints handling enormous concurrent demand. The speed is bottlenecked not just by the model but by the infrastructure serving thousands of simultaneous requests.
Whether Inception Labs could maintain 1,000+ tokens per second while scaling to enterprise-level demand was an open question at launch. The algorithmic speedup is real, but production inference involves load balancing, batching, queuing, and hardware utilization tradeoffs that do not show up in single-user benchmarks.
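A crude way to see why single-user benchmarks overstate production speed: if a fixed hardware budget is split naively across concurrent requests, per-user throughput collapses. This toy model deliberately ignores the batching efficiencies real serving stacks recover; closing that gap is precisely the engineering challenge described above.

```python
benchmark_tps = 1000  # assumed single-user throughput from the launch benchmarks

# Naive model: throughput divided evenly across concurrent requests.
# Real inference servers batch requests and recover much of this loss.
for concurrent in (1, 10, 100):
    per_user = benchmark_tps / concurrent
    print(f"{concurrent:>3} concurrent users -> {per_user:.0f} tok/s each")
```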
Inception Labs made a compelling case for why diffusion is the right paradigm shift for language models. Their core argument: frontier LLM companies are betting on test-time compute to increase reasoning capabilities, but generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. Diffusion offers an alternative path.
The precedent is clear. Diffusion already powers the most successful AI applications for images (Stable Diffusion, Midjourney), video (Sora), and audio (Riffusion). These are all domains where the coarse-to-fine refinement process produces better results than sequential generation. The question was always whether the same approach could work for discrete data like text and code. Mercury demonstrated that it can.
At launch, Inception Labs provided a web interface at chat.inceptionlabs.ai where you could interact with Mercury directly. The interface included an option to enable the diffusion animation, showing the coarse-to-fine text generation process in real time.
Watching the animation is genuinely striking. Text appears as garbled noise across the entire response, then sharpens with each iteration until it reads naturally. It is a visual demonstration of how fundamentally different the generation process is from the token-by-token output you see with autoregressive models.
For code generation tasks, the speed is immediately apparent. A JavaScript animation request that would take 30 seconds with ChatGPT appears in full within a few seconds. The output quality was competitive with the smaller models from OpenAI and Anthropic, making Mercury a viable option for applications where response time is the primary concern.
Mercury's launch raised questions that extend beyond one startup's product. If diffusion works for text generation, it suggests that the entire field of language modeling has been constrained by the autoregressive assumption. Every major LLM - GPT, Claude, Gemini, Llama, DeepSeek - generates text sequentially. Mercury demonstrated that this constraint is not fundamental. It is an architectural choice, and alternative choices exist.
The self-correction property of diffusion is particularly interesting for coding applications. Autoregressive models commit to each line of code as they generate it. If line 50 creates a bug that is only apparent in the context of line 100, the model cannot go back and fix it. A diffusion model can, because every iteration has access to the full output and can modify any part of it.
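A minimal sketch of the difference, using toy numbers rather than real code generation: suppose position 0 must equal the sum of the rest of the sequence. An autoregressive generator commits to position 0 before the rest exists; a refinement pass that sees the full output can go back and fix it.

```python
def refine(seq):
    """One whole-sequence refinement pass: every position can change."""
    fixed = list(seq)
    # Revisit an EARLY position using LATE context -- the move an
    # autoregressive decoder can never make.
    fixed[0] = sum(seq[1:])
    return fixed

draft = [0, 3, 4, 5]   # position 0 was committed before 3, 4, 5 existed
print(refine(draft))   # -> [12, 3, 4, 5]
```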
This does not mean diffusion models are automatically better at coding. The quality depends on training data, model size, and the denoising architecture. But the theoretical advantage of full-output visibility during generation is real, and future models may exploit it more effectively.
Mercury's launch in February 2025 proved the concept. A diffusion LLM could match the quality of established autoregressive models at dramatically higher speeds. The model was not frontier-class in terms of raw capability, but it did not need to be. The architecture was the breakthrough.
The implications extend beyond a single company. If diffusion-based text generation works at commercial scale, it opens a new dimension of competition in the LLM market. Speed, quality, and cost have always been the three axes. Autoregressive models optimize along quality and cost. Diffusion models add a massive speed advantage without proportional quality loss.
The follow-up, Mercury 2, would push the concept further by adding reasoning capabilities to the diffusion architecture. But the original Mercury launch was the moment that proved diffusion language models were not just a research curiosity. They were a viable, commercial-grade alternative to the autoregressive paradigm that had dominated the field since GPT-2.
For developers building real-time AI applications, this was one of the most important architectural developments of early 2025. The question shifted from "how do we make autoregressive models faster?" to "do we need autoregressive models at all?"