TL;DR
A hands-on developer guide to Mercury 2 from Inception Labs. OpenAI-compatible API, reasoning levels, tool use, structured outputs, and when a diffusion LLM beats an autoregressive one in real apps.
| Official Sources | |
|---|---|
| Inception Labs | Mercury 2 homepage and product overview |
| Inception Labs API | OpenAI-compatible API endpoint |
| Mercury 2 Announcement | Model launch details and benchmarks |
| OpenAI SDK | Python SDK for OpenAI-compatible APIs |
| Vercel AI SDK | TypeScript SDK used in the video demo |
The Mercury 2 announcement post covered what the model is and why diffusion language models matter. This post is for the developer who already gets the pitch and wants to know what it actually feels like to build with one. We will wire up the API, run a real agent loop, talk about the trade-offs nobody tweets about, and figure out where Mercury 2 belongs in your stack and where it does not.
If you have not read the primer on diffusion language models, the short version is this. Every other LLM you use generates one token at a time, locking each token before moving on. Mercury 2 does not. It generates multiple tokens per forward pass and refines the output across iterations, the same coarse-to-fine process that powers image and video diffusion. That single design choice is why it clears 1,000 tokens per second on standard hardware while staying competitive on reasoning benchmarks.
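Before any code, here is what shapes the build decisions:
- Speed: over 1,000 tokens per second, against roughly 89 t/s for Claude Haiku 4.5 and 71 t/s for GPT-5 Mini.
- Price: $0.25 per million input tokens and $0.75 per million output tokens.
- API: OpenAI-compatible, with tool calling and structured JSON outputs.
- Reasoning levels: instant, low, medium, and high, set per request through reasoning_effort.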
That last point is the one most teams miss. Mercury 2 lets you pick how hard the model thinks per request. You do not have to commit to a single reasoning budget for your whole app the way you do with most reasoning models.
The API is OpenAI-compatible. If your app already talks to OpenAI, the migration is three changes: base URL, model string, API key.
Here is a minimal Python call using the OpenAI SDK:
```python
import os

from openai import OpenAI

# Point the standard OpenAI client at the Inception Labs endpoint.
client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "system", "content": "You answer concisely."},
        {"role": "user", "content": "Explain diffusion sampling in two sentences."},
    ],
    # Per-request reasoning level, passed through in the request body.
    extra_body={"reasoning_effort": "low"},
)
print(response.choices[0].message.content)
```
The same shape in TypeScript with the Vercel AI SDK, which is what the demo in the original video uses:
```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

// Reuse the OpenAI provider, pointed at the Inception Labs endpoint.
const inception = createOpenAI({
  apiKey: process.env.INCEPTION_API_KEY,
  baseURL: "https://api.inceptionlabs.ai/v1",
});

const { text } = await generateText({
  model: inception("mercury-2"),
  prompt: "Summarize the difference between diffusion and autoregressive LLMs.",
});
console.log(text);
```
That is the entire onboarding cost. No new SDK, no new auth pattern, no rewriting your retry logic. If you have already built an agent on top of Anthropic, OpenAI, or DeepSeek, you can swap Mercury 2 in behind a config flag.
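A minimal sketch of that config flag, assuming an environment variable named `USE_MERCURY` and an existing OpenAI deployment as the fallback. The flag name, the `ProviderSettings` type, and the `gpt-5-mini` fallback model are hypothetical choices for illustration; only the Mercury 2 base URL, model string, and key name come from this post.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderSettings:
    base_url: Optional[str]  # None means "use the SDK's default endpoint"
    model: str
    api_key_env: str         # name of the environment variable holding the key


# Hypothetical registry: "default" stands in for whatever your app calls today.
PROVIDERS = {
    "mercury": ProviderSettings(
        "https://api.inceptionlabs.ai/v1", "mercury-2", "INCEPTION_API_KEY"
    ),
    "default": ProviderSettings(None, "gpt-5-mini", "OPENAI_API_KEY"),
}


def pick_provider(env: Mapping[str, str] = os.environ) -> ProviderSettings:
    """Return Mercury 2 settings when the flag is on, the existing provider otherwise."""
    flag = env.get("USE_MERCURY", "").strip().lower()
    return PROVIDERS["mercury" if flag in {"1", "true", "yes"} else "default"]
```

With the settings in hand, the client construction stays the same as before: pass `settings.base_url` and `os.environ[settings.api_key_env]` to `OpenAI(...)`, and use `settings.model` as the model string.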
Tool use is where Mercury 2 starts to feel different in production. Tool calls are where autoregressive models eat your latency budget alive. Each call generates a JSON payload sequentially. Each round trip waits on token-by-token output before the orchestration layer can fire the actual tool. In a five-step agent loop you pay that latency tax five times.
Diffusion generation collapses that tax. Here is a tool definition the way the video demo uses it, with a browser tool that scrapes Hacker News:
```python
# One function tool: fetch a page and hand its text back to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "open_url",
        "description": "Open a URL and return the page text.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
            },
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "user", "content": "Find the top three AI stories on Hacker News and summarize the comments."},
    ],
    tools=tools,
    extra_body={"reasoning_effort": "medium"},
)

# tool_calls is None when the model answers directly, hence the "or []".
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
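The snippet above stops at printing the requested tool calls. A full agent loop also has to run each tool and send the result back. The following is a sketch of that loop under the standard OpenAI chat-completions message format; `run_agent` and `tool_impls` are hypothetical names, and the `client` argument is assumed to be the OpenAI client configured earlier.

```python
import json


def run_agent(client, messages, tools, tool_impls, model="mercury-2", max_steps=5):
    """Call the model, execute any requested tools, and repeat until it answers in text."""
    messages = list(messages)  # work on a copy so the caller's list is untouched
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # no tool requests left: this is the final answer

        # Echo the assistant turn back, including the tool calls it asked for.
        messages.append({
            "role": "assistant",
            "content": message.content,
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
                for call in message.tool_calls
            ],
        })
        # Run each tool locally and report its result under the matching call id.
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = tool_impls[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    raise RuntimeError(f"no final answer after {max_steps} steps")
```

Here `tool_impls` maps tool names to plain Python callables, for example `{"open_url": fetch_page_text}` where `fetch_page_text` is your own function. Every pass through the loop is one model round trip, which is the cost the next paragraph is about.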
A tool-heavy agent on Mercury 2 finishes somewhere between five and ten times faster in wall-clock time than a comparable autoregressive model running the same loop. That is not a benchmark gain; it is a UX gain.
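To see where a number in that range can come from, here is a back-of-the-envelope estimate. The workload is an assumption made up for illustration (five steps, 300 generated tokens per step, 0.2 seconds per tool call); the throughput figures are the ones quoted in this post. It ignores network overhead and time to first token, so treat it as rough arithmetic, not a measurement.

```python
def loop_seconds(steps, tokens_per_step, tokens_per_second, tool_seconds):
    """Wall-clock estimate: every step pays generation time plus tool time."""
    return steps * (tokens_per_step / tokens_per_second + tool_seconds)


# Assumed workload: 5 steps, 300 generated tokens per step, 0.2 s per tool call.
autoregressive = loop_seconds(5, 300, 89, 0.2)   # ~89 t/s, the Claude Haiku 4.5 figure
mercury = loop_seconds(5, 300, 1000, 0.2)        # 1,000 t/s

print(f"autoregressive: {autoregressive:.1f} s")          # 17.9 s
print(f"mercury 2:      {mercury:.1f} s")                 # 2.5 s
print(f"speedup:        {autoregressive / mercury:.1f}x") # 7.1x
```

The fixed tool time is what keeps the ratio below the raw throughput gap: the slower your tools, the less generation speed alone can buy you.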
Diffusion is a natural fit for structured generation because the model refines the entire output at once instead of committing left to right. Schema adherence stops feeling like a fight with the sampler.
```python
# JSON Schema for one extracted Hacker News post.
schema = {
    "name": "extract_post",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "url": {"type": "string"},
            "comment_count": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["title", "url", "comment_count", "summary"],
    },
}

# page_text is the page content to extract from (defined elsewhere).
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": page_text}],
    response_format={"type": "json_schema", "json_schema": schema},
)
```
The schema returns clean on the first pass at "low" reasoning effort for most extraction tasks. That is the use case I would migrate first if you are running a high-volume scraping or normalization pipeline.
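Even when the schema comes back clean most of the time, a high-volume pipeline should still check each reply before it reaches storage. Below is a small stdlib-only sketch of that check for the extract_post fields; `parse_post` and `REQUIRED_FIELDS` are hypothetical names, and a full JSON Schema validator would be the more thorough choice.

```python
import json

# Field names and Python types mirroring the extract_post schema.
REQUIRED_FIELDS = {"title": str, "url": str, "comment_count": int, "summary": str}


def parse_post(raw: str) -> dict:
    """Parse the model's JSON reply and check it against the extract_post fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        is_bool_as_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_as_int:
            raise ValueError(f"wrong type for field: {field}")
    return data
```

In the pipeline this would wrap the reply text, as in `post = parse_post(response.choices[0].message.content)`, with a retry at a higher reasoning level as one possible response to a `ValueError`.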
Mercury 2 exposes four levels through the reasoning_effort parameter: instant, low, medium, high. Treat them like a knob between latency and quality, not a quality dial alone.
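A working rule from a few weeks of building with it:
- instant: classification and routing.
- low: extraction and summarization.
- medium: multi-tool agent loops.
- high: math and deep code reasoning.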
The key insight is that you can mix them in a single user-facing flow. Use instant for the planner, medium for the executor, low for the formatter. Most of your latency budget gets spent where reasoning actually matters.
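One way to keep that mix manageable is a single table from pipeline stage to reasoning level. This is a sketch, with the stage names taken from the paragraph above and `request_kwargs` as a hypothetical helper; it only builds the arguments and leaves the actual API call to your existing client.

```python
# Stage-to-effort mapping from the planner / executor / formatter split above.
EFFORT_BY_STAGE = {
    "planner": "instant",
    "executor": "medium",
    "formatter": "low",
}


def request_kwargs(stage, messages, model="mercury-2"):
    """Build keyword arguments for client.chat.completions.create for one stage."""
    return {
        "model": model,
        "messages": messages,
        "extra_body": {"reasoning_effort": EFFORT_BY_STAGE[stage]},
    }
```

A call then looks like `client.chat.completions.create(**request_kwargs("planner", messages))`, and tuning a stage up or down is a one-line change in the table.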
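So where does Mercury 2 belong in your stack? The honest answer is anywhere latency multiplies:
- Voice agents.
- Coding iteration loops.
- High-fanout document pipelines.
- Real-time UIs that need streaming structured data.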
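A few honest caveats from time in the trenches. Stay on an autoregressive model when you need:
- A specific creative voice.
- Stream-of-consciousness generation.
- Prompts already tuned to a particular model's behavior.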
Here is the framing that finally clicked for me. Autoregressive generation is a typewriter. Each keystroke is permanent. If the model commits to a wrong token early, the rest of the output has to work around that mistake. That is where reasoning models burn tokens correcting themselves mid-stream.
Diffusion generation is an editor with a draft. The model produces a rough version of the entire output, then refines it across iterations. Mistakes get caught and fixed during generation, not after. That is why diffusion and reasoning compose so naturally. The reasoning step is not bolted on, it is part of the sampling loop.
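A toy can make the draft-and-refine idea concrete. The sketch below is not Inception Labs' sampler, which this post does not describe in detail; it is a simplified confidence-based unmasking loop in which a stand-in "model" proposes a token for every position at once and the most confident positions are committed on each pass. All names are hypothetical, and the stand-in always proposes the right token, so the only thing it demonstrates is the pass count.

```python
import random

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "_"


def toy_denoiser(draft, rng):
    """Stand-in for a model: propose a token and a confidence for every position at once."""
    return [(TARGET[i], rng.random()) for i in range(len(draft))]


def refine(n_positions, per_pass, seed=0):
    """Fill a fully masked draft, committing up to per_pass positions on each pass."""
    rng = random.Random(seed)
    draft = [MASK] * n_positions
    passes = 0
    while MASK in draft:
        proposals = toy_denoiser(draft, rng)
        masked = [i for i, token in enumerate(draft) if token == MASK]
        # Commit the most confident still-masked positions; the rest wait for a later pass.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_pass]:
            draft[i] = proposals[i][0]
        passes += 1
    return draft, passes


draft, passes = refine(len(TARGET), per_pass=3)
print(" ".join(draft), f"({passes} passes)")
```

Nine tokens arrive in three passes here, where a one-token-per-pass decoder would need nine. The toy never revises a committed token, which a real diffusion sampler may do, so it understates the "editor" half of the analogy.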
This is the same architectural shift that took image generation from GANs to Stable Diffusion. The people who built those original diffusion methods, including Stefano Ermon at Stanford, are the people who founded Inception Labs. Mercury 2 is them applying the same playbook to text.
If you want to A/B Mercury 2 against your current model, here is the shortest path I have found:
- Add INCEPTION_API_KEY and a feature flag for the model selection.
- Start with reasoning_effort: "low" and only step up if you see quality regressions.

The whole switch is a half-day of work for most apps. The decision after that is a quality call against your own evals.
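For the A/B split itself, one common approach is deterministic bucketing, so a given user always lands on the same model for the length of the experiment. This is a sketch assuming you have a stable user identifier; `use_mercury` and `rollout_percent` are hypothetical names.

```python
import hashlib


def use_mercury(user_id: str, rollout_percent: int) -> bool:
    """Deterministically assign a user to the Mercury 2 arm of the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent
```

Starting at a small `rollout_percent` and logging latency and eval scores per arm gives you the comparison the last step depends on.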
Mercury 2 is the first model that makes me think the autoregressive monoculture has a real challenger, not just a faster sibling. The benchmarks land in the right zip code. The price is aggressive. The OpenAI compatibility kills the integration cost. And the reasoning-level knob means you do not have to pick a single point on the latency-quality curve for your whole app.
I would not throw out my Sonnet or GPT-5 calls today. I would route every latency-sensitive path through Mercury 2 and start measuring. That is where the wins live, and that is where the architecture actually pays off.
If you want the original deep dive, the Mercury 2 video walkthrough on the channel covers the demo and the diffusion explainer. If you want the broader context on agent frameworks, the agent frameworks comparison is the right next read.
What is Mercury 2?
Mercury 2 is a diffusion-based large language model from Inception Labs. Unlike autoregressive models (GPT, Claude, Llama) that generate one token at a time, Mercury 2 generates multiple tokens per forward pass and refines the output across iterations. This architectural difference enables throughput over 1,000 tokens per second while maintaining competitive quality on reasoning benchmarks.
How fast is Mercury 2?
Mercury 2 achieves over 1,000 tokens per second, compared to roughly 89 t/s for Claude Haiku 4.5 and 71 t/s for GPT-5 Mini. In tool-heavy agent workflows, wall-clock time can be five to ten times faster than comparable autoregressive models running the same loop.
How much does Mercury 2 cost?
Mercury 2 costs $0.25 per million input tokens and $0.75 per million output tokens. Combined with its high throughput, this makes it cost-effective for high-volume workloads like document processing, classification, and extraction pipelines.
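Those per-token prices turn into a job estimate with simple arithmetic. The workload below (one million documents, 2,000 input tokens and 200 output tokens each) is an assumption for illustration; `job_cost_usd` is a hypothetical helper.

```python
INPUT_USD_PER_MILLION = 0.25   # Mercury 2 input price quoted above
OUTPUT_USD_PER_MILLION = 0.75  # Mercury 2 output price quoted above


def job_cost_usd(requests, input_tokens_each, output_tokens_each):
    """Estimate the total cost of a batch job from per-request token counts."""
    input_cost = requests * input_tokens_each / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = requests * output_tokens_each / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Assumed workload: 1,000,000 documents, 2,000 tokens in and 200 tokens out per document.
print(f"${job_cost_usd(1_000_000, 2_000, 200):,.2f}")  # $650.00
```

That is $500 of input and $150 of output, so for extraction-style jobs the input side dominates the bill.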
Does Mercury 2 support tool calling and structured outputs?
Yes. Mercury 2 supports tool calling and structured JSON outputs through an OpenAI-compatible API. Diffusion generation is particularly effective for structured outputs because the model refines the entire output at once instead of committing left to right, improving schema adherence on the first pass.
What reasoning levels does Mercury 2 offer?
Mercury 2 exposes four reasoning levels through the reasoning_effort parameter: instant, low, medium, and high. Instant is for classification and routing. Low handles extraction and summarization. Medium works for multi-tool agent loops. High is for math and deep code reasoning. You can mix reasoning levels within a single user flow.
When should you use Mercury 2 instead of an autoregressive model?
Mercury 2 excels where latency multiplies: voice agents, coding iteration loops, high-fanout document pipelines, and real-time UIs that need streaming structured data. Use autoregressive models when you need a specific creative voice, stream-of-consciousness generation, or prompts already tuned to a particular model's behavior.
How do you migrate an existing app to Mercury 2?
Mercury 2 uses an OpenAI-compatible API. Migration requires three changes: set the base URL to https://api.inceptionlabs.ai/v1, change the model string to mercury-2, and use your Inception Labs API key. Existing prompts and tool definitions work without modification.