TL;DR
A hands-on developer guide to Mercury 2 from Inception Labs. OpenAI-compatible API, reasoning levels, tool use, structured outputs, and when a diffusion LLM beats an autoregressive one in real apps.
The Mercury 2 announcement post covered what the model is and why diffusion language models matter. This post is for the developer who already gets the pitch and wants to know what it actually feels like to build with one. We will wire up the API, run a real agent loop, talk about the trade-offs nobody tweets about, and figure out where Mercury 2 belongs in your stack and where it does not.
If you have not read the primer on diffusion language models, the short version is this. Every other LLM you use generates one token at a time, locking each token before moving on. Mercury 2 does not. It generates multiple tokens per forward pass and refines the output across iterations, the same coarse-to-fine process that powers image and video diffusion. That single design choice is why it clears 1,000 tokens per second on standard hardware while staying competitive on reasoning benchmarks.
Before any code, here is the property that shapes the build decisions most, and the one most teams miss: Mercury 2 lets you pick how hard the model thinks per request. You do not have to commit to a single reasoning budget for your whole app the way you do with most reasoning models.
The API is OpenAI-compatible. If your app already talks to OpenAI, the migration is three changes: base URL, model string, API key.
Here is a minimal Python call using the OpenAI SDK:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "system", "content": "You answer concisely."},
        {"role": "user", "content": "Explain diffusion sampling in two sentences."},
    ],
    extra_body={"reasoning_effort": "low"},
)
print(response.choices[0].message.content)
The same shape in TypeScript with the Vercel AI SDK, which is what the demo in the original video uses:
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const inception = createOpenAI({
  apiKey: process.env.INCEPTION_API_KEY,
  baseURL: "https://api.inceptionlabs.ai/v1",
});

const { text } = await generateText({
  model: inception("mercury-2"),
  prompt: "Summarize the difference between diffusion and autoregressive LLMs.",
});
console.log(text);
That is the entire onboarding cost. No new SDK, no new auth pattern, no rewriting your retry logic. If you have already built an agent on top of Anthropic, OpenAI, or DeepSeek, you can swap Mercury 2 in behind a config flag.
Tool use is where Mercury 2 starts to feel different in production. Tool calls are where autoregressive models eat your latency budget alive. Each call generates a JSON payload sequentially. Each round trip waits on token-by-token output before the orchestration layer can fire the actual tool. In a five-step agent loop you pay that latency tax five times.
Diffusion generation collapses that tax. Here is a tool definition in the shape the video demo uses: a browser tool that scrapes Hacker News.
tools = [{
    "type": "function",
    "function": {
        "name": "open_url",
        "description": "Open a URL and return the page text.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
            },
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "user", "content": "Find the top three AI stories on Hacker News and summarize the comments."},
    ],
    tools=tools,
    extra_body={"reasoning_effort": "medium"},
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
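That only surfaces the first round of tool calls. Closing the loop means executing each call and sending the result back as a tool message. Here is a minimal sketch, reusing client and tools from above; fetch_page is a hypothetical stand-in for your own scraper, not part of the API:

import json

def fetch_page(url: str) -> str:
    # Hypothetical helper: swap in your own HTTP fetch and HTML-to-text step.
    raise NotImplementedError

messages = [
    {"role": "user", "content": "Find the top three AI stories on Hacker News and summarize the comments."},
]

while True:
    response = client.chat.completions.create(
        model="mercury-2",
        messages=messages,
        tools=tools,
        extra_body={"reasoning_effort": "medium"},
    )
    message = response.choices[0].message
    if not message.tool_calls:
        break  # plain-text answer, the loop is done
    messages.append(message)  # keep the assistant turn in the transcript
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": fetch_page(args["url"]),
        })

print(message.content)

Every iteration of that while loop is one round trip whose latency scales with generation speed, which is exactly where the diffusion approach compounds.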
In a tool-heavy agent, wall-clock time on Mercury 2 comes in somewhere between five and ten times faster than a comparable autoregressive model running the same loop. That is not a benchmark gain; that is a UX gain.
Diffusion is a natural fit for structured generation because the model refines the entire output at once instead of committing left to right. Schema adherence stops feeling like a fight with the sampler.
schema = {
    "name": "extract_post",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "url": {"type": "string"},
            "comment_count": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["title", "url", "comment_count", "summary"],
    },
}

response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": page_text}],
    response_format={"type": "json_schema", "json_schema": schema},
)
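The message content comes back as the JSON document itself, so consuming it is a single parse. Continuing from the call above:

import json

post = json.loads(response.choices[0].message.content)
print(post["title"], post["comment_count"])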
The schema returns clean on the first pass at "low" reasoning effort for most extraction tasks. That is the use case I would migrate first if you are running a high-volume scraping or normalization pipeline.
Mercury 2 exposes four levels through the reasoning_effort parameter: instant, low, medium, high. Treat them like a knob between latency and quality, not a quality dial alone.
A working rule from a few weeks of building with it: match the effort to the step. The key insight is that you can mix levels in a single user-facing flow. Use instant for the planner, medium for the executor, low for the formatter, as sketched below. Most of your latency budget gets spent where reasoning actually matters.
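A minimal sketch of that split, reusing the client from earlier; the ask helper and the prompts are illustrative, not part of the API:

def ask(prompt: str, effort: str) -> str:
    # One helper, one knob: reasoning_effort varies per call.
    resp = client.chat.completions.create(
        model="mercury-2",
        messages=[{"role": "user", "content": prompt}],
        extra_body={"reasoning_effort": effort},
    )
    return resp.choices[0].message.content

question = "What changed in the latest Mercury release?"
plan = ask(f"List the steps needed to answer: {question}", "instant")
draft = ask(f"Carry out this plan and answer the question:\n{plan}", "medium")
answer = ask(f"Rewrite as three tight bullet points:\n{draft}", "low")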
So where does Mercury 2 belong? The honest answer is, anywhere latency multiplies: agent loops, multi-step tool chains, any pipeline where generation time compounds step after step.
Honest caveats from time in the trenches aside, here is the framing that finally clicked for me. Autoregressive generation is a typewriter. Each keystroke is permanent. If the model commits to a wrong token early, the rest of the output has to work around that mistake. That is where reasoning models burn tokens correcting themselves mid-stream.
Diffusion generation is an editor with a draft. The model produces a rough version of the entire output, then refines it across iterations. Mistakes get caught and fixed during generation, not after. That is why diffusion and reasoning compose so naturally. The reasoning step is not bolted on, it is part of the sampling loop.
This is the same architectural shift that took image generation from GANs to Stable Diffusion. The people who built those original diffusion methods, including Stefano Ermon at Stanford, are the people who founded Inception Labs. Mercury 2 is them applying the same playbook to text.
If you want to A/B Mercury 2 against your current model, here is the shortest path I have found:
1. Set INCEPTION_API_KEY and add a feature flag for the model selection (a sketch follows below).
2. Start at reasoning_effort: "low" and only step up if you see quality regressions.

The whole switch is a half-day of work for most apps. The decision after that is a quality call against your own evals.
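For the flag itself, a minimal sketch, assuming an environment-variable toggle; the non-Mercury model name is a placeholder for whatever you run today:

import os
from openai import OpenAI

use_mercury = os.environ.get("USE_MERCURY") == "1"

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY" if use_mercury else "OPENAI_API_KEY"],
    # None falls back to the SDK's default OpenAI endpoint.
    base_url="https://api.inceptionlabs.ai/v1" if use_mercury else None,
)
model = "mercury-2" if use_mercury else "gpt-5"  # placeholder for your current model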
Mercury 2 is the first model that makes me think the autoregressive monoculture has a real challenger, not just a faster sibling. The benchmarks land in the right zip code. The price is aggressive. The OpenAI compatibility kills the integration cost. And the reasoning-level knob means you do not have to pick a single point on the latency-quality curve for your whole app.
I would not throw out my Sonnet or GPT-5 calls today. I would route every latency-sensitive path through Mercury 2 and start measuring. That is where the wins live, and that is where the architecture actually pays off.
If you want the original deep dive, the Mercury 2 video walkthrough on the channel covers the demo and the diffusion explainer. If you want the broader context on agent frameworks, the agent frameworks comparison is the right next read.