Mercury 2 Developer Guide: Building With a Diffusion LLM in Production

Why This Guide Exists

The Mercury 2 announcement post covered what the model is and why diffusion language models matter. This post is for the developer who already gets the pitch and wants to know what it actually feels like to build with one. We will wire up the API, run a real agent loop, talk about the trade-offs nobody tweets about, and figure out where Mercury 2 belongs in your stack and where it does not.

If you have not read the primer on diffusion language models, the short version is this. Nearly every other LLM you use is autoregressive: it generates one token at a time, locking each token before moving on. Mercury 2 does not. It generates multiple tokens per forward pass and refines the output across iterations, the same coarse-to-fine process that powers image and video diffusion. That single design choice is why it clears 1,000 tokens per second on standard hardware while staying competitive on reasoning benchmarks.

The Numbers That Matter for Production

Before any code, here is what shapes the build decisions:

Throughput: over 1,000 tokens per second, compared to roughly 89 t/s for Claude Haiku 4.5 and 71 t/s for GPT-5 Mini.
Quality: ties GPT-5 Mini on AIME 2025 at 91.1, scores competitively on GPQA and LiveCodeBench.
Pricing: $0.25 per million input tokens, $0.75 per million output tokens.
Context window: 128,000 tokens.
Features: tool use, structured outputs, RAG, OpenAI-compatible API.
Reasoning levels: instant, low, medium, high.

That last point is the one most teams miss. Mercury 2 lets you pick how hard the model thinks per request. You do not have to commit to a single reasoning budget for your whole app the way you do with most reasoning models.

Wiring Up the API

The API is OpenAI-compatible. If your app already talks to OpenAI, the migration is three changes: base URL, model string, API key.

Here is a minimal Python call using the OpenAI SDK:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "system", "content": "You answer concisely."},
        {"role": "user", "content": "Explain diffusion sampling in two sentences."},
    ],
    extra_body={"reasoning_effort": "low"},
)

print(response.choices[0].message.content)

The same shape in TypeScript with the Vercel AI SDK, which is what the demo in the original video uses:

import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const inception = createOpenAI({
  apiKey: process.env.INCEPTION_API_KEY,
  baseURL: "https://api.inceptionlabs.ai/v1",
});

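const { text } = await generateText({
  // .chat() targets the Chat Completions endpoint, which is the OpenAI-compatible surface used here.
  model: inception.chat("mercury-2"),
  prompt: "Summarize the difference between diffusion and autoregressive LLMs.",
});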

console.log(text);

That is the entire onboarding cost. No new SDK, no new auth pattern, no rewriting your retry logic. If you have already built an agent on top of Anthropic, OpenAI, or DeepSeek, you can swap Mercury 2 in behind a config flag.
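
As a rough illustration of the config-flag swap, the sketch below keeps provider settings in one place and picks one from an environment variable. The flag name `LLM_PROVIDER`, the `ProviderConfig` class, and the `gpt-5-mini` fallback are assumptions made for this example, not part of either API.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderConfig:
    base_url: Optional[str]  # None lets the OpenAI SDK use its default endpoint
    model: str
    api_key_env: str


# Hypothetical registry; the Mercury values come from the snippets above.
PROVIDERS = {
    "mercury": ProviderConfig(
        base_url="https://api.inceptionlabs.ai/v1",
        model="mercury-2",
        api_key_env="INCEPTION_API_KEY",
    ),
    "openai": ProviderConfig(
        base_url=None,
        model="gpt-5-mini",  # assumed incumbent model for this example
        api_key_env="OPENAI_API_KEY",
    ),
}


def select_provider(env=None) -> ProviderConfig:
    """Return the provider named by LLM_PROVIDER, defaulting to the incumbent."""
    env = os.environ if env is None else env
    name = env.get("LLM_PROVIDER", "openai")
    if name not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    return PROVIDERS[name]


print(select_provider({"LLM_PROVIDER": "mercury"}).model)
```

With this in place, the client would be built as `OpenAI(api_key=os.environ[cfg.api_key_env], base_url=cfg.base_url)` and every request would pass `model=cfg.model`, so the A/B switch is a single environment variable.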

Tool Use Without Wrapper Hell

Tool use is where Mercury 2 starts to feel different in production, because tool calls are where autoregressive models eat your latency budget alive. Each call generates its JSON payload token by token, and the orchestration layer cannot fire the actual tool until that payload is complete. In a five-step agent loop you pay that latency tax five times.
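
To put rough numbers on that tax, the sketch below estimates pure generation time for a five-step loop using the throughput figures quoted earlier in this guide. The 150 output tokens per tool call is an assumed payload size, and network and tool execution time are deliberately ignored.

```python
# Throughput figures quoted earlier in this guide (tokens per second).
THROUGHPUT_TPS = {
    "mercury-2": 1000,
    "claude-haiku-4.5": 89,
    "gpt-5-mini": 71,
}


def generation_seconds(tokens_per_step: int, steps: int, tps: float) -> float:
    """Time spent only on decoding output tokens across all steps."""
    return tokens_per_step * steps / tps


for name, tps in THROUGHPUT_TPS.items():
    # 150 tokens per call is an assumption; adjust to your own payloads.
    secs = generation_seconds(tokens_per_step=150, steps=5, tps=tps)
    print(f"{name}: {secs:.2f}s of generation across 5 tool calls")
```

Under these assumptions the loop spends 0.75 seconds generating at 1,000 t/s, against roughly 8.4 and 10.6 seconds at the two autoregressive throughputs.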

Diffusion generation collapses that tax. Here is a tool definition the way the video demo uses it, with a browser tool that scrapes Hacker News:

tools = [{
    "type": "function",
    "function": {
        "name": "open_url",
        "description": "Open a URL and return the page text.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
            },
            "required": ["url"],
        },
    },
}]

response = client.chat.completions.create(
    model="mercury-2",
    messages=[
        {"role": "user", "content": "Find the top three AI stories on Hacker News and summarize the comments."},
    ],
    tools=tools,
    extra_body={"reasoning_effort": "medium"},
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)

In a tool-heavy agent, the same loop lands somewhere between five and ten times faster in wall-clock time on Mercury 2 than on a comparable autoregressive model. That is not a benchmark gain; it is a UX gain.
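
The tool-calling snippet above stops at printing the requested calls. A minimal sketch of the next step, dispatching each call to a local function and building the `tool` message to send back, might look like this. The `open_url` body is a stub and `run_tool_call` is a hypothetical helper. The message shape follows the OpenAI chat completions convention, which this guide assumes Mercury 2 mirrors.

```python
import json
from types import SimpleNamespace


def open_url(url: str) -> str:
    """Stub tool; a real version would fetch and return the page text."""
    return f"<page text for {url}>"


# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"open_url": open_url}


def run_tool_call(call) -> dict:
    """Execute one tool call and return the message to append to the chat."""
    fn = TOOL_REGISTRY.get(call.function.name)
    if fn is None:
        result = f"error: unknown tool {call.function.name}"
    else:
        try:
            args = json.loads(call.function.arguments)
            result = fn(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Report bad arguments back to the model instead of crashing the loop.
            result = f"error: bad arguments ({exc})"
    return {"role": "tool", "tool_call_id": call.id, "content": result}


# Stand-in for an SDK tool-call object so the sketch runs without a network call.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="open_url",
        arguments='{"url": "https://news.ycombinator.com"}',
    ),
)
print(run_tool_call(fake_call))
```

In a real loop you would append the assistant message and each tool message to `messages`, then call `client.chat.completions.create` again until the model stops requesting tools.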

Structured Outputs

Diffusion is a natural fit for structured generation because the model refines the entire output at once instead of committing left to right. Schema adherence stops feeling like a fight with the sampler.

schema = {
    "name": "extract_post",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "url": {"type": "string"},
            "comment_count": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["title", "url", "comment_count", "summary"],
    },
}

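# page_text holds the scraped page content, e.g. the text returned by the open_url tool.
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": page_text}],
    response_format={"type": "json_schema", "json_schema": schema},
    extra_body={"reasoning_effort": "low"},
)

print(response.choices[0].message.content)  # JSON string shaped by the schema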

For most extraction tasks, the output conforms to the schema on the first pass at "low" reasoning effort. That is the use case I would migrate first if you are running a high-volume scraping or normalization pipeline.
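
Even when the output usually conforms, a cheap check before the data enters your pipeline is worth having. The sketch below parses a response string and verifies the required fields and their types using only the standard library. `validate_post` and the sample payload are illustrative. In practice the string would be `response.choices[0].message.content`.

```python
import json

# Field names and types mirror the extract_post schema above.
REQUIRED_FIELDS = {
    "title": str,
    "url": str,
    "comment_count": int,
    "summary": str,
}


def validate_post(raw: str) -> dict:
    """Parse a JSON string and check required fields; raise ValueError on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for {field}")
    return data


# Illustrative payload, not real model output.
sample = '{"title": "Show HN", "url": "https://example.com", "comment_count": 42, "summary": "A demo."}'
post = validate_post(sample)
print(post["comment_count"])  # prints 42
```

Counting how often this check fails at each reasoning level gives you a concrete signal for when stepping up from "low" is worth the latency.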

Choosing a Reasoning Level

Mercury 2 exposes four levels through the reasoning_effort parameter: instant, low, medium, high. Treat them like a knob between latency and quality, not a quality dial alone.

A working rule from a few weeks of building with it:

instant: classification, routing, intent detection, autocomplete-style suggestions, anything where you would have used a 7B chat model.
low: schema extraction, summarization, single-tool calls, RAG answer generation against a clean retrieval set.
medium: multi-tool agent loops, RAG over messy retrieval, code edits across one or two files, planning steps.
high: math, deep code reasoning, agent loops with conditional branching, anything where you would have reached for an o-series model or a Claude Sonnet thinking variant.

The key insight is that you can mix them in a single user-facing flow. Use instant for the router, medium for the executor, and low for the formatter. That way, most of your latency budget gets spent where reasoning actually matters.
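
One way to express that mix is a small lookup from pipeline stage to reasoning level. This is a sketch under assumptions: the stage names and the "low" default are made up for illustration, while the effort values are the four levels listed above.

```python
# Hypothetical stage names; the effort values are the four documented levels.
EFFORT_BY_STAGE = {
    "router": "instant",
    "formatter": "low",
    "executor": "medium",
    "solver": "high",
}


def request_options(stage: str) -> dict:
    """Build the keyword arguments that set reasoning effort for a stage."""
    effort = EFFORT_BY_STAGE.get(stage, "low")  # assumed default for unknown stages
    return {"extra_body": {"reasoning_effort": effort}}


print(request_options("executor"))
```

Each call site can then spread the result into the request, for example `client.chat.completions.create(model="mercury-2", messages=messages, **request_options("executor"))`, which keeps the per-stage choice in one table instead of scattered across the codebase.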

Where Mercury 2 Beats Autoregressive

The honest answer: anywhere latency multiplies.

Voice agents. The P95 of voice UX is brutal. Sub-second total turn time is table stakes, and most autoregressive reasoning models cannot get there. Mercury 2 can do tool-augmented turns inside that budget at low or medium reasoning.
Coding iteration. Tight feedback loops where you prompt, review, tweak. The difference between a 1,000 t/s edit and an 80 t/s edit changes how you work. It moves you from "waiting for the model" to "thinking with the model".
High-fanout pipelines. Document processing, classification, normalization. If you are paying for a million extractions a day, Mercury 2 at $0.25 input and $0.75 output is hard to beat, and the speed cuts your worker count.
Real-time UIs that need streaming structured data. Forms that fill themselves, dashboards that explain themselves, anything where the user is staring at the screen waiting for JSON.
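
For a sense of scale, here is the arithmetic for the million-extractions-a-day case at the listed prices. The 2,000 input and 200 output tokens per extraction are assumed averages, so substitute your own.

```python
INPUT_PRICE_PER_M = 0.25   # USD per million input tokens (listed price)
OUTPUT_PRICE_PER_M = 0.75  # USD per million output tokens (listed price)


def daily_cost(requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD per day for a fixed per-request token profile."""
    input_cost = requests * input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = requests * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Assumed profile: 1M extractions, 2,000 tokens in and 200 tokens out per request.
print(f"${daily_cost(1_000_000, 2_000, 200):,.2f} per day")  # prints $650.00 per day
```

Under these assumptions the bill is dominated by input tokens ($500 of the $650), so trimming the page text you send usually matters more than shortening the output.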

Where I Would Not Reach for It Yet

A few honest caveats from time in the trenches:

Long-form creative writing where you want a specific voice. Autoregressive models still feel more natural in pure prose generation. This is shifting, but it is real today.
Agentic workflows where the model needs to commit early and never revisit. Diffusion's strength is revision. If your task is more "stream of consciousness" than "draft and refine", you will not see the same lift.
Anything that depends on a specific frontier model's quirks. If your prompts are tuned to Claude's RLHF flavor or GPT-5's instruction following, plan to retune. Mercury 2 follows instructions cleanly but it is its own model.

Diffusion vs Autoregressive: The Mental Model

Here is the framing that finally clicked for me. Autoregressive generation is a typewriter. Each keystroke is permanent. If the model commits to a wrong token early, the rest of the output has to work around that mistake. That is where reasoning models burn tokens correcting themselves mid-stream.

Diffusion generation is an editor with a draft. The model produces a rough version of the entire output, then refines it across iterations. Mistakes get caught and fixed during generation, not after. That is why diffusion and reasoning compose so naturally. The reasoning step is not bolted on, it is part of the sampling loop.
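
A toy sketch can make the pass-count difference concrete. To be clear about its limits: this is not Inception's sampler, it uses the target text as an oracle where a real model would predict tokens, and it does not model revision of already-filled positions. It only counts how many passes are needed when one token versus several tokens are filled per pass.

```python
MASK = "_"


def parallel_fill(target: list[str], per_pass: int) -> list[str]:
    """Toy denoiser: reveal up to `per_pass` spread-out positions per pass."""
    draft = [MASK] * len(target)
    snapshots = []
    while MASK in draft:
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        stride = max(1, len(masked) // per_pass)
        for i in masked[::stride][:per_pass]:
            draft[i] = target[i]  # toy oracle; a real model predicts these
        snapshots.append(" ".join(draft))
    return snapshots


tokens = "the quick brown fox jumps".split()
typewriter = parallel_fill(tokens, per_pass=1)  # one token per pass
editor = parallel_fill(tokens, per_pass=2)      # several tokens per pass
print(len(typewriter), "passes vs", len(editor), "passes")  # 5 passes vs 3 passes
for snapshot in editor:
    print(snapshot)
```

The snapshots show the draft filling in across the whole sequence rather than strictly left to right, which is the part of the "editor" analogy this toy can capture.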

This is the same architectural shift that took image generation from GANs to Stable Diffusion. The people who built those original diffusion methods, including Stefano Ermon at Stanford, are the people who founded Inception Labs. Mercury 2 is them applying the same playbook to text.

Migration Checklist

If you want to A/B Mercury 2 against your current model, here is the shortest path I have found:

Add an environment variable for INCEPTION_API_KEY and a feature flag for the model selection.
Swap the base URL and model string behind that flag. Keep your existing prompt and tool definitions.
Start at reasoning_effort: "low" and only step up if you see quality regressions.
Measure two things in parallel: P50 wall clock latency end-to-end, and your existing quality eval if you have one.
For agent loops, log per-step latency. The wins are usually concentrated in the tool-call rounds, not the final answer.
If you are running streaming UIs, make sure your frontend can actually keep up. I have seen apps where the model finishes before the React render loop catches up. It is a good problem to have, but a real one.
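
For the latency items in this checklist, a small timing helper is usually enough to start. The sketch below records wall-clock time per named step and reports the median. The `timed` and `p50` names are hypothetical, and the `time.sleep` call stands in for a real model or tool call.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager

# step name -> list of durations in seconds
step_latencies = defaultdict(list)


@contextmanager
def timed(step: str):
    """Record wall-clock time for one step of the agent loop."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_latencies[step].append(time.perf_counter() - start)


def p50(step: str) -> float:
    """Median latency for a step, in seconds."""
    return statistics.median(step_latencies[step])


# Stand-in workload; replace the sleep with a model or tool call.
for _ in range(3):
    with timed("tool_call"):
        time.sleep(0.01)

print(f"tool_call P50: {p50('tool_call') * 1000:.1f} ms")
```

Wrapping each tool-call round and the final answer in separate `timed` blocks shows where the time actually goes when you compare the two models.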

The whole switch is a half-day of work for most apps. The decision after that is a quality call against your own evals.

Final Read

Mercury 2 is the first model that makes me think the autoregressive monoculture has a real challenger, not just a faster sibling. The benchmarks land in the right zip code. The price is aggressive. The OpenAI compatibility kills the integration cost. And the reasoning-level knob means you do not have to pick a single point on the latency-quality curve for your whole app.

I would not throw out my Sonnet or GPT-5 calls today. I would route every latency-sensitive path through Mercury 2 and start measuring. That is where the wins live, and that is where the architecture actually pays off.

If you want the original deep dive, the Mercury 2 video walkthrough on the channel covers the demo and the diffusion explainer. If you want the broader context on agent frameworks, the agent frameworks comparison is the right next read.

Mercury 2: The LLM That Doesn't Generate Like an LLM

Diffusion Language Models: How Mercury Changed the LLM Speed Game

AI Agent Frameworks Compared: LangGraph vs CrewAI vs AutoGen vs Claude Agent SDK vs Vercel AI SDK
