
Mercury Two: The First Reasoning Diffusion LLM (1,000+ tokens/sec) - Speed Without Sacrificing Quality

Inception Labs releases Mercury Two, a reasoning diffusion-based LLM that exceeds 1,000 tokens per second by generating multiple tokens per forward pass and iteratively refining output, rather than using autoregressive token-by-token generation. The script compares its throughput to Haiku (~89 t/s) and GPT-5 Mini (~71 t/s) and argues diffusion provides built-in error correction that can improve reasoning. Mercury Two is presented as maintaining quality while being fast, tying GPT-5 Mini on AIME 2025 at 91.1 and scoring competitively on GPQA and LiveCodeBench. A demo shows Mercury Two versus Haiku with selectable reasoning levels (instant/low/medium/high) and an agentic workflow that uses browser tool calls to find and summarize AI-related Hacker News stories and comments, emphasizing reduced latency in tool-heavy loops. The model supports tool use, structured outputs, RAG, and a 128k context window, and is priced at $0.25 per million input tokens and $0.75 per million output tokens. The script notes an OpenAI-compatible API (swap base URL/model string/API key) and mentions the demo uses Vercel's AI SDK, with code to be linked in the video description. It contrasts industry efforts focused on incremental autoregressive inference optimizations with Mercury Two's model-level approach, highlighting latency-sensitive use cases like voice interfaces, coding iteration, and chat apps, and encourages viewers to try the API platform and playground.

Try Mercury 2:
API Platform: http://platform.inceptionlabs.ai/
Playground: https://chat.inceptionlabs.ai/

Inception is a Palo Alto-based AI lab founded by researchers from Stanford, UCLA, and Cornell - including Stefano Ermon, co-inventor of the diffusion methods behind modern image and video generation.
Backed by Menlo Ventures, M12 (Microsoft), NVentures (NVIDIA), Databricks, and individual investors including Andrew Ng, Andrej Karpathy, and Eric Schmidt.

Repo to Demo App: Coming soon!

This video is sponsored by Inception Labs.

Chapters:
00:00 Mercury Two Breakthrough
00:20 Why Speed Used to Cost Quality
00:43 Diffusion Reasoning Explained
01:50 Speed and Benchmark Results
02:16 Live Demo Versus Haiku
02:40 Agentic Tool Use Example
03:40 API Setup and Pricing
04:36 Best Use Cases for Low Latency
05:16 Diffusion vs Autoregressive
06:11 Industry Race and Big Picture
07:14 Wrap Up and Try the API
---
type: transcript
date: 2026-02-24
youtube_id: quOe8V2n9rU
---

# Transcript: Mercury 2: The First Reasoning Diffusion Language Model (1,000+ tokens/sec)

Inception Labs has just released Mercury 2, a reasoning model that does over 1,000 tokens per second. The crazy part: it's built on diffusion, not autoregressive generation. Let me show you what that means.

So, if you've been following my channel, you might remember that I covered the original Mercury model when it first came out. That video broke down how diffusion models could work for text generation. Now, think back to a couple of years ago, when fast inference originally came on the scene with specialized hardware from companies like Groq. Everyone got excited about the raw inference speed, and rightfully so. But the models that could run at that kind of speed were generally pretty limited. They generally couldn't do tool calling very well, they struggled with complex reasoning, and they scored lower across most benchmarks. It was speed at a real cost.

Now, Mercury 2 is completely different. This is the first reasoning diffusion LLM. It's not autoregressive; it's built on diffusion. This is the same fundamental approach that already won in image and video generation. And the people who built those diffusion methods are the ones who founded Inception Labs. Now they're applying those same techniques to large language models.

So first off, what makes this different from just fast inference? The speed comes from the model itself, not from better hardware optimization. To break this down a little: the way diffusion works is that it generates multiple tokens per forward pass instead of one. And this approach is not just an incremental improvement. It's a fundamentally different approach to how generation works, and that's why diffusion and reasoning actually work well together.
Because what diffusion does is revisit and refine its output during generation. It's built-in error correction. Autoregressive models, by contrast, commit to each token and move on. If they make a mistake early on, it can cascade into the subsequent steps of what the LLM generates. Diffusion can catch and fix mistakes as it goes, across the whole output.

And the numbers back it up. Mercury 2 completes reasoning much faster than the models that are out there. Compare the throughput: over 1,000 tokens per second, versus about 89 for Haiku and about 71 for GPT-5 Mini. But speed without quality doesn't matter. Mercury achieves speed without compromising on quality. It ties GPT-5 Mini on AIME 2025 at 91.1 and scores competitively on GPQA and LiveCodeBench across the board.

Now for a quick demonstration. On the left-hand side I have Haiku 4.5 selected, and on the right-hand side I have Mercury 2. The thing to note with Mercury 2 is that you can select the level of reasoning if you're using it from the API: instant, low, medium, or high. Right off the bat, you'll notice that this is much, much faster. But where these capabilities increasingly come into play is with the model's capability for tool use.

Now I want to do a quick demonstration of a little agentic application that I built. Okay, so to demonstrate this further, I'm going to say: open up a browser, go to Hacker News, and find the top stories related to AI. Once you've found them, summarize what each of them is about, then find the comments and what everyone is saying about each particular story. I'm going to go ahead and send in this task. Now, the really great thing with a model like this is that by having the inference speed be as fast as it is, all of the different tool calls you have within an application will occur much, much quicker.
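To put those throughput figures in perspective for a tool-heavy agent loop like this one, here's a back-of-the-envelope sketch. The tokens-per-second numbers are the ones quoted above; the step count and tokens per step are illustrative assumptions, not measurements from the demo.

```python
# Throughputs quoted in the video (tokens per second).
THROUGHPUT_TPS = {"Mercury 2": 1000, "Haiku": 89, "GPT-5 Mini": 71}

# Illustrative assumptions about the agent loop, not measured values.
STEPS = 10             # sequential tool-calling steps in the loop
TOKENS_PER_STEP = 500  # tokens the model generates at each step

for model, tps in THROUGHPUT_TPS.items():
    seconds = STEPS * TOKENS_PER_STEP / tps
    print(f"{model:>10}: {seconds:6.1f} s of pure generation time")
```

At these assumed sizes, generation alone takes about 5 seconds at 1,000 tokens per second versus roughly a minute or more at the quoted autoregressive throughputs, which is why the savings compound in loops where every step waits on the previous one.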
Additionally, any context that we either have to generate for those tools, or that we extract from what we're asking for, is going to be produced much, much quicker because of that faster inference. So we're able to move through the task much faster. The other thing to know about the model is that it has 128,000 tokens of context, so you're going to be able to ingest an awful lot of context for the different tasks you're asking for. If you're interested in any of the code that I'm showing here, I'll put a link to it in the description of the video shortly after it goes live.

Now, to dive into some of the details. First thing right off the bat: if you do want to try this out, they have an OpenAI-compatible API. You can swap in the base URL for Inception, along with the model string and API key, and you'll be able to try this out within any application where you're leveraging an OpenAI model. The demonstration I just showed you was leveraging the AI SDK from Vercel, so you'll be able to incorporate it easily into any of the agentic frameworks you're using.

Now, in terms of some of the use cases: as you saw me demonstrate, you can leverage this with tool use. You can leverage it with structured outputs and RAG. It has 128,000 tokens of context, like I mentioned, and it's priced at 25 cents per million tokens of input and 75 cents per million tokens of output. Just to put this price into perspective: given its speed, this makes it one of the most cost-competitive models out there. If you consider the intelligence, speed, and price dynamics, this is going to be a very compelling option for a whole host of applications. Like I demonstrated in the application, this model is really going to shine in latency-sensitive applications: anything that involves an agent loop where every tool call adds to wait time. Think of things like voice interfaces.
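Going back to the OpenAI-compatible API point above: since only the base URL, model string, and API key change, the request body keeps the familiar chat-completions shape. Here is a minimal, stdlib-only sketch of that body. The base URL, the `mercury-2` model string, and the `reasoning_effort` field name are assumptions for illustration; check Inception's API docs for the actual values.

```python
import json

# Assumed values for illustration; consult Inception's API documentation
# for the real base URL, model string, and reasoning-level parameter.
BASE_URL = "https://api.inceptionlabs.ai/v1"  # hypothetical
MODEL = "mercury-2"                           # hypothetical

def build_request(prompt: str, effort: str = "low") -> dict:
    """Build a chat-completions body to POST to {BASE_URL}/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # The video mentions four selectable reasoning levels:
        # instant, low, medium, high (field name assumed here).
        "reasoning_effort": effort,
    }

body = build_request("Find the top AI stories on Hacker News.", effort="high")
print(json.dumps(body, indent=2))
```

Because the shape matches OpenAI's, the same body works through any OpenAI-compatible client or framework (such as the Vercel AI SDK used in the demo) once the base URL and key are pointed at Inception.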
With voice interfaces, P95 latency really determines whether the experience feels natural at all. Additionally, with coding workflows and iteration cycles, you're going to be able to prompt, review, and tweak in rapid succession. And even with chat-based, consumer-facing applications similar to the one I showed you, you're going to benefit from having a much faster model. It's just going to feel like that much more compelling an experience when speed backs up the actual capabilities of your application.

Now, to touch on how diffusion LLMs work in comparison to what we're used to. Every LLM that you use today is autoregressive: it generates one token at a time, sequentially. Token one is locked before token two begins. If the reasoning drifts early, too bad; it can only move forward. Think about your experience using ChatGPT or Claude: each token is generated sequentially, based on what came just before it. Diffusion models are completely different. Instead of generating left to right, they start with noise and iteratively refine the output in parallel. You can see this in image and video generation models: before the final output, you'll see a rough representation of the image that gets finer and finer as it goes through more cycles. So you can think of it like this: autoregressive is a typewriter, where each keystroke is permanent, whereas diffusion is an editor looking at the entire document. It starts with a rough draft and sharpens the whole thing with each pass.

Okay. Now, if I take a look at Artificial Analysis, the entire industry is really racing to solve the inference problem. OpenAI, NVIDIA, Fireworks, Groq, you name it: basically everyone out there. Billions and billions of dollars are being spent to make models faster. NVIDIA just recently acquired Groq, for instance, for $20 billion, and that was in large part for their fast inference speed.
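The typewriter-versus-editor contrast above can be sketched as a toy: an autoregressive loop spends one forward pass per token and never revisits a position, while a diffusion-style loop starts from noise and refines every position on every pass. This is purely a conceptual illustration under made-up assumptions, not how Mercury 2 is actually implemented.

```python
import random

random.seed(0)
TARGET = list("hello world")  # the "correct" output, for the toy

def autoregressive(target):
    """Typewriter: one pass per token, and each commitment is final."""
    out = []
    for tok in target:
        out.append(tok)          # locked in; later passes cannot revise it
    return out, len(target)      # N tokens cost N passes

def diffusion(target, passes=4):
    """Editor: start from noise, refine all positions in parallel each pass."""
    out = [random.choice("abcdefghij ") for _ in target]  # pure noise
    for step in range(1, passes + 1):
        for i, tok in enumerate(target):
            # Every pass, each position gets another chance to be corrected,
            # which models the built-in error correction described above.
            if random.random() < step / passes:
                out[i] = tok
    return out, passes

ar_out, ar_passes = autoregressive(TARGET)
df_out, df_passes = diffusion(TARGET)
print("autoregressive:", "".join(ar_out), f"in {ar_passes} passes")
print("diffusion:     ", "".join(df_out), f"in {df_passes} passes")
```

Both loops arrive at the same text, but the diffusion-style one gets there in 4 passes instead of 11, and a position filled in wrong on an early pass can still be fixed on a later one, whereas the typewriter loop has no such chance.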
But everyone is working within the autoregressive paradigm. Better hardware, better kernels, quantization, distillation: real gains, but they're all incremental. You're squeezing more out of the same fundamental approach. This is where diffusion models and Inception took a fundamentally different path. They solved the speed bottleneck at the model level, not at the infrastructure level. And with reasoning and agentic workflows becoming the norm, really table stakes, in 2026, sequential generation compounds latency. Think about it: every step in an agent loop adds more wait time. With Mercury 2, you don't have to choose between reasoning and speed anymore. You can effectively have both within your application.

So, just to sum up: Mercury 2 is the first reasoning diffusion large language model. It's five times faster than speed-optimized autoregressive models, with competitive quality. And it's a completely different approach to how AI generates text. Whether this becomes the future of how LLMs work, I definitely do not know. But the results are definitely real and here today. And the people behind it literally invented the techniques we see in technologies like Sora, Stable Diffusion, Flux, and all of these different diffusion models that are out there. The same techniques powering all of the beautiful images and videos we see are what they're applying to language models today.

So, if you're interested in trying this out, I encourage you to check out the API platform today. Try out the playground; it's going to be in the description of the video. Go try it, see it for yourself. And if you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.