
# Mercury: A New Diffusion LLM

In today's video, I dive into the exciting launch of Inception Labs' Mercury, the first commercial-grade diffusion large language model. Unlike traditional autoregressive models, Mercury uses a coarse-to-fine approach, drastically reducing inference costs and latency. It's capable of generating over 1,000 tokens per second on Nvidia H100 hardware, making it significantly faster than competitors like GPT-4o mini and Claude 3.5 Haiku. We explore its implementation, visualizations, and potential impact on AI-driven applications. Check out the visual representation of its diffusion process and learn about its impressive benchmarks. If you enjoy cutting-edge AI developments, this video is for you! 🌟🚀

- 00:00 Introduction to Mercury: The First Commercial Diffusion LLM
- 00:30 Understanding Diffusion Models
- 01:17 Performance and Speed Comparison
- 02:04 Real-World Applications and Testing
- 03:38 Technical Insights and Benchmarks
- 06:09 Future Prospects and Conclusion
---
type: transcript
date: 2025-02-27
youtube_id: KMuXaSQCfro
---

# Transcript: Diffusion Large Language Models Are Here

Just today, Inception Labs introduced Mercury, the first commercial-grade diffusion large language model. The traditional large language models we're used to are autoregressive: they generate text sequentially, and one of the drawbacks of that method is that both the inference cost and the latency are higher as a result. What's really interesting with these diffusion large language models is that they generate responses in a coarse-to-fine manner. If we take a close look at the iterations on the right-hand side here, what you'll see is that on the first iteration it produces a very noisy response. Another way to think about this is how diffusion models for image or video generation work: they start out with a very noisy representation of the image or video. As you can see in this example, you pretty much can't even tell on the first frame what the image is, but with each iteration it slowly gets sharper, and after enough steps you get very good representations, whether it's photos or videos.

What's interesting with Inception Labs is that this is the first commercial-grade diffusion large language model. Just to put this into perspective, this model is right around the strength of GPT-4o mini and Claude 3.5 Haiku, although it runs 10 times faster, and part of that is the overall different architecture of how these models work. GPT-4o mini, as shown on the chart here, looks to be in the 60-to-70-tokens-per-second range, whereas the Mercury Coder Small model is at about 750 tokens per second and Mercury Coder Mini is over 1,000. What's interesting with this type of model is that you can run it on Nvidia H100s at these incredibly fast speeds. You don't need specialized chips for inferencing this model; you can
use the pre-existing Nvidia hardware that's already out there.

Another interesting thing with this model is its performance: when it was tested among developers in the Copilot Arena, developers preferred Mercury's generations. It ranked number one on speed and number two on quality, and they describe Mercury as the fastest code LLM on the market. Just to show you another visual of what this looks like, here are Claude, ChatGPT, and Mercury on the right-hand side. Within 6 seconds Mercury was able to generate the response, whereas for the same question it took ChatGPT 36 seconds and Claude 28 seconds.

Now, if you want to try this out, you can go to chat.inceptionlabs.ai. What's really cool is that you can also enable an animation representing the text diffusion process. If we turn that on and try one of the preset examples, "create a JavaScript animation," I'll send it in, and in just a few seconds we have the generation. What's really cool, if you caught it, is seeing that visual representation of what it's doing. Let's generate a few more, because I really find the diffusion effect quite mesmerizing to look at. It comes in very quickly: what's happening is it gives a very coarse output and, over the iterations, refines that output for us.

Let's take a look at the blog post: "We trained diffusion large language models that are up to 10x faster and cheaper than current LLMs, pushing the frontier of intelligence and speed for language models." They offer enterprise clients access to code and generalist models via an API and on-premises deployment. Current large language models are autoregressive, meaning they generate text left to right, one token at a time. Generation is inherently sequential: a token cannot be generated until all the text that comes before it has been generated, and generating each token
requires evaluating a neural network with billions of parameters. Frontier LLM companies are betting on test-time compute to increase reasoning and error-correction capabilities, but generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. A paradigm shift is needed to make high-quality AI solutions truly accessible.

As they describe it, diffusion models provide that paradigm shift. These models operate with a coarse-to-fine generation process where the output is refined from pure noise over a few denoising steps, as illustrated in the video above. And here's a pretty compelling case they make for diffusion models: because diffusion models are not restricted to only considering previous output, they are better at reasoning and at structuring their responses, and because diffusion models can continually refine their output, they can correct mistakes and hallucinations. For these reasons, diffusion powers all the most prominent AI solutions for video, image, and audio generation, including Sora, Midjourney, and Riffusion. However, applications of diffusion to discrete data such as text and code had never been successful until now.

They mention that this Mercury Coder model supports all use cases, so you can use it for RAG, tool use, and agentic workflows. The other interesting thing they mention is that "improvements are suggested by a neural network (in our case, a Transformer model) which is trained on a large amount of data to globally improve the quality of answers by modifying multiple tokens in parallel." As the name implies, this diffusion large language model is specifically optimized for code generation. In terms of benchmarks, what's really compelling are the results of these models: on HumanEval they score 88 and 90, and for the first release of a diffusion large language model those are incredibly strong results. We can see how it compares to
Gemini 2.0 Flash-Lite, Claude 3.5 Haiku, GPT-4o mini, as well as Qwen and DeepSeek. Now, these models aren't comparing themselves to the frontier, but as a first iteration it's really interesting to see where this is going to go, because if these models are already on par with some of the lighter versions of models from the frontier labs, it's going to be extremely interesting to watch over the coming months and years how future releases of these diffusion large language models stack up against releases from the frontier labs.

They mention that even speed-optimized autoregressive models run at most 200 tokens per second, whereas Mercury Coder on a commodity Nvidia H100 can run at over 1,000 tokens per second, a 5x increase. Compared to the frontier models, which are generally slower and run at less than 50 tokens per second, this is a 20x speedup. The interesting thing they call out here is that previously the only way you could get these kinds of speeds was through specialized hardware such as Groq, Cerebras, and SambaNova. They mention that their algorithmic improvements are orthogonal to hardware acceleration and that speedups would compound on faster chips. With that being said, it would be really interesting to see how this model performs on the latest Blackwell chips from Nvidia and what kinds of speeds we'll see from Inception.

Here's just another visual of the speed, and we can see how it stacks up across all of the different models. What's really interesting is when we compare these results to the smallest models available from both Anthropic and OpenAI: those are both sitting at around 60 tokens per second, while Gemini 2.0 Flash-Lite, which just came out, is at around 200 tokens per second. One thing I do want to mention about speed is that as they roll this out to enterprises and developers, it will be interesting to see if they can maintain these speeds
in production, because that's the thing with Claude 3.5 Haiku and especially GPT-4o mini: these are production endpoints that get a lot of traffic, so you have to actually facilitate and triage all of that demand as you roll this out to developers and balance the different hardware that you have.

But overall, that's pretty much it for this video. Let me know your thoughts on this type of model: would you be interested in using this kind of thing within your application? Let me know in the comments below. Otherwise, if you found this video useful, please comment, share, and subscribe. Until the next one!
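The coarse-to-fine process described in the video can be sketched as a toy loop. This is purely illustrative and is not Inception Labs' actual algorithm: a real diffusion LLM uses a trained Transformer to propose updates to many tokens in parallel at each denoising step, whereas here the "denoiser" simply reveals tokens of a fixed target so you can watch a noisy draft sharpen over a few iterations.

```python
import random

# Toy sketch of coarse-to-fine text generation: start from a fully
# "noisy" (masked) sequence and, over a few denoising steps, commit
# several tokens in parallel. Illustration only; a real diffusion LLM
# would use a Transformer to choose the token updates.

TARGET = "diffusion models refine noisy drafts into text".split()
MASK = "___"

def denoise_step(draft, target, n_commit, rng):
    """Reveal up to n_commit still-masked positions (parallel token updates)."""
    masked = [i for i, tok in enumerate(draft) if tok == MASK]
    for i in rng.sample(masked, min(n_commit, len(masked))):
        draft[i] = target[i]
    return draft

def generate(target, steps=4, seed=0):
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    history = [" ".join(draft)]
    per_step = -(-len(target) // steps)  # ceiling division: tokens per step
    for _ in range(steps):
        draft = denoise_step(draft, target, per_step, rng)
        history.append(" ".join(draft))
    return history

for line in generate(TARGET):
    print(line)
```

Each printed line corresponds to one denoising iteration, mirroring the on-screen animation: the first line is all noise, and the count of masked tokens shrinks monotonically until the full sentence appears.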
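To make the speed claims in the video concrete, here is a back-of-the-envelope latency comparison. The tokens-per-second figures are the approximate numbers quoted in the video and read off the chart, not official vendor benchmarks, so treat them as rough assumptions.

```python
# Approximate throughput figures quoted in the video (not official numbers).
throughput_tps = {
    "Mercury Coder Mini": 1000,    # >1,000 tok/s on an H100 per the blog
    "Mercury Coder Small": 750,
    "Gemini 2.0 Flash-Lite": 200,  # "speed-optimized" AR ceiling cited
    "GPT-4o mini": 65,             # midpoint of the 60-70 range on the chart
    "Frontier reasoning models": 50,
}

def seconds_for(n_tokens, tps):
    """Wall-clock time to stream n_tokens at a given throughput."""
    return n_tokens / tps

# Time to stream a 1,000-token response at each quoted speed.
for model, tps in sorted(throughput_tps.items(), key=lambda kv: -kv[1]):
    print(f"{model:>26}: {seconds_for(1000, tps):6.1f} s")

# The multiples the video cites fall out directly:
# ~5x over speed-optimized AR models, ~20x over sub-50-tok/s frontier models.
print("5x check :", 1000 / 200)
print("20x check:", 1000 / 50)
```

At these assumed rates, a 1,000-token answer takes about a second on Mercury Coder Mini versus roughly 15 seconds on GPT-4o mini, which matches the scale of the 6-second-versus-36-second demo shown in the video.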