DiffusionGemma: Google Bets Diffusion Can Make Text Generation 4x Faster

Q: Where can I download or try DiffusionGemma today?

The weights are available on Hugging Face at `google/diffusiongemma-26B-A4B-it` under an Apache 2.0 license. NVIDIA hosts a free inference endpoint at build.nvidia.com that requires account creation. The model is also available in Google's Gemini Enterprise Agent Platform Model Garden. Serving support is live for MLX, vLLM, and Hugging Face Transformers. Fine-tuning tutorials are available via the Hackable Diffusion JAX toolbox and Unsloth. ---

Every language model you use today is, in the words of the Google blog post that landed on Hacker News this morning, a typewriter. One token, then the next, left to right, each step waiting for the previous one to finish. DiffusionGemma, Google's new experimental open model, tries to be a printing press instead - stamping out an entire paragraph at once.

The announcement hit 237 points on Hacker News within hours of posting on June 10, 2026. The thread is worth reading not just for the enthusiasm but for the pointed skepticism, which tells you as much about the tradeoffs as the launch blog does.

Last updated: June 10, 2026

What Google actually shipped#

DiffusionGemma is a 26B Mixture of Experts model released under Apache 2.0. During inference it activates only 3.8B parameters, which means it fits within 18 GB of VRAM when quantized - accessible on a high-end consumer GPU.

The headline speed claims from Google's own post: 1,000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on an RTX 5090. Those numbers come with a footnote that matters: the speedup is designed for local and low-concurrency inference. The blog is explicit that "in high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs."

The weights are available on Hugging Face right now. Serving integrations are live for MLX, vLLM (with Red Hat support), and Hugging Face Transformers. llama.cpp support is described as "arriving soon." NVIDIA is hosting a free inference endpoint at build.nvidia.com. The model is also available through Google's Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM.

How diffusion text generation works#

The mechanism is meaningfully different from anything in the autoregressive stack, so it is worth being precise.

A standard language model generates token N only after token N-1 is committed. This is causal, left-to-right, and sequential by design. It is also efficient at scale because cloud providers can batch thousands of requests together so the GPU is never idle.

DiffusionGemma works differently:

The canvas. The model starts with a 256-token block filled with random placeholder tokens.
Iterative denoising. Over multiple forward passes, highly confident tokens lock in first. Those tokens then serve as context clues to resolve adjacent positions. The sequence converges from noise to coherent text.
Block autoregressive chaining. For outputs longer than 256 tokens, once a block is fully denoised it is committed to the KV cache and the model initializes a fresh canvas for the next block.

The developer guide from Google's blog team describes the underlying mechanism as "Uniform State Diffusion" - every canvas position attends to all others simultaneously during each denoising step. That bidirectional attention is the architectural departure. Autoregressive models use causal masks specifically to prevent future tokens from influencing earlier ones. DiffusionGemma removes that constraint entirely within a block.

This is not speculative decoding. The HN thread had several comments conflating the two. Speculative decoding uses a smaller draft model to propose tokens and a larger model to verify them, still operating sequentially. DiffusionGemma generates the entire block and refines it in parallel.

Why local inference is where this matters#

HN user samuelknight put it clearly: "On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because the consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute common weights. Diffusion can compute tokens in parallel which relieves the memory bandwidth bottleneck."

The Google blog post makes the same point in different language. Autoregressive inference on a local GPU is memory-bandwidth-bound. You load the full weight tensor once per token. DiffusionGemma shifts the bottleneck to compute by giving the GPU a large parallel workload for each forward pass - the tensor cores that would sit idle during sequential generation get fully utilized instead.

This is why the H100 and RTX 5090 numbers are so much higher than what developers are used to seeing from comparably sized models. At 26B total parameters with only 3.8B active, the arithmetic intensity of generating 256 tokens in one pass is dramatically higher than generating one token at a time.

Google's footnote on Apple Silicon is worth quoting directly: "unified-memory architectures like those in Apple Silicon Macs - which are often memory-bandwidth-bound rather than compute-bound during inference - may not see the same acceleration over autoregressive models like Gemma 4." If you are running on an M-series Mac, the gains may be much smaller than the headline numbers suggest.

Newsletter

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools, delivered free every week.

From the archive

Claude Fable 5 API: Production Integration Patterns, Rate Limits, and Migration Gotchas

Jun 10, 2026 • 9 min read

Fable 5 on AWS Bedrock: When Your Data Leaves the AWS Boundary

Jun 10, 2026 • 7 min read

Fable 5 Broke Enterprise ZDR Agreements: What Dev Teams Must Do Now

Jun 10, 2026 • 8 min read

Fable 5 for Government and Regulated Teams: The GovCloud Question

Jun 10, 2026 • 8 min read

The quality tradeoff is real and acknowledged#

Google does not hide the quality gap. The launch post states plainly: "DiffusionGemma's overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4."

The HN thread was direct about this. User famouswaffles wrote: "The quality gap between diffusion and autoregressive models is pretty stark. I mean just look at the benchmarks here. Large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models become negated at scale."

User horsawlarway offered a more optimistic reframe: "Does the quality drop from a diffusion model outweigh the quality bump from using a larger model? Because if not... sounds like diffusion models have a lot of space to thrive." The argument being that if diffusion lets you run a 70B model at the speed of a 20B autoregressive model, the net quality might come out ahead even if diffusion introduces some degradation.

Autoregressive vs. diffusion at a glance#

	Autoregressive (e.g. Gemma 4)	Diffusion (DiffusionGemma)
Latency model	Sequential - each token depends on the last; latency scales with output length	Block parallel - 256 tokens per denoising cycle; latency scales with number of cycles
Output quality	Established SOTA across most benchmarks	Measurably lower on current models, especially on hard reasoning tasks
Streaming UX	Natural token-by-token streaming	Block-level output; text appears in chunks, not incrementally
Local inference efficiency	Memory-bandwidth-bound on consumer hardware	Compute-bound; better GPU utilization on dedicated hardware
Cloud serving at scale	Efficient - batching saturates compute across many users	Diminishing returns at high QPS; potentially higher serving cost per query
Maturity	Production-ready, wide tooling support	Experimental; llama.cpp support pending

What it means for developers building real things#

The use cases Google calls out - inline editing, code infilling, rapid iteration, non-linear text structures - are not coincidental. These are exactly the cases where bidirectional attention is structurally advantageous. Autoregressive models cannot revise an earlier token without running a new forward pass. DiffusionGemma's denoising loop is self-correcting within a block by design.

The Sudoku demonstration in the developer guide is a good concrete example. An autoregressive model solving Sudoku must commit each digit before seeing what comes after. A diffusion model can propagate constraints across the entire grid simultaneously. Fine-tuning DiffusionGemma on a Sudoku dataset raised correctness from near zero to 80%, with the fine-tuned model converging in 12 denoising steps versus 48 for the base model.

For agents running locally - the kind that drive autocomplete, in-editor refactoring, or fast iteration loops - the speed profile changes what is practical. HN user vineyardmike described using Mercury (Inception Labs' commercial diffusion LLM, which we covered in an earlier post) as feeling like "pair-programming instead of the SOTA agentic experience of prompting and waiting." That subjective experience of responsiveness has real product implications when you are building interactive tooling.

One comment in the thread that deserves attention came from chc4: "it just me that thinks its kinda weird that they conflate speed in tokens/second and latency, when i think of latency as time to first token? like it generates an entire paragraph of tokens faster but wouldn't it still be slower if your reply is only 1 word because it has to do the entire 256 tokens as a chunk." That is a legitimate point. Time to first visible output for short responses may actually be worse with diffusion, because the model must complete a full denoising cycle before committing a block. The streaming UX is fundamentally different.

How to try it#

The weights are on Hugging Face under Apache 2.0 as google/diffusiongemma-26B-A4B-it. The developer guide at developers.googleblog.com walks through serving with vLLM and Hugging Face Transformers. The Hackable Diffusion JAX toolbox at github.com/google/hackable_diffusion is the official fine-tuning path. Unsloth also has documented support. NVIDIA's free endpoint at build.nvidia.com/google/diffusiongemma-26b-a4b-it requires account verification but is live today. Simon Willison posted a working demo generating SVG in the HN thread.

For quantized deployment on consumer hardware, NVIDIA published optimization guidance for GeForce RTX 5090 and 4090, including NVFP4 kernel support on Hopper and Blackwell architectures.

What this signals#

We have now seen commercial diffusion LLMs from Inception Labs (Mercury, Mercury 2) and an open research model from Google. The research community has been exploring discrete diffusion for text for several years, but the combination of scale, open weights, and production-grade tooling integrations from a major lab is new. HN user kkukshtel called it "the sort of left-field rumble that turns into a quake in 5 years," which captures the community mood: not dismissing it, not overclaiming, watching carefully.

The honest state of play as of today is that diffusion text generation is faster on dedicated hardware for local and low-concurrency workloads, worse on quality by a measurable amount, and structurally interesting for tasks where bidirectionality matters (infilling, constrained generation, iterative editing). It is not a replacement for autoregressive models in production cloud deployments yet. Whether the quality gap closes over the next generation of training is the open question.

FAQ#

What is DiffusionGemma and how is it different from regular Gemma?#

DiffusionGemma is a 26B Mixture of Experts open model from Google that generates text using a diffusion process instead of autoregressive token-by-token generation. It starts with a block of random placeholder tokens and iteratively refines them in parallel, which allows it to generate 256 tokens per forward pass rather than one. Regular Gemma 4 uses standard autoregressive decoding and produces higher quality output but is slower on local hardware.

How fast is DiffusionGemma and what hardware does it need?#

Google claims 1,000+ tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090. The model activates only 3.8B parameters during inference (from a 26B total parameter MoE architecture) and fits within 18 GB of VRAM when quantized. The speed advantage is most pronounced on dedicated GPU hardware with high arithmetic throughput. Apple Silicon users may see limited improvement due to the unified memory architecture.

Is DiffusionGemma production ready?#

Google describes it as experimental. Output quality is acknowledged to be lower than standard Gemma 4, with larger gaps on harder reasoning benchmarks. llama.cpp support is not yet available. The model is suited for research, rapid prototyping, local development workflows, and latency-sensitive interactive applications where some quality tradeoff is acceptable. For production workloads requiring maximum quality, Google recommends standard Gemma 4.

Where can I download or try DiffusionGemma today?#

The weights are available on Hugging Face at google/diffusiongemma-26B-A4B-it under an Apache 2.0 license. NVIDIA hosts a free inference endpoint at build.nvidia.com that requires account creation. The model is also available in Google's Gemini Enterprise Agent Platform Model Garden. Serving support is live for MLX, vLLM, and Hugging Face Transformers. Fine-tuning tutorials are available via the Hackable Diffusion JAX toolbox and Unsloth.

Sources#

Google Blog announcement: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
Google Developer Guide: https://developers.googleblog.com/en/diffusiongemma-the-developer-guide
Hacker News thread (237 points): https://news.ycombinator.com/item?id=48478471
HN Algolia API (story metadata): https://hn.algolia.com/api/v1/search?query=diffusiongemma&tags=story
Developers Digest Mercury 2 post (slug verified): https://www.developersdigest.tech/blog/mercury-2-diffusion-llm

Last updated: June 10, 2026

What Google actually shipped#

How diffusion text generation works#

The mechanism is meaningfully different from anything in the autoregressive stack, so it is worth being precise.

DiffusionGemma works differently:

The canvas. The model starts with a 256-token block filled with random placeholder tokens.
Iterative denoising. Over multiple forward passes, highly confident tokens lock in first. Those tokens then serve as context clues to resolve adjacent positions. The sequence converges from noise to coherent text.
Block autoregressive chaining. For outputs longer than 256 tokens, once a block is fully denoised it is committed to the KV cache and the model initializes a fresh canvas for the next block.

Why local inference is where this matters#

Newsletter

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools, delivered free every week.

From the archive

The quality tradeoff is real and acknowledged#

Autoregressive vs. diffusion at a glance#

	Autoregressive (e.g. Gemma 4)	Diffusion (DiffusionGemma)
Latency model	Sequential - each token depends on the last; latency scales with output length	Block parallel - 256 tokens per denoising cycle; latency scales with number of cycles
Output quality	Established SOTA across most benchmarks	Measurably lower on current models, especially on hard reasoning tasks
Streaming UX	Natural token-by-token streaming	Block-level output; text appears in chunks, not incrementally
Local inference efficiency	Memory-bandwidth-bound on consumer hardware	Compute-bound; better GPU utilization on dedicated hardware
Cloud serving at scale	Efficient - batching saturates compute across many users	Diminishing returns at high QPS; potentially higher serving cost per query
Maturity	Production-ready, wide tooling support	Experimental; llama.cpp support pending

What it means for developers building real things#

How to try it#

For quantized deployment on consumer hardware, NVIDIA published optimization guidance for GeForce RTX 5090 and 4090, including NVFP4 kernel support on Hopper and Blackwell architectures.

What this signals#

FAQ#

What is DiffusionGemma and how is it different from regular Gemma?#

How fast is DiffusionGemma and what hardware does it need?#

Is DiffusionGemma production ready?#

Where can I download or try DiffusionGemma today?#

Sources#

Google Blog announcement: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
Google Developer Guide: https://developers.googleblog.com/en/diffusiongemma-the-developer-guide
Hacker News thread (237 points): https://news.ycombinator.com/item?id=48478471
HN Algolia API (story metadata): https://hn.algolia.com/api/v1/search?query=diffusiongemma&tags=story
Developers Digest Mercury 2 post (slug verified): https://www.developersdigest.tech/blog/mercury-2-diffusion-llm

What Google actually shipped#

How diffusion text generation works#

Why local inference is where this matters#

Claude Fable 5 API: Production Integration Patterns, Rate Limits, and Migration Gotchas

Fable 5 on AWS Bedrock: When Your Data Leaves the AWS Boundary

Fable 5 Broke Enterprise ZDR Agreements: What Dev Teams Must Do Now

Fable 5 for Government and Regulated Teams: The GovCloud Question

The quality tradeoff is real and acknowledged#

Autoregressive vs. diffusion at a glance#

What it means for developers building real things#

How to try it#

What this signals#

FAQ#

What is DiffusionGemma and how is it different from regular Gemma?#

How fast is DiffusionGemma and what hardware does it need?#

Is DiffusionGemma production ready?#

Where can I download or try DiffusionGemma today?#

Sources#

Mercury 2: The LLM That Doesn't Generate Like an LLM

AI Coding Tools Pricing Comparison 2026

Ornith-1.0: What an Open Source Self-Improving Coding Model Actually Means

Related Tools

Gemini CLI

Gemini

Vercel AI SDK

Wispr Flow

Apps from Developers Digest

DD GA

Community Insight Engine

Key Vault

Related Guides

Subagent Context Isolation - Claude Code

Getting Started with DevDigest CLI

Claude Code Setup Guide

Related Videos

Gemini CLI in 6 Minutes: Google's Free and Open-Source Coding Assistant

Genie 2: Google's New AI Model Turns One Image into Infinite Playable Worlds

Gemma 2 Aims for the Crown! Google's Latest Open-Weights 9B & 27B Models

Related Posts

Mercury 2: The LLM That Doesn't Generate Like an LLM

AI Coding Tools Pricing Comparison 2026

Noam Shazeer Joins OpenAI After Two Years Back at Google

Antigravity: Google's Agentic Code Editor

Echo Claims Fable-Level Results at One-Third the Cost Using Open-Weight Models

Kimi K3 in 10 Minutes: Moonshot AI's 2.8T Open Model, API Setup, Pricing, and Benchmarks

Build with the member tools

Get Smarter About AI Dev

What Google actually shipped#

How diffusion text generation works#

Why local inference is where this matters#

Claude Fable 5 API: Production Integration Patterns, Rate Limits, and Migration Gotchas

Fable 5 on AWS Bedrock: When Your Data Leaves the AWS Boundary

Fable 5 Broke Enterprise ZDR Agreements: What Dev Teams Must Do Now

Fable 5 for Government and Regulated Teams: The GovCloud Question

The quality tradeoff is real and acknowledged#

Autoregressive vs. diffusion at a glance#

What it means for developers building real things#

How to try it#

What this signals#

FAQ#

What is DiffusionGemma and how is it different from regular Gemma?#

How fast is DiffusionGemma and what hardware does it need?#

Is DiffusionGemma production ready?#

Where can I download or try DiffusionGemma today?#

Sources#

Mercury 2: The LLM That Doesn't Generate Like an LLM

AI Coding Tools Pricing Comparison 2026

Ornith-1.0: What an Open Source Self-Improving Coding Model Actually Means

Related Tools

Gemini CLI

Gemini

Vercel AI SDK

Wispr Flow

Apps from Developers Digest

DD GA

Community Insight Engine

Key Vault

Related Guides

Subagent Context Isolation - Claude Code

Getting Started with DevDigest CLI