TL;DR
Google released DiffusionGemma today, a 26B MoE open model that generates entire 256-token blocks in parallel instead of one token at a time. Here is what that means for latency, local inference, and the post-autoregressive landscape.
Read next
Inception Labs shipped the first reasoning model built on diffusion instead of autoregressive generation. Over 1,000 tokens per second, competitive benchmarks, and a fundamentally different approach to how AI generates text.
8 min readEvery major AI coding tool just went through a pricing shift. Here are the exact numbers for Cursor, GitHub Copilot, Claude Code, Windsurf/Devin, and the Anthropic API - verified from live pricing pages on June 10, 2026.
9 min readAlibaba's newest Qwen release claims flagship-level coding in a 27B dense model. Here is why dense matters, where it fits against the 480B MoE coder, and what it unlocks for local inference.
7 min readEvery language model you use today is, in the words of the Google blog post that landed on Hacker News this morning, a typewriter. One token, then the next, left to right, each step waiting for the previous one to finish. DiffusionGemma, Google's new experimental open model, tries to be a printing press instead - stamping out an entire paragraph at once.
The announcement hit 237 points on Hacker News within hours of posting on June 10, 2026. The thread is worth reading not just for the enthusiasm but for the pointed skepticism, which tells you as much about the tradeoffs as the launch blog does.
Last updated: June 10, 2026
DiffusionGemma is a 26B Mixture of Experts model released under Apache 2.0. During inference it activates only 3.8B parameters, which means it fits within 18 GB of VRAM when quantized - accessible on a high-end consumer GPU.
The headline speed claims from Google's own post: 1,000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on an RTX 5090. Those numbers come with a footnote that matters: the speedup is designed for local and low-concurrency inference. The blog is explicit that "in high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs."
The weights are available on Hugging Face right now. Serving integrations are live for MLX, vLLM (with Red Hat support), and Hugging Face Transformers. llama.cpp support is described as "arriving soon." NVIDIA is hosting a free inference endpoint at build.nvidia.com. The model is also available through Google's Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM.
The mechanism is meaningfully different from anything in the autoregressive stack, so it is worth being precise.
A standard language model generates token N only after token N-1 is committed. This is causal, left-to-right, and sequential by design. It is also efficient at scale because cloud providers can batch thousands of requests together so the GPU is never idle.
DiffusionGemma works differently:
The developer guide from Google's blog team describes the underlying mechanism as "Uniform State Diffusion" - every canvas position attends to all others simultaneously during each denoising step. That bidirectional attention is the architectural departure. Autoregressive models use causal masks specifically to prevent future tokens from influencing earlier ones. DiffusionGemma removes that constraint entirely within a block.
This is not speculative decoding. The HN thread had several comments conflating the two. Speculative decoding uses a smaller draft model to propose tokens and a larger model to verify them, still operating sequentially. DiffusionGemma generates the entire block and refines it in parallel.
HN user samuelknight put it clearly: "On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because the consumer RAM like LPDDRx/GDDRx is lower bandwidth than HBM, and the requests are serial so you can't batch compute common weights. Diffusion can compute tokens in parallel which relieves the memory bandwidth bottleneck."
The Google blog post makes the same point in different language. Autoregressive inference on a local GPU is memory-bandwidth-bound. You load the full weight tensor once per token. DiffusionGemma shifts the bottleneck to compute by giving the GPU a large parallel workload for each forward pass - the tensor cores that would sit idle during sequential generation get fully utilized instead.
This is why the H100 and RTX 5090 numbers are so much higher than what developers are used to seeing from comparably sized models. At 26B total parameters with only 3.8B active, the arithmetic intensity of generating 256 tokens in one pass is dramatically higher than generating one token at a time.
Google's footnote on Apple Silicon is worth quoting directly: "unified-memory architectures like those in Apple Silicon Macs - which are often memory-bandwidth-bound rather than compute-bound during inference - may not see the same acceleration over autoregressive models like Gemma 4." If you are running on an M-series Mac, the gains may be much smaller than the headline numbers suggest.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 10, 2026 • 9 min read
Jun 10, 2026 • 7 min read
Jun 10, 2026 • 8 min read
Jun 10, 2026 • 8 min read
Google does not hide the quality gap. The launch post states plainly: "DiffusionGemma's overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4."
The HN thread was direct about this. User famouswaffles wrote: "The quality gap between diffusion and autoregressive models is pretty stark. I mean just look at the benchmarks here. Large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models become negated at scale."
User horsawlarway offered a more optimistic reframe: "Does the quality drop from a diffusion model outweigh the quality bump from using a larger model? Because if not... sounds like diffusion models have a lot of space to thrive." The argument being that if diffusion lets you run a 70B model at the speed of a 20B autoregressive model, the net quality might come out ahead even if diffusion introduces some degradation.
| Autoregressive (e.g. Gemma 4) | Diffusion (DiffusionGemma) | |
|---|---|---|
| Latency model | Sequential - each token depends on the last; latency scales with output length | Block parallel - 256 tokens per denoising cycle; latency scales with number of cycles |
| Output quality | Established SOTA across most benchmarks | Measurably lower on current models, especially on hard reasoning tasks |
| Streaming UX | Natural token-by-token streaming | Block-level output; text appears in chunks, not incrementally |
| Local inference efficiency | Memory-bandwidth-bound on consumer hardware | Compute-bound; better GPU utilization on dedicated hardware |
| Cloud serving at scale | Efficient - batching saturates compute across many users | Diminishing returns at high QPS; potentially higher serving cost per query |
| Maturity | Production-ready, wide tooling support | Experimental; llama.cpp support pending |
The use cases Google calls out - inline editing, code infilling, rapid iteration, non-linear text structures - are not coincidental. These are exactly the cases where bidirectional attention is structurally advantageous. Autoregressive models cannot revise an earlier token without running a new forward pass. DiffusionGemma's denoising loop is self-correcting within a block by design.
The Sudoku demonstration in the developer guide is a good concrete example. An autoregressive model solving Sudoku must commit each digit before seeing what comes after. A diffusion model can propagate constraints across the entire grid simultaneously. Fine-tuning DiffusionGemma on a Sudoku dataset raised correctness from near zero to 80%, with the fine-tuned model converging in 12 denoising steps versus 48 for the base model.
For agents running locally - the kind that drive autocomplete, in-editor refactoring, or fast iteration loops - the speed profile changes what is practical. HN user vineyardmike described using Mercury (Inception Labs' commercial diffusion LLM, which we covered in an earlier post) as feeling like "pair-programming instead of the SOTA agentic experience of prompting and waiting." That subjective experience of responsiveness has real product implications when you are building interactive tooling.
One comment in the thread that deserves attention came from chc4: "it just me that thinks its kinda weird that they conflate speed in tokens/second and latency, when i think of latency as time to first token? like it generates an entire paragraph of tokens faster but wouldn't it still be slower if your reply is only 1 word because it has to do the entire 256 tokens as a chunk." That is a legitimate point. Time to first visible output for short responses may actually be worse with diffusion, because the model must complete a full denoising cycle before committing a block. The streaming UX is fundamentally different.
The weights are on Hugging Face under Apache 2.0 as google/diffusiongemma-26B-A4B-it. The developer guide at developers.googleblog.com walks through serving with vLLM and Hugging Face Transformers. The Hackable Diffusion JAX toolbox at github.com/google/hackable_diffusion is the official fine-tuning path. Unsloth also has documented support. NVIDIA's free endpoint at build.nvidia.com/google/diffusiongemma-26b-a4b-it requires account verification but is live today. Simon Willison posted a working demo generating SVG in the HN thread.
For quantized deployment on consumer hardware, NVIDIA published optimization guidance for GeForce RTX 5090 and 4090, including NVFP4 kernel support on Hopper and Blackwell architectures.
We have now seen commercial diffusion LLMs from Inception Labs (Mercury, Mercury 2) and an open research model from Google. The research community has been exploring discrete diffusion for text for several years, but the combination of scale, open weights, and production-grade tooling integrations from a major lab is new. HN user kkukshtel called it "the sort of left-field rumble that turns into a quake in 5 years," which captures the community mood: not dismissing it, not overclaiming, watching carefully.
The honest state of play as of today is that diffusion text generation is faster on dedicated hardware for local and low-concurrency workloads, worse on quality by a measurable amount, and structurally interesting for tasks where bidirectionality matters (infilling, constrained generation, iterative editing). It is not a replacement for autoregressive models in production cloud deployments yet. Whether the quality gap closes over the next generation of training is the open question.
DiffusionGemma is a 26B Mixture of Experts open model from Google that generates text using a diffusion process instead of autoregressive token-by-token generation. It starts with a block of random placeholder tokens and iteratively refines them in parallel, which allows it to generate 256 tokens per forward pass rather than one. Regular Gemma 4 uses standard autoregressive decoding and produces higher quality output but is slower on local hardware.
Google claims 1,000+ tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090. The model activates only 3.8B parameters during inference (from a 26B total parameter MoE architecture) and fits within 18 GB of VRAM when quantized. The speed advantage is most pronounced on dedicated GPU hardware with high arithmetic throughput. Apple Silicon users may see limited improvement due to the unified memory architecture.
Google describes it as experimental. Output quality is acknowledged to be lower than standard Gemma 4, with larger gaps on harder reasoning benchmarks. llama.cpp support is not yet available. The model is suited for research, rapid prototyping, local development workflows, and latency-sensitive interactive applications where some quality tradeoff is acceptable. For production workloads requiring maximum quality, Google recommends standard Gemma 4.
The weights are available on Hugging Face at google/diffusiongemma-26B-A4B-it under an Apache 2.0 license. NVIDIA hosts a free inference endpoint at build.nvidia.com that requires account creation. The model is also available in Google's Gemini Enterprise Agent Platform Model Garden. Serving support is live for MLX, vLLM, and Hugging Face Transformers. Fine-tuning tutorials are available via the Hackable Diffusion JAX toolbox and Unsloth.
Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Google's open-source coding CLI. Free tier with Gemini 2.5 Pro. Supports tool use, file editing, shell commands. 1M toke...
View ToolGoogle's frontier model family. Gemini 2.5 Pro has 1M token context and top-tier coding benchmarks. Gemini 3 Pro pushes...
View ToolThe TypeScript toolkit for building AI apps. Unified API across OpenAI, Anthropic, Google. Streaming, tool calling, stru...
View ToolAI voice dictation for macOS. Works in any app - code editors, browsers, notes. Understands context and formats output...
View ToolCatch silent GA breakage before a quarter of data goes missing.
View AppTurn community complaints and requests into validated product bets and weekly briefs.
View AppDocument API key ownership, rotation context, and integration notes without storing secrets.
View AppPrevent bloating the main conversation with research or exploration.
Claude CodeInstall the dd CLI and scaffold your first AI-powered app in under a minute.
Getting StartedConfigure Claude Code for maximum productivity -- CLAUDE.md, sub-agents, MCP servers, and autonomous workflows.
AI Agents
Inception Labs shipped the first reasoning model built on diffusion instead of autoregressive generation. Over 1,000 tok...
Every major AI coding tool just went through a pricing shift. Here are the exact numbers for Cursor, GitHub Copilot, Cla...

Antigravity marks the first release from a team that originated at Windsurf. After selling non-exclusive IP rights, the...
A hands-on look at Mastra, the open source TypeScript framework for building production-ready AI agents and workflows --...

Google's skills repo is a useful signal: agents do not just need generic coding help. They need product-specific operati...

Opus 4.7 is here. Sharper coding, longer agentic runs, better tool use, and a price that finally makes Opus livable for...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.