June 10, 2026
Google has released DiffusionGemma, an experimental open model that uses text diffusion to generate text up to four times faster than conventional language models. It is available under a permissive Apache 2.0 license.
What is different
Most language models write one token at a time, left to right. DiffusionGemma instead drafts an entire block of 256 tokens at once and refines it over several passes, the way image generators sharpen a picture from noise. Google likens the shift to moving from a typewriter to a printing press. The 26-billion-parameter Mixture-of-Experts model activates only about 3.8 billion parameters per step and, when quantized, fits within the 18GB of memory on high-end consumer GPUs.
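To make the contrast concrete, here is a toy sketch that counts model passes for each approach. It is not Google's algorithm: the random `toy_denoiser`, the eight refinement steps, and the rule of committing the most confident positions first are all assumptions chosen for illustration. Only the 256-token block size comes from the announcement.

```python
import random

MASK = None  # marks a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for a model: one pass proposes a
    (token_id, confidence) pair for every position in the block."""
    return [(rng.randrange(1000), rng.random()) for _ in block]


def diffusion_decode(block_size=256, steps=8, seed=0):
    """Fill a whole block in `steps` parallel passes (assumed schedule)."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    passes = 0
    for step in range(steps):
        proposals = toy_denoiser(block, rng)  # one pass covers all positions
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit an equal share of the remaining positions each step,
        # so the block is fully filled by the final step.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            block[i] = proposals[i][0]
    return block, passes


def autoregressive_decode(length=256, seed=0):
    """Baseline: one pass per generated token, strictly left to right."""
    rng = random.Random(seed)
    tokens, passes = [], 0
    for _ in range(length):
        tokens.append(rng.randrange(1000))  # stand-in for next-token sampling
        passes += 1
    return tokens, passes


if __name__ == "__main__":
    _, diffusion_passes = diffusion_decode()
    _, ar_passes = autoregressive_decode()
    print(f"diffusion passes: {diffusion_passes}, autoregressive passes: {ar_passes}")
```

Under these assumptions the block needs 8 passes instead of 256. Each diffusion pass does more work than a single autoregressive step, so the pass count alone does not predict the real-world speedup.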
Speed and trade-offs
Google reports more than 1,000 tokens per second on a single NVIDIA H100 and over 700 on a GeForce RTX 5090. Because it generates in parallel and every token can attend to the others, the model suits interactive, local workflows such as in-line editing and code infilling. The company is candid about the trade-off: output quality is lower than its standard Gemma 4 models, which it still recommends for production-grade results. The speedup also mainly helps local, single-user inference rather than high-volume cloud serving, and may be limited on Apple Silicon’s memory-bound architecture.
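For a rough sense of what those figures mean for interactive use, the short calculation below converts the reported throughput into time per 256-token block. It assumes the reported rates are sustained during generation and ignores prompt processing, neither of which the announcement specifies. The function name `max_seconds_per_block` is made up for this illustration.

```python
BLOCK_TOKENS = 256  # block size stated in the announcement

# Reported throughput floors ("more than" / "over"), in tokens per second.
reported_tps = {"NVIDIA H100": 1000, "GeForce RTX 5090": 700}


def max_seconds_per_block(tokens_per_second, block_tokens=BLOCK_TOKENS):
    """Upper bound on time per block, since the reported rates are lower bounds."""
    return block_tokens / tokens_per_second


if __name__ == "__main__":
    for gpu, tps in reported_tps.items():
        seconds = max_seconds_per_block(tps)
        print(f"{gpu}: at most about {seconds:.3f} s per {BLOCK_TOKENS}-token block")
```

On these numbers a full block arrives in roughly a quarter of a second on the H100 and a little over a third of a second on the RTX 5090.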
Availability
The weights are on Hugging Face, with support across tooling including MLX, vLLM, and Hugging Face Transformers, and optimizations developed with NVIDIA. Google has also published a developer guide and fine-tuning tutorials.
Source: Google, “DiffusionGemma: 4x faster text generation,” June 10, 2026. Facts attributed to Google.