Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation

Google AI team including the Google DeepMind researchers have just released DiffusionGemma, an experimental open model for text generation. It uses text diffusion instead of standard autoregressive decoding. The model ships under a permissive Apache 2.0 license. Google positions it for devs and researchers exploring speed-critical, interactive local workflows. Examples include in-line editing, rapid iteration, and generating non-linear text structures.

Most language models in use today are autoregressive. They generate one token at a time, left to right. Each new token depends on the token before it. DiffusionGemma works differently. It generates entire blocks of text simultaneously, in parallel. On dedicated GPUs, this delivers up to 4x faster generation.

What is DiffusionGemma

DiffusionGemma is a 26B Mixture of Experts (MoE) model. It activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture. Google integrated a diffusion head onto that base.

The model is multimodal. It processes interleaved text, image, and video inputs. It generates text outputs from those inputs. The context window is 256K tokens, and it supports 140+ languages.

Quantized, the model fits within 18GB of VRAM. That places it inside high-end consumer GPU limits. On a single NVIDIA H100, it reaches 1000+ tokens per second. On an NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.

Google is very direct about the trade-off. DiffusionGemma prioritizes speed and parallel layout generation. Its overall output quality is lower than standard Gemma 4. For maximum quality production work, Google still recommends autoregressive Gemma 4.

How Text Diffusion Works

Text diffusion borrows its core idea from AI image generators. Those models start with visual static and refine it iteratively. DiffusionGemma applies the same pattern to text generation.

The process runs in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it makes multiple passes over that canvas. It locks in high-confidence tokens and uses them as context. Third, the text converges into the final output.

Google calls the core mechanism Uniform State Diffusion. Highly confident tokens help resolve adjacent positions during denoising. The full sequence then snaps into focus over several passes.

In practice, the model denoises a 256-token canvas in parallel. It finalizes roughly 15-20 tokens per forward pass. That parallelism is what drives the throughput gains.

The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models. Those models can only look backward at prior tokens.

That bidirectional context enables real-time self-correction. If a token’s confidence drops, the sampler can re-noise it. The model then replaces that token on a later pass. Autoregressive models cannot do this, since they commit each token once.

The Architecture

The technical advancement here is hardware utilization. For local GPU inference, the main bottleneck is memory bandwidth. Autoregressive models repeatedly load weights from memory per token. During single-user serving, the GPU spends most time waiting.

DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines a 256-token canvas in parallel. This gives idle tensor cores a large parallel workload.

The model alternates two attention modes during inference. Prefill uses causal attention to ingest the prompt and write the KV cache. Denoising uses bidirectional attention to refine the canvas.

For longer outputs, DiffusionGemma uses Block Autoregressive Diffusion. Once a 256-token block is fully denoised, it commits to the KV cache. The model then starts a fresh canvas conditioned on prior history. This pairs parallel block speed with sequential autoregressive stability.

The architecture shares the same backbone as Gemma 4 26B A4B. Developers mainly need to implement a denoising step. That makes integration into existing serving frameworks simpler.

A clear example is the Sudoku showcase from Google’s developer guide. Autoregressive models struggle with strict, multivariable constrained puzzles. The base DiffusionGemma model solves roughly 0% of Sudoku puzzles. After a simple JAX supervised fine-tuning recipe, correctness rises to 80%. The fine-tuned model also stops earlier, cutting inference steps.

Interactive Demo: How DiffusionGemma Decodes in Parallel

The interactive visualizer below illustrates how DiffusionGemma decodes text, contrasted with a standard autoregressive model. Toggle between the two modes and press Run. In Autoregressive mode, tokens fill in one at a time, strictly left to right, taking one forward pass per token — the way most LLMs generate today. In Diffusion mode, the model starts from a canvas of masked placeholder tokens and resolves many of them in parallel each pass, in no fixed order, converging in far fewer passes. The animation also shows a brief re-noise step, where a low-confidence token is reset and refined again — a stand-in for the real model’s self-correction, which autoregressive decoding cannot do once a token is committed. Note this is a conceptual animation, not live model output: the real DiffusionGemma resolves a 256-token canvas and finalizes roughly 15–20 tokens per forward pass.

#mtp-dg-demo *{box-sizing:border-box!important;margin:0!important;padding:0!important}
#mtp-dg-demo hr,#mtp-dg-demo p:empty,#mtp-dg-demo del,#mtp-dg-demo s{display:none!important}
#mtp-dg-demo{font-family:’Google Sans’,-apple-system,BlinkMacSystemFont,’Segoe UI’,Roboto,Arial,sans-serif!important;max-width:820px!important;margin:24px auto!important;background:#ffffff!important;border:1px solid #dadce0!important;border-radius:16px!important;overflow:hidden!important;box-shadow:0 1px 3px rgba(60,64,67,.12),0 4px 16px rgba(60,64,67,.08)!important;color:#202124!important;line-height:1.55!important}
#mtp-dg-demo .dgd-bar{display:flex!important;height:6px!important;width:100%!important}
#mtp-dg-demo .dgd-bar span{flex:1 1 25%!important}
#mtp-dg-demo .dgd-bar .b{background:#4285F4!important}
#mtp-dg-demo .dgd-bar .r{background:#EA4335!important}
#mtp-dg-demo .dgd-bar .y{background:#FBBC04!important}
#mtp-dg-demo .dgd-bar .g{background:#34A853!important}
#mtp-dg-demo .dgd-head{padding:22px 28px 0 28px!important}
#mtp-dg-demo .dgd-badge{display:inline-block!important;font-size:11px!important;font-weight:600!important;letter-spacing:.06em!important;text-transform:uppercase!important;color:#1a73e8!important;background:#e8f0fe!important;border-radius:999px!important;padding:5px 12px!important;margin-bottom:12px!important}
#mtp-dg-demo .dgd-title{font-size:23px!important;font-weight:700!important;letter-spacing:-.01em!important}
#mtp-dg-demo .dgd-note{font-size:12.5px!important;color:#5f6368!important;margin-top:8px!important;background:#f8f9fa!important;border:1px solid #e8eaed!important;border-left:3px solid #FBBC04!important;border-radius:6px!important;padding:8px 12px!important}
#mtp-dg-demo .dgd-modes{display:flex!important;gap:8px!important;padding:18px 28px 6px 28px!important;flex-wrap:wrap!important}
#mtp-dg-demo .dgd-mode{cursor:pointer!important;border:1px solid #dadce0!important;background:#fff!important;color:#3c4043!important;font-family:inherit!important;font-size:13px!important;font-weight:600!important;border-radius:999px!important;padding:8px 16px!important;transition:all .15s!important}
#mtp-dg-demo .dgd-mode.is-active[data-mode=”diffusion”]{background:#e6f4ea!important;border-color:#34A853!important;color:#188038!important}
#mtp-dg-demo .dgd-mode.is-active[data-mode=”auto”]{background:#fce8e6!important;border-color:#EA4335!important;color:#c5221f!important}
#mtp-dg-demo .dgd-stage{margin:8px 28px!important;padding:18px!important;background:#f8f9fa!important;border:1px solid #e8eaed!important;border-radius:12px!important;min-height:118px!important;display:flex!important;flex-wrap:wrap!important;gap:7px!important;align-content:flex-start!important}
#mtp-dg-demo .dgd-tok{font-size:13.5px!important;font-weight:500!important;border-radius:7px!important;padding:7px 11px!important;background:#e8eaed!important;color:#9aa0a6!important;letter-spacing:.04em!important;transition:background .25s,color .25s,transform .25s!important;border:1px solid transparent!important}
#mtp-dg-demo .dgd-tok.masked{font-family:’SF Mono’,Menlo,Consolas,monospace!important}
#mtp-dg-demo .dgd-tok.lit{color:#fff!important;transform:translateY(-1px)!important}
#mtp-dg-demo .dgd-tok.c0{background:#4285F4!important}
#mtp-dg-demo .dgd-tok.c1{background:#EA4335!important}
#mtp-dg-demo .dgd-tok.c2{background:#34A853!important}
#mtp-dg-demo .dgd-tok.c3{background:#F9AB00!important}
#mtp-dg-demo .dgd-tok.fix{background:#fff!important;color:#c5221f!important;border:1px dashed #EA4335!important}
#mtp-dg-demo .dgd-meta{display:flex!important;gap:18px!important;padding:0 28px!important;flex-wrap:wrap!important}
#mtp-dg-demo .dgd-stat{font-size:12.5px!important;color:#5f6368!important}
#mtp-dg-demo .dgd-stat b{color:#202124!important;font-weight:700!important;font-size:15px!important;display:block!important}
#mtp-dg-demo .dgd-ctrl{display:flex!important;align-items:center!important;gap:10px!important;padding:16px 28px 4px 28px!important}
#mtp-dg-demo .dgd-play{cursor:pointer!important;border:none!important;background:#1a73e8!important;color:#fff!important;font-family:inherit!important;font-size:14px!important;font-weight:600!important;border-radius:999px!important;padding:10px 22px!important;transition:background .15s!important}
#mtp-dg-demo .dgd-play:hover{background:#1557b0!important}
#mtp-dg-demo .dgd-reset{cursor:pointer!important;border:1px solid #dadce0!important;background:#fff!important;color:#3c4043!important;font-family:inherit!important;font-size:14px!important;font-weight:600!important;border-radius:999px!important;padding:10px 18px!important}
#mtp-dg-demo .dgd-reset:hover{background:#f1f3f4!important}
#mtp-dg-demo .dgd-status{font-size:12.5px!important;color:#5f6368!important;font-weight:600!important}
#mtp-dg-demo .dgd-foot{background:#f8f9fa!important;border-top:1px solid #e8eaed!important;margin-top:16px!important;padding:14px 28px!important;display:flex!important;align-items:center!important;justify-content:space-between!important;gap:12px!important;flex-wrap:wrap!important}
#mtp-dg-demo .dgd-foot .brand{font-size:13px!important;font-weight:700!important;color:#202124!important}
#mtp-dg-demo .dgd-foot .brand span{color:#4285F4!important}
#mtp-dg-demo .dgd-foot .tag{font-size:12px!important;color:#5f6368!important}
@media (max-width:640px){
#mtp-dg-demo .dgd-head{padding:18px 18px 0 18px!important}
#mtp-dg-demo .dgd-title{font-size:19px!important}
#mtp-dg-demo .dgd-modes,#mtp-dg-demo .dgd-meta,#mtp-dg-demo .dgd-ctrl{padding-left:18px!important;padding-right:18px!important}
#mtp-dg-demo .dgd-stage{margin:8px 18px!important;padding:14px!important}
#mtp-dg-demo .dgd-foot{padding:13px 18px!important}
}

Interactive · Illustrative
Watch DiffusionGemma Decode in Parallel
This is a conceptual animation of the denoising process — not live model output. The real model resolves a 256-token canvas, finalizing ~15–20 tokens per forward pass.

Diffusion (parallel)
Autoregressive (sequential)

0Forward passes
0 / 16Tokens resolved
DiffusionDecoding mode

▶ Run
Reset
Press Run to start.

Marktechpost
Practitioner-first AI/ML coverage — deep dives, model releases, and research, decoded for builders.

(function(){
var root=document.getElementById(‘mtp-dg-demo’);
if(!root)return;
var TARGET=[“DiffusionGemma”,”generates”,”entire”,”blocks”,”of”,”text”,”in”,”parallel”,”instead”,”of”,”one”,”token”,”at”,”a”,”time”,”.”];
var N=TARGET.length;
var stage=root.querySelector(‘#dgd-stage’);
var stepEl=root.querySelector(‘#dgd-step’);
var doneEl=root.querySelector(‘#dgd-done’);
var modeLbl=root.querySelector(‘#dgd-mode-lbl’);
var statusEl=root.querySelector(‘#dgd-status’);
var playBtn=root.querySelector(‘#dgd-play’);
var resetBtn=root.querySelector(‘#dgd-reset’);
var modeBtns=root.querySelectorAll(‘.dgd-mode’);
var mode=’diffusion’,resolved=[],timer=null,step=0,running=false,fixDone=false;

function build(){
stage.innerHTML=”;resolved=[];
for(var i=0;i