DiffusionGemma: What parallel text diffusion brings to local inference

11.06.2026, 17:30, by News-Redaktion

DiffusionGemma introduces an experimental approach to text generation that moves away from the sequential, token-by-token processing typical of autoregressive Large Language Models (LLMs). Using text diffusion, this 26-billion-parameter Mixture of Experts (MoE) model generates entire blocks of text at once, which significantly reduces latency in local deployments.

DiffusionGemma Release (Image © Google)

Unlike traditional models that predict the next token in a linear sequence, DiffusionGemma generates text in parallel blocks. It starts with a canvas of random placeholder tokens and goes through several passes of iterative refinement. During this process, the model determines the correct tokens and uses them as context to refine the remaining text until a final output is achieved.
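The refinement loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `<mask>` placeholder, the `toy_denoiser` function (which simply looks up a known target instead of predicting anything) and the rule of committing the two most confident tokens per pass. It is not DiffusionGemma's actual sampling procedure.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would predict the token from context; this toy reads it
    from `target` and only simulates the confidence score.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(t != MASK for t in neighbours)
        # More committed neighbours -> higher simulated confidence.
        proposals[i] = (target[i], known + rng.random())
    return proposals


def refine(target, tokens_per_pass=2, seed=0):
    """Fill a fully masked canvas over several passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        proposals = toy_denoiser(canvas, target, rng)
        # Commit only the most confident proposals; the rest stay masked
        # and are re-evaluated with more context in the next pass.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes


result, n_passes = refine("the quick brown fox jumps over".split())
print(result, n_passes)
```

With six tokens and two commits per pass, the canvas is complete after three passes, whereas a token-by-token decoder would need six steps.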

The model is built as a “Mixture of Experts” system with 26 billion parameters in total, of which only 3.8 billion are active during inference. Quantized, it fits into 18 GB of VRAM, which puts it within reach of high-end consumer GPUs.
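A rough back-of-the-envelope calculation shows why quantization matters here. The sketch below only counts the weights; the bit widths are illustrative assumptions, since the article does not say which quantization yields the 18 GB figure. Note that in an MoE model all experts normally have to reside in memory, so the total parameter count (not the 3.8 billion active parameters) determines the footprint.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count from the article

for bits in (16, 8, 4):  # assumed bit widths, for comparison only
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit weights: 52.0 GB
#  8-bit weights: 26.0 GB
#  4-bit weights: 13.0 GB
```

At 4 bits the weights alone come to about 13 GB, which leaves headroom within 18 GB for quantization scales, activations and runtime buffers.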

DiffusionGemma Benchmark (Image © Google)

Performance metrics and hardware utilization

DiffusionGemma was developed to address the memory bandwidth bottleneck commonly encountered in local LLM inference. By doing more computation per forward pass, it achieves up to four times the generation speed of autoregressive models on dedicated GPUs. The published benchmarks show more than 1,000 tokens per second on the NVIDIA H100 and more than 700 tokens per second on the GeForce RTX 5090.
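The bandwidth argument can be made concrete with a simplified upper-bound model: if every forward pass is limited by streaming the weights from memory once, throughput depends on how many finished tokens each pass yields. All numbers below are illustrative assumptions rather than measurements, except the 256-token block size the article mentions; the number of refinement passes in particular is made up. The model also ignores compute time and the fact that an MoE may touch more experts when many tokens are processed together, so real speedups are smaller.

```python
def tokens_per_second(bytes_per_pass, bandwidth_bytes_s, tokens_per_pass):
    """Upper bound when each forward pass is limited by reading weights once."""
    passes_per_second = bandwidth_bytes_s / bytes_per_pass
    return passes_per_second * tokens_per_pass


# Illustrative assumptions, not measurements.
BYTES_PER_PASS = 10e9  # weights streamed from memory per forward pass
BANDWIDTH = 1e12       # 1 TB/s of memory bandwidth
BLOCK = 256            # block size from the article
PASSES = 32            # assumed refinement passes per block

# Autoregressive decoding: one finished token per forward pass.
autoregressive = tokens_per_second(BYTES_PER_PASS, BANDWIDTH, 1)
# Block diffusion: BLOCK tokens are finished after PASSES forward passes.
diffusion = tokens_per_second(BYTES_PER_PASS, BANDWIDTH, BLOCK / PASSES)

print(f"autoregressive bound: {autoregressive:.0f} tokens/s")
print(f"block diffusion bound: {diffusion:.0f} tokens/s")
```

The point of the sketch is the ratio: each weight read is amortized over `BLOCK / PASSES` tokens instead of one, which shifts the workload from being bandwidth-bound toward being compute-bound.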

DiffusionGemma Intelligence vs. Latency (Image © Google)

To further increase throughput, the model supports NVFP4 (4-bit floating point) kernels, which accelerate compute speeds on NVIDIA Hopper and Blackwell architectures with minimal loss of accuracy.
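To give an idea of what 4-bit floating point means in practice, the following sketch rounds a small block of values to the magnitudes representable in a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit) using one shared scale per block. It is a simplification under stated assumptions: the scale is kept as an ordinary Python float, the block length is arbitrary, and the function names are hypothetical. It does not reproduce NVIDIA's kernels or the exact NVFP4 storage layout.

```python
# Magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Return (scale, codes): codes are signed E2M1 grid values."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        scaled = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(scaled - g))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the codes."""
    return [scale * c for c in codes]


weights = [0.1, -0.6, 1.2, 0.3, 0.45]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f}")
```

Values that happen to fall on the scaled grid are restored almost exactly, while others (such as 0.45 here) are rounded to the nearest representable neighbour, which is the source of the small accuracy loss.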

DiffusionGemma Model (Image © Google)

Applications in non-linear text domains

Bidirectional attention, in which any token in a 256-token block can attend to any other token in that block, gives the model a technical advantage in non-linear tasks. This makes DiffusionGemma particularly suitable for:

Code infilling: Completing missing sections of code based on the surrounding context.
Inline editing: Fast iteration on specific sections of a text block.
Complex structures: Creating math graphs and amino acid sequences, and solving logic puzzles like Sudoku, where future tokens influence current ones.
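The difference between causal and bidirectional attention behind these use cases can be shown with two small masks. This is a minimal sketch in plain Python with hypothetical helper names; it uses a 6-token block for readability instead of the 256 tokens cited above.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token in the block may attend to every other token."""
    return [[True] * n for _ in range(n)]


def visible_context(mask, position):
    """Positions that the token at `position` is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[position]) if allowed]


n = 6  # tiny block for display
print(visible_context(causal_mask(n), 2))         # [0, 1, 2]
print(visible_context(bidirectional_mask(n), 2))  # [0, 1, 2, 3, 4, 5]
```

A gap at position 2 can only see what precedes it under the causal mask, whereas the bidirectional mask also exposes the text that follows, which is exactly what infilling and puzzle-like tasks need.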

Implementation and quality aspects

While DiffusionGemma offers significant speed advantages, there is a documented trade-off in output quality. The model prioritizes generation speed and parallel generation over the higher precision of the standard Gemma-4 models. It is therefore positioned as a tool for researchers and developers who focus on interactive, speed-critical workflows rather than on final production results.

The model is available under the Apache 2.0 license via Hugging Face. Integration is supported by several frameworks, including vLLM, MLX and Hugging Face Transformers, with additional fine-tuning options available via Unsloth and NVIDIA NeMo.

Source: Google
