Unlike traditional models that predict the next token in a linear sequence, DiffusionGemma generates text in parallel blocks. It starts with a canvas of random placeholder tokens and goes through several passes of iterative refinement. During this process, the model determines the correct tokens and uses them as context to refine the remaining text until a final output is achieved.
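The refinement loop described above can be illustrated with a toy sketch. Nothing here is DiffusionGemma's actual decoder: `toy_predict` is a hypothetical stand-in for the model that simply looks up the right answer and invents a confidence score, and the canvas uses a single `<mask>` placeholder rather than random tokens. The sketch only shows the control flow: propose every open slot in parallel, commit the most confident proposals, then reuse them as context in the next pass.

```python
import random

MASK = "<mask>"

def toy_predict(canvas, target, rng):
    """Hypothetical denoiser stand-in: proposes (token, confidence) for every masked slot.

    Confidence rises when neighbouring slots are already committed, mimicking
    how fixed tokens act as context for the remaining ones.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(canvas[j] != MASK for j in neighbours)
        proposals[i] = (target[i], 0.5 * rng.random() + 0.25 * known)
    return proposals

def refine(target, passes=4, seed=0):
    """Fill a fully masked canvas over a fixed number of parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(passes):
        proposals = toy_predict(canvas, target, rng)
        if not proposals:
            break
        # Commit an even share of the open slots; the last pass takes all that remain.
        quota = -(-len(proposals) // (passes - step))  # ceiling division
        most_confident = sorted(proposals, key=lambda i: -proposals[i][1])[:quota]
        for i in most_confident:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history

if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    _, trace = refine(words)
    for snapshot in trace:
        print(" ".join(snapshot))
```

Each printed line is one pass. Slots are filled in confidence order rather than left to right, which is the main difference from next-token decoding.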
The model is built as a “Mixture of Experts” system with a total of 26 billion parameters, but activates only 3.8 billion of them per inference. This architecture allows the model to be quantized to fit into 18 GB of VRAM, which puts it within reach of high-end consumer GPUs.
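A quick back-of-the-envelope check of these figures, as a sketch only. The article does not say which bit width the 18 GB figure refers to, so the 4-bit and 16-bit cases below are assumptions. The calculation counts weights alone and ignores activations, caches, and runtime overhead. It also assumes that all experts stay resident in memory, as is usual for Mixture of Experts models when nothing is offloaded.

```python
TOTAL_PARAMS = 26e9      # from the article
ACTIVE_PARAMS = 3.8e9    # from the article
VRAM_BUDGET_GB = 18      # from the article

def weight_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

# Share of the parameters that take part in a single inference step (about 14.6%).
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumed bit widths; the article does not specify the quantization behind 18 GB.
fp16_gb = weight_gb(TOTAL_PARAMS, 16)   # 52 GB, far above the budget
int4_gb = weight_gb(TOTAL_PARAMS, 4)    # 13 GB, leaves headroom in 18 GB

if __name__ == "__main__":
    print(f"active fraction: {active_fraction:.1%}")
    print(f"16-bit weights:  {fp16_gb:.0f} GB")
    print(f"4-bit weights:   {int4_gb:.0f} GB, headroom {VRAM_BUDGET_GB - int4_gb:.0f} GB")
```

Under these assumptions, the sparse activation reduces compute per step, while quantization is what brings the full set of weights under the memory budget.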
Performance metrics and hardware utilization
DiffusionGemma was developed to address the memory-bandwidth bottleneck that commonly limits local LLM inference. By doing more computation per pass, it generates text up to four times faster on dedicated GPUs than autoregressive models. Benchmarks show more than 1,000 tokens per second on the NVIDIA H100 and more than 700 tokens per second on the GeForce RTX 5090.
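The bandwidth argument can be made concrete with a deliberately crude model. It assumes that every forward pass must stream the same amount of weight data from memory and that nothing else limits speed. All numbers below are illustrative placeholders, not hardware specifications or measurements, and the article does not state how many passes a 256-token block takes.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb_per_pass, tokens_per_pass):
    """Upper bound on tokens/s when each pass must read the weights once from memory."""
    passes_per_s = bandwidth_gb_s / weights_gb_per_pass
    return passes_per_s * tokens_per_pass

# Illustrative placeholders only (not real GPU specs or model measurements).
BANDWIDTH_GB_S = 1000.0
WEIGHTS_GB_PER_PASS = 2.0
BLOCK_TOKENS = 256          # block size mentioned in the article
ASSUMED_PASSES = 64         # made-up pass count per block

# Autoregressive decoding: one token per pass.
autoregressive = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, WEIGHTS_GB_PER_PASS, 1)

# Block diffusion: a whole block is finished after ASSUMED_PASSES passes.
diffusion = decode_ceiling_tokens_per_s(
    BANDWIDTH_GB_S, WEIGHTS_GB_PER_PASS, BLOCK_TOKENS / ASSUMED_PASSES
)

if __name__ == "__main__":
    print(f"autoregressive ceiling: {autoregressive:.0f} tokens/s")
    print(f"diffusion ceiling:      {diffusion:.0f} tokens/s")
    print(f"ratio:                  {diffusion / autoregressive:.1f}x")
```

In this model the speedup equals the number of tokens finished per pass. Each pass does more arithmetic for the same memory traffic, so the model leaves out the compute limit, which is where a real system would eventually saturate.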
To further increase throughput, the model supports NVFP4 (4-bit floating point) kernels, which accelerate compute speeds on NVIDIA Hopper and Blackwell architectures with minimal loss of accuracy.
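To give a feel for what a 4-bit floating point format can represent, here is a simplified sketch of block-scaled quantization onto a 4-bit value grid. It assumes the E2M1 layout (one sign bit, two exponent bits, one mantissa bit) with a shared scale per small block of values. Unlike real NVFP4 kernels, the scale here is kept as an ordinary Python float and no bits are actually packed, so this is a numerical illustration rather than the production format.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(values):
    """Map one block of floats to (scale, codes), where every code lies on the E2M1 grid."""
    amax = max(abs(v) for v in values)
    # Choose the scale so that the largest magnitude lands on the top grid value, 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and its grid codes."""
    return [scale * c for c in codes]

if __name__ == "__main__":
    block = [0.12, -0.03, 0.40, -0.25, 0.07, 0.0, -0.31, 0.18]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    for original, approx in zip(block, restored):
        print(f"{original:+.3f} -> {approx:+.3f}")
```

The grid is dense near zero and coarse near the maximum, so small weights keep relatively more detail. The per-block scale limits how far a single outlier can degrade the values around it.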
Applications in non-linear text domains
Bidirectional attention, in which any token in a 256-token block can attend to any other token in that block, provides a technical advantage for non-linear tasks. This makes DiffusionGemma particularly suitable for:
- Code infilling: Completing missing sections of code based on the surrounding context.
- Inline editing: Fast iteration on specific sections of a text block.
- Complex structures: Generating math graphs and amino acid sequences, and solving logic puzzles such as Sudoku, where future tokens influence current ones.
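The difference that matters for these tasks is which positions a token can see. The sketch below builds two boolean attention masks: a standard causal one, and a block-wise bidirectional one. The article only states that tokens within a block attend to each other. Letting a block also see earlier blocks but not later ones is an assumption made here for illustration, and the block size is shrunk from 256 to 4 to keep the output readable.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (the past only)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_bidirectional_mask(n, block):
    """Full attention inside a block; earlier blocks are visible (assumption), later ones are not."""
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]

def visible(mask, i):
    """Indices that position i can attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

if __name__ == "__main__":
    n, block = 8, 4
    causal = causal_mask(n)
    bidirectional = block_bidirectional_mask(n, block)
    # Position 1 is a gap to fill, with known text on both sides of it.
    print("causal sees:       ", visible(causal, 1))
    print("bidirectional sees:", visible(bidirectional, 1))
```

With the causal mask, position 1 sees only positions 0 and 1, so text to the right of a gap cannot inform it. With the block mask it sees the whole block, which is the property that infilling, inline edits, and constraint puzzles rely on.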
Implementation and quality aspects
While DiffusionGemma offers significant speed advantages, there is a documented trade-off in output quality. The model prioritizes generation speed and parallel decoding over the higher output quality of the standard Gemma-4 models. Consequently, it is positioned as a tool for researchers and developers who focus on interactive, speed-critical workflows rather than on final production output.
The model is available under the Apache 2.0 license via Hugging Face. Integration is supported by several frameworks, including vLLM, MLX and Hugging Face Transformers, with additional fine-tuning options available via Unsloth and NVIDIA NeMo.



