Google DeepMind Gemma4 (Image © Google)
The 12B model offers a significant leap in efficiency: its benchmark performance almost matches that of the larger 26B model. This lets you run complex multi-stage inference and agentic workflows locally, without relying on extensive cloud computing resources.
To further improve accessibility and speed, Google DeepMind has released quantization-aware training (QAT) weights for the entire Gemma-4 product suite. Traditional post-training quantization often costs model accuracy, whereas QAT incorporates the quantization step directly into the training phase, so the model learns to compensate for the reduced precision. This approach lowers memory requirements and speeds up token generation while keeping output quality close to that of the original weights.
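
To make the idea behind QAT more concrete, here is a minimal sketch in plain Python. It is not Google DeepMind's training procedure: the 4-bit width, the symmetric per-tensor scale, the toy linear model and the names `fake_quantize`, `mse` and `train_qat` are all assumptions chosen for illustration. The forward pass sees rounded ("fake quantized") weights, while the gradient is applied to full-precision latent weights, a trick commonly called the straight-through estimator.

```python
def fake_quantize(weights, num_bits=4):
    """Round weights to a signed num_bits grid, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1          # largest integer level, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                # nothing to scale
    scale = max_abs / qmax                  # assumed: one symmetric scale per tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mse(weights, xs, ys):
    """Mean squared error of the linear model y = w . x."""
    errors = [sum(w * xi for w, xi in zip(weights, x)) - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def train_qat(xs, ys, steps=200, lr=0.05, num_bits=4):
    """Toy QAT loop: quantized forward pass, full-precision weight update."""
    latent = [0.0] * len(xs[0])             # full-precision weights kept during training
    for _ in range(steps):
        quantized = fake_quantize(latent, num_bits)
        grad = [0.0] * len(latent)
        for x, y in zip(xs, ys):
            err = sum(w * xi for w, xi in zip(quantized, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(xs)
        # Straight-through estimator: treat rounding as identity in the backward pass.
        latent = [w - lr * g for w, g in zip(latent, grad)]
    return latent, fake_quantize(latent, num_bits)


# Tiny data set generated from the target weights [0.5, -0.25].
XS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
YS = [0.5, -0.25, 0.25]
```

Calling `train_qat(XS, YS)` returns both the latent weights and their quantized counterparts; comparing `mse` on the quantized weights with `mse([0.0, 0.0], XS, YS)` shows how much of the fit survives quantization in this toy setting.
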
These optimizations provide broader hardware compatibility, with performance improvements seen on chips from NVIDIA, AMD, Intel, Qualcomm and Apple. The QAT weights are currently available for a wide range of model sizes, including E2B, E4B, 12B, 26B and 31B versions.
Integration of the new model and weights is provided in optimized form through [Ollama][1]. The 12B model can be used in various developer tools and applications such as Claude Code, Codex App, Hermes Agent and OpenClaw, as well as for general chat purposes.

[1]: https://www.pcmasters.de/server/133714724-ai-chatbot-hosten-auf-eigenem-server-auf-ubuntu-debian-mit-ollama-und-open-webui.html
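
For readers who want to try a locally hosted model from their own scripts, the following sketch talks to Ollama's local HTTP endpoint using only the Python standard library. The model tag `gemma4:12b` is a placeholder assumption, not a confirmed name; check `ollama list` for the tag actually installed. The helper names `build_request` and `generate` are likewise made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"  # hypothetical tag; replace with the one shown by `ollama list`


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, url=OLLAMA_URL, timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With an Ollama server running and the model pulled, `print(generate("Summarize QAT in one sentence."))` should print the model's answer. Setting `"stream": False` asks for a single JSON reply instead of a stream of partial responses, which keeps the example short.
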


