Google’s Best Small AI Models Now 3x Faster

Written by: Mane Sachin

Google is quietly reshaping how its Gemma 4 AI models work—and the impact could be significant for developers.

In a recent update, the company introduced a new inference architecture designed to make text generation much faster, up to three times the previous speed in some cases.

At the heart of this change is a method called Multi-Token Prediction (MTP). Instead of generating text one token (roughly, one word) at a time, the system now uses smaller helper models, known as “drafters,” to predict several tokens ahead. This helps reduce one of the biggest slowdowns in AI systems: the constant movement of data between memory and processing units.

The Gemma 4 model family, released not long ago with open weights, has already crossed 60 million downloads. Its growing popularity reflects how quickly developers are adopting open and flexible AI tools.

Earlier, most large language models followed a strict, step-by-step approach—producing one token at a time. While reliable, that process can be slow and resource-heavy. Google’s updated system changes that flow by separating prediction from verification.

In simple terms, a lightweight model first drafts several possible tokens at once. Then a larger, more powerful model—such as the 31-billion-parameter Gemma 4—reviews those suggestions together in a single pass. This reduces repeated work and speeds up output without compromising accuracy.
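The draft-then-verify flow described above can be illustrated with a toy sketch. To be clear, none of this is Google’s actual implementation: the “models” below are stand-in functions invented for the example, and a real system would score the entire drafted block in one batched forward pass rather than a Python loop.

```python
def draft_model(tokens):
    # Stand-in for the cheap drafter: guesses the next token with a simple rule.
    return (tokens[-1] + 1) % 50

def target_model(tokens):
    # Stand-in for the large verifier model: the "ground truth" next token.
    # It mostly agrees with the drafter, but diverges on multiples of 7.
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else 0

def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target
    model agrees with, plus one corrected token where they diverge."""
    # Phase 1: the drafter proposes k tokens in a row.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # Phase 2: verification. A real system checks every drafted position
    # in a single pass of the large model; here we loop for clarity.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)       # draft confirmed
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction, then stop
            break
    return tokens + accepted
```

Because the output is always what the verifier would have produced, the result matches plain step-by-step decoding; the drafter only changes how many tokens each expensive verification pass can confirm at once.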

Engineers at Google note that older systems often spent equal computing effort on both simple and complex tasks. The new design avoids that imbalance, allowing routine text generation to happen more efficiently.

The performance boost is visible across different hardware environments. On Apple Silicon devices, for example, the 26-billion-parameter mixture-of-experts model can run over twice as fast when handling multiple tasks simultaneously.

Google has also improved its smaller edge models, including Gemma 4 E2B and E4B. With better clustering techniques, these models are more efficient on mobile devices, helping conserve battery life while maintaining performance. Since the final output is still verified by the main model, accuracy and reasoning remain consistent.

Another key improvement lies in how the drafter models reuse the main model’s cache and intermediate results. This avoids repeating calculations and makes the system more efficient overall.
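A minimal sketch of why that reuse pays off, again with a made-up stand-in rather than a real transformer: per-prefix state is memoized the way a KV cache memoizes attention keys and values, so verifying an extended sequence only computes the new suffix. The counter makes the savings visible.

```python
class CachedModel:
    """Toy stand-in for a model with a KV cache: the 'state' of each
    token prefix is computed once and memoized, so later calls that
    share a prefix reuse the earlier work instead of recomputing it."""

    def __init__(self):
        self.cache = {}          # prefix tuple -> cached state
        self.compute_calls = 0   # counts fresh (non-cached) computations

    def state(self, tokens):
        key = tuple(tokens)
        if key in self.cache:
            return self.cache[key]      # cache hit: no recomputation
        self.compute_calls += 1
        # Stand-in computation: the state of a prefix is a running sum.
        prev = self.state(tokens[:-1]) if tokens else 0
        result = prev + tokens[-1] if tokens else 0
        self.cache[key] = result
        return result
```

Processing [1, 2, 3] computes each prefix once; extending to [1, 2, 3, 4] afterwards costs only one fresh computation, because everything up to [1, 2, 3] is served from the cache. That is the same economics as a drafter piggybacking on the main model's cache.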

The MTP-based drafters are being released under the Apache 2.0 open-source licence, allowing developers to freely use and adapt them. Model weights are available on platforms like Hugging Face and Kaggle, with support from tools such as SGLang already in place.

The update also works smoothly with frameworks like MLX and Ollama, making it easier to build faster AI applications—from real-time coding assistants to autonomous agents—on everyday hardware.

