Gemma 4 MTP Drafters: How Multi-Token Prediction Delivers 2x+ Faster Local Inference
On May 5, 2026, Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 family. The headline claim — up to 3x inference speedup — is technically accurate on specific hardware. The more realistic number for most developer setups is 1.7x to 2.2x, which is still a meaningful improvement on the kind of hardware developers actually use for local model inference.
This article covers how MTP works, what the real-world speedups look like across different hardware, and how to enable it with the frameworks that already support it.
What MTP Does
Standard LLM inference generates one token at a time. Each token requires a full forward pass through the model — and since each forward pass depends on the previous token's output, the process is inherently sequential. This is the bottleneck.
Multi-Token Prediction addresses this with speculative execution: pair the large model with a small, fast "drafter" model. The drafter predicts several tokens ahead in parallel. The large model then verifies the predictions in a single pass — accepting correct ones and rejecting wrong ones.
When the drafter is right (which is often, especially in high-predictability text), multiple tokens are committed per large-model pass. When the drafter is wrong, the large model rejects the prediction and generates the correct token itself — no worse than standard inference, just slower for that step.
The critical property: the output is identical to what standard inference would have produced (bit-for-bit under deterministic decoding). The large model has final authority over every token. MTP is purely a speed optimization with no quality trade-off.
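For intuition, here is a schematic of the draft-and-verify loop under greedy decoding. This is a sketch, not any framework's actual implementation: draft_next and target_next are stand-in callables for the drafter and target models, and a real implementation verifies all drafted positions in a single batched forward pass rather than one position at a time.

def speculative_decode(prompt_tokens, draft_next, target_next, k=4, max_new=64):
    """Greedy speculative decoding sketch.
    draft_next(tokens)  -> next token id the small drafter predicts
    target_next(tokens) -> next token id the large target model would emit
    """
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new:
        # 1. Drafter guesses k tokens ahead (cheap, sequential for the drafter).
        guesses, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            guesses.append(ctx[-1])
        # 2. Target verifies the guesses (one batched pass in practice).
        prefix = list(tokens)
        accepted = 0
        for i, guess in enumerate(guesses):
            correct = target_next(prefix + guesses[:i])
            if correct != guess:
                tokens.append(correct)  # reject: commit the target's token, discard the rest
                break
            tokens.append(guess)  # accept: the guess matches the target
            accepted += 1
        if accepted == len(guesses):
            tokens.append(target_next(tokens))  # all accepted: target adds one bonus token
    return tokens

Every committed token is exactly the token the target model would have produced for that prefix, which is why the output does not change.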
Hardware-Specific Speedups
Google's benchmarks from the official May 5 announcement (source: blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/):
| Hardware | Speedup |
|---|---|
| NVIDIA RTX Pro 6000 | ~2x (nearly doubles token throughput) |
| Apple Silicon (M-series, general) | 2.2x |
| Apple M4 with SME | up to 1.5x on CPU |
| Mobile GPU (general) | up to 2.2x |
| Typical developer hardware | 1.7x–2.2x |
The 3x peak number is achievable under specific conditions (the right combination of framework, model size, and hardware). When benchmarking your own setup, treat 1.7x as the realistic baseline expectation.
The speedup is larger when:
- The text being generated is more predictable (code, structured output, repetitive patterns)
- The drafter model's architecture matches the target model well
- Memory bandwidth is the bottleneck rather than compute
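These conditions all feed into one number: how often the target accepts the drafter's guesses. A common back-of-the-envelope model from the speculative decoding literature (not from Google's announcement) treats each drafted token as accepted independently with probability accept_rate, which gives the expected tokens committed per large-model pass:

# Expected tokens committed per target-model pass, assuming each drafted token
# is accepted independently with probability accept_rate (a simplification).
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

for accept_rate in (0.5, 0.7, 0.9):
    print(accept_rate, round(expected_tokens_per_pass(accept_rate, k=5), 2))
# 0.5 -> 1.97, 0.7 -> 2.94, 0.9 -> 4.69

Wall-clock speedup comes in below these numbers because the drafter's own passes are not free, but the shape of the curve shows why predictable text (high acceptance rate) benefits so much more than free-form prose.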
Architecture: Target + Drafter
Gemma 4's MTP implementation uses a separate, smaller drafter model paired with each target model. These are the released drafter pairings:
- gemma-4-e2b-it → gemma-4-e2b-it-mtp
- gemma-4-e4b-it → gemma-4-e4b-it-mtp
- gemma-4-pt-27b-it → gemma-4-pt-27b-it-mtp
The drafter is significantly smaller than the target model, so the combined memory overhead is lower than running two full models. The exact parameter counts are not publicly disclosed, but the drafter is designed to fit alongside the target model within normal VRAM budgets.
Enabling MTP with Ollama
If you are already running Gemma 4 with Ollama, MTP is the simplest path:
# Pull the target model (if you haven't already)
ollama pull gemma4:27b
# Pull the MTP drafter
ollama pull gemma4:27b-it-mtp
# Run with MTP — Ollama handles the pairing automatically
ollama run gemma4:27b --draft gemma4:27b-it-mtp
Ollama's implementation handles the speculation loop internally. You interact with the model exactly as before — no changes to your API calls.
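For example, a request against Ollama's standard REST endpoint looks the same whether or not a drafter is attached; the model name below is the one pulled above:

import requests

# Standard Ollama generate call; MTP is a server/run-time configuration,
# so nothing in the request changes.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:27b",
        "prompt": "Write a function that parses ISO 8601 timestamps.",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])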
Enabling MTP with vLLM
vLLM supports Gemma 4 MTP via its speculative decoding configuration. The key is pointing the --speculative-model flag at the drafter:
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-4-pt-27b-it \
--speculative-model google/gemma-4-pt-27b-it-mtp \
--num-speculative-tokens 5 \
--gpu-memory-utilization 0.92
--num-speculative-tokens controls how many tokens the drafter attempts to predict ahead. Higher values can improve throughput when the drafter is accurate, but increase cost when it is wrong. Starting at 5 is a reasonable default.
The API endpoint remains the same — applications using the OpenAI-compatible API do not need modification.
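For instance, a client using the OpenAI Python SDK against the vLLM server started above keeps making the same call; speculative decoding happens entirely server-side:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
completion = client.chat.completions.create(
    model="google/gemma-4-pt-27b-it",
    messages=[{"role": "user", "content": "Summarize what an MTP drafter does."}],
    max_tokens=200,
)
print(completion.choices[0].message.content)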
Enabling MTP with MLX (Apple Silicon)
For Apple Silicon users, MLX is the fastest inference path for Gemma 4. MLX supports MTP drafters natively:
from mlx_lm import load, generate
# Load target model
model, tokenizer = load("google/gemma-4-pt-27b-it")
# Load drafter (the standard load() works here too; the drafter shares the target's tokenizer)
draft_model, _ = load("google/gemma-4-pt-27b-it-mtp")
# Generate with speculative decoding
response = generate(
model,
tokenizer,
prompt="Explain how KV-cache quantization works:",
draft_model=draft_model,
num_draft_tokens=4,
max_tokens=512,
)
print(response)
The 2.2x speedup figure comes from this MLX + Apple Silicon combination. Metal GPU acceleration is what makes the parallel verification step fast.
Enabling MTP with HuggingFace Transformers
HuggingFace Transformers added Gemma 4 MTP support alongside the May 5 release:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
target_model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-pt-27b-it",
torch_dtype=torch.bfloat16,
device_map="auto",
)
draft_model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-pt-27b-it-mtp",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-pt-27b-it")
inputs = tokenizer("Explain transformer attention:", return_tensors="pt").to("cuda")
outputs = target_model.generate(
**inputs,
assistant_model=draft_model,
do_sample=False,
max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The assistant_model parameter is HuggingFace's API for speculative decoding. No other changes required.
When MTP Helps Most
The speedup is not uniform. MTP works best when:
The drafter is often right. Code generation, structured output (JSON, markdown), and template-heavy text see larger speedups than free-form creative writing where each token is harder to predict.
Memory bandwidth is the bottleneck. On devices where memory bandwidth limits throughput (most consumer GPUs, all Apple Silicon), reducing the number of full-model passes through MTP gives proportional gains. On memory-compute balanced hardware, gains are smaller.
Context is long. The KV cache grows with context length. Longer contexts make each autoregressive step more expensive, so reducing the number of steps via speculation is proportionally more valuable.
What Does Not Change
Two things are worth being clear about:
Output quality. MTP does not degrade Gemma 4's outputs. The target model verifies every token before committing it. If you run the same generation with and without MTP (using deterministic decoding), you get identical output.
Your API. Whether you are using Ollama's API, vLLM's OpenAI-compatible endpoint, or HuggingFace Transformers, MTP is a configuration at the infrastructure level. Application code that talks to the model does not change.
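Both claims are easy to check on your own setup. Reusing target_model, draft_model, tokenizer, and inputs from the Transformers example above, the sketch below times generation with and without the drafter and asserts that the greedy outputs match (it assumes a CUDA device, as in that example):

import time
import torch

def timed_generate(**extra):
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = target_model.generate(**inputs, do_sample=False, max_new_tokens=200, **extra)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return out, new_tokens / elapsed

baseline, base_tps = timed_generate()
drafted, draft_tps = timed_generate(assistant_model=draft_model)

# Deterministic decoding: the two runs should produce identical token sequences.
assert torch.equal(baseline, drafted)
print(f"baseline: {base_tps:.1f} tok/s, with drafter: {draft_tps:.1f} tok/s")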
Availability
The Gemma 4 MTP drafters are available under the same Apache 2.0 license as the base Gemma 4 models:
- HuggingFace: google/gemma-4-*-mtp model IDs
- Kaggle: Available in the Gemma 4 model collection
- Ollama: Available via ollama pull gemma4:*-mtp
Framework support is live for: vLLM, MLX, SGLang, LiteRT-LM, and HuggingFace Transformers.
How It Compares to Other Gemma 4 Guides
If you found this via search, two other Effloow articles may also be relevant depending on your situation:
- Gemma 4 Local Setup with Ollama and Open WebUI — Getting Gemma 4 running from scratch, hardware requirements, model size comparison.
- Gemma 4 26B vs 31B: Which to Run Locally — MoE vs dense tradeoffs, VRAM requirements, quantization guide.
This article is specifically about MTP as a speedup technique, not the initial setup or model selection.
Summary
Gemma 4 MTP drafters, released May 5, 2026, are a practical inference acceleration with no quality trade-off. The realistic speedup range on developer hardware is 1.7x–2.2x — enough to be noticeable in interactive applications. Enabling it requires pulling one extra model file and adding one configuration option to your existing inference setup.
For local Gemma 4 users, this is a straightforward upgrade: the models are freely available, the frameworks are already updated, and the integration is a configuration change rather than an architectural one.