Gemma 4 MTP Drafters: 3x Faster Inference via Speculative Decoding
On May 5, 2026, Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. The headline claim: up to 3x faster inference, with no degradation in output quality.
This is not a research preview. The drafters are available now on HuggingFace under Apache 2.0, compatible with HuggingFace Transformers, vLLM, MLX, LiteRT-LM, and llama.cpp. This article explains the architecture, why it works better than a naive small-model approach, and how to use it.
What Speculative Decoding Is
Standard autoregressive LLM inference has a fundamental bottleneck: to generate N tokens, you need N sequential forward passes through the full model. Each pass is blocked by the previous one.
Speculative decoding breaks this bottleneck with a two-model setup:
- A small, fast draft model generates a batch of N candidate tokens autoregressively
- The large target model verifies all N candidates in a single parallel forward pass
- The target keeps the longest prefix of draft tokens that matches its own predictions and discards the rest; if it accepts all N, you get N tokens at the cost of roughly one target model forward pass
The effective speedup depends on how often the target accepts the draft tokens — called the acceptance rate. Higher acceptance rate = greater speedup.
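The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy variant, not the code any of the listed frameworks run: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and the verification here calls the target once per position, whereas a real implementation scores all draft positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. Verify the candidates against the target, left to right.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the remaining drafts.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched: the verification also yields one extra target token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate tokens speculatively; return (tokens, number of verify steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy models (hypothetical): the target counts upward modulo 10;
# the draft agrees except that it never predicts the token 7.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 7 else nxt
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to plain greedy decoding with the target alone; only the number of verification steps changes.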
What Gemma 4's MTP Drafters Do Differently
Most speculative decoding implementations pair an arbitrary small model with the target. This has a known problem: the small model's output distribution differs from the target's. Draft tokens can be plausible yet consistently diverge from what the target would have produced, which drives the acceptance rate down.
Gemma 4's "assistant" models solve this by being architecturally coupled to the target:
- The drafter shares the target model's embedding table — no vocabulary mismatch
- The drafter shares the target's KV cache — context is consistent
- For edge variants (E2B and E4B), an efficient embedding clustering technique eliminates the final logit calculation bottleneck
The result is a higher acceptance rate than you'd get by pairing Gemma 4 with an off-the-shelf small LM. This is why the 3x figure holds up across diverse tasks — code, math, and general conversation — not just easy repetitive outputs where speculative decoding typically shines.
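To see why distribution mismatch matters, consider the standard speculative sampling rule from the research literature, where a draft token x is accepted with probability min(1, p(x)/q(x)) for target distribution p and draft distribution q. Under that rule, the chance that a draft token is accepted works out to the sum of min(p(x), q(x)) over the vocabulary. The sketch below assumes this rule applies and uses made-up three-token distributions; the function name is hypothetical and nothing here is specific to Gemma 4.

```python
from typing import Sequence


def acceptance_probability(p_target: Sequence[float],
                           q_draft: Sequence[float]) -> float:
    """Probability a draft token is accepted under speculative sampling.

    Equals sum_x min(p(x), q(x)), which is 1 minus the total variation
    distance between the target and draft distributions.
    """
    return sum(min(p, q) for p, q in zip(p_target, q_draft))


# Illustrative next-token distributions over a 3-token vocabulary.
target_dist = [0.70, 0.20, 0.10]
aligned_draft = [0.65, 0.25, 0.10]      # close to the target
mismatched_draft = [0.30, 0.30, 0.40]   # plausible but off-distribution

aligned_rate = acceptance_probability(target_dist, aligned_draft)
mismatched_rate = acceptance_probability(target_dist, mismatched_draft)
```

With these numbers the aligned draft is accepted about 95% of the time and the mismatched one about 60%, which is the gap a coupled drafter is meant to close.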
Available Model Variants
| Model variant | Target ID | Drafter ID |
|---|---|---|
| E2B (edge, 2B) | google/gemma-4-E2B-it | google/gemma-4-E2B-it-assistant |
| E4B (edge, 4B) | google/gemma-4-E4B-it | google/gemma-4-E4B-it-assistant |
| 26B-A4B (MoE) | google/gemma-4-26B-A4B-it | google/gemma-4-26B-A4B-it-assistant |
| 31B | google/gemma-4-31B-it | google/gemma-4-31B-it-assistant |
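The E2B and E4B are designed for on-device and edge inference. The 26B-A4B is a Mixture-of-Experts variant with 26B total parameters but only 4B active per token, which makes it the practical pick for GPU inference on a 40GB card. Keep in mind that all 26B parameters typically still have to sit in memory: at bfloat16 that is about 52GB of weights, so fitting on a 40GB card implies quantization.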
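A quick way to sanity-check which variant fits a given card is to multiply parameter count by bytes per parameter. The helper below is a rough sketch with a hypothetical name; it treats the nominal sizes in the model names as exact, counts weights only, and ignores the KV cache, activations, and the drafter itself.

```python
def weight_memory_gb(num_params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Weights only: KV cache, activations, and framework overhead are excluded.
    For a Mixture-of-Experts model, pass the TOTAL parameter count, since all
    experts normally have to be resident even if few are active per token.
    """
    return num_params_billions * bytes_per_param


# Nominal sizes taken from the variant names; real checkpoints may differ.
moe_bf16 = weight_memory_gb(26, 2.0)    # bfloat16: 2 bytes per parameter
moe_int4 = weight_memory_gb(26, 0.5)    # 4-bit quantization: 0.5 bytes per parameter
dense_bf16 = weight_memory_gb(31, 2.0)
```

The active-parameter count (4B for the 26B-A4B) governs compute per token, not how much memory the weights occupy.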
HuggingFace Transformers Integration
The integration follows the standard HuggingFace assistant_model pattern in generate(). You load both the target and the drafter, then pass the drafter as assistant_model:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

TARGET_MODEL = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL = "google/gemma-4-26B-A4B-it-assistant"

tokenizer = AutoTokenizer.from_pretrained(TARGET_MODEL)

# Load the full target model.
target = AutoModelForCausalLM.from_pretrained(
    TARGET_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the MTP drafter that proposes candidate tokens.
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Explain the key trade-offs of speculative decoding in three paragraphs."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# Passing assistant_model switches generate() to assisted (speculative) decoding.
output = target.generate(
    **inputs,
    assistant_model=assistant,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
That is the complete integration. No custom sampling loop, no patching the model internals. The assistant_model parameter in generate() handles the speculative decoding loop internally.
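Since the speedup depends on workload, it is worth measuring on your own prompts. The helper below is a small sketch, not part of Transformers: `tokens_per_second` is a hypothetical name, and it only assumes you hand it a zero-argument callable that returns the full output token sequence.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate_fn: Callable[[], Sequence[int]],
                      prompt_tokens: int, repeats: int = 3) -> float:
    """Average newly generated tokens per second over `repeats` calls."""
    total_new = 0
    total_time = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        output_ids = generate_fn()
        total_time += time.perf_counter() - start
        # The returned sequence includes the prompt, so subtract its length.
        total_new += len(output_ids) - prompt_tokens
    return total_new / total_time


# Usage with the objects from the example above (not executed here):
#   n_prompt = inputs["input_ids"].shape[1]
#   baseline = tokens_per_second(
#       lambda: target.generate(**inputs, max_new_tokens=512, do_sample=False)[0],
#       n_prompt)
#   assisted = tokens_per_second(
#       lambda: target.generate(**inputs, assistant_model=assistant,
#                               max_new_tokens=512, do_sample=False)[0],
#       n_prompt)
#   print(f"speedup: {assisted / baseline:.2f}x")
```

Running one untimed warm-up call first is advisable, since the first generation often includes one-off setup cost.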
vLLM Integration
For production serving, vLLM supports speculative decoding natively:
from vllm import LLM, SamplingParams

# Argument names vary by vLLM version: some newer releases take these settings
# through a speculative_config dict instead, so check your installed version.
llm = LLM(
    model="google/gemma-4-26B-A4B-it",
    speculative_model="google/gemma-4-26B-A4B-it-assistant",
    num_speculative_tokens=5,
)

sampling_params = SamplingParams(temperature=0, max_tokens=512)
outputs = llm.generate(["Explain transformer attention in detail."], sampling_params)
print(outputs[0].outputs[0].text)
num_speculative_tokens controls how many draft tokens the assistant generates per step. Google's benchmarks used 5. Tuning this per workload can improve or hurt throughput: a larger value pays off when most drafts are accepted, and wastes drafter compute when the output is hard to predict and drafts are rejected early.
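The trade-off can be explored with the standard analytical model from the speculative decoding literature. The sketch below assumes each draft token is accepted independently with probability `alpha`, and that one drafter pass costs `draft_cost` relative to one target pass; both are simplifications, the function names are hypothetical, and the numbers are illustrative rather than measured for Gemma 4.

```python
def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Expected wall-clock speedup with k draft tokens per step.

    alpha: per-token acceptance probability (assumed independent, 0 <= alpha < 1)
    draft_cost: cost of one drafter pass relative to one target pass
    """
    # Expected tokens produced per verification step, including the one token
    # the target always contributes (a correction or a bonus token).
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Each step costs k drafter passes plus one target pass.
    cost_per_step = k * draft_cost + 1
    return expected_tokens / cost_per_step


def best_k(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Return the k in [1, max_k] with the highest expected speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: expected_speedup(alpha, k, draft_cost))
```

Under this model a highly predictable workload favors a larger draft length than an unpredictable one, so a starting point of 5 is best treated as a value to sweep, not a constant.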
MLX (Apple Silicon)
For local inference on Apple Silicon, the mlx-lm library supports the same pattern:
pip install mlx-lm

python -m mlx_lm.generate \
    --model google/gemma-4-E4B-it \
    --draft-model google/gemma-4-E4B-it-assistant \
    --prompt "Summarize the key ideas of speculative decoding."
The E2B and E4B edge variants are the practical choices here — the 31B and 26B-A4B are too large for typical M-series RAM.
Why the 3x Figure Holds
The 3x speedup is conditional on the acceptance rate staying high. For Gemma 4, Google reports an acceptance rate consistently above 80% on standard benchmarks because of the shared embedding space. As a back-of-the-envelope estimate, accepting 80% of 5 draft tokens yields roughly 4 tokens per target model forward pass instead of 1: a theoretical speedup of about 4x, which benchmarks at ~3x once drafting and verification overhead are counted.
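The estimate can be tightened slightly. Drafts are accepted in order and the step stops at the first rejection, so the count is not simply 80% of 5. Assuming each draft token is accepted independently with probability α (a simplification, not a figure from Google), a step with k draft tokens yields one guaranteed target token plus each draft position that survives:

```latex
\mathbb{E}[\text{tokens per target pass}]
  = \sum_{i=0}^{k} \alpha^{i}
  = \frac{1 - \alpha^{k+1}}{1 - \alpha}
\qquad\Rightarrow\qquad
\frac{1 - 0.8^{6}}{1 - 0.8}
  = \frac{1 - 0.262144}{0.2}
  \approx 3.69
\quad (\alpha = 0.8,\ k = 5)
```

Under this assumption the yield is closer to 3.7 tokens per target pass than 4, which leaves a smaller gap to the reported ~3x after overhead.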
For tasks with less predictable outputs (creative writing, adversarial prompts, large codebase completions), the acceptance rate drops, and the effective speedup may be closer to 1.5–2x. Still meaningful — but the headline 3x is the ceiling, not the floor.
Framework Support Matrix
| Framework | Status | Notes |
|---|---|---|
| HuggingFace Transformers | Available | assistant_model parameter |
| vLLM | Available | speculative_model + num_speculative_tokens |
| LiteRT-LM | Available | Edge models (E2B, E4B) only |
| MLX | Available | Apple Silicon, edge models preferred |
| llama.cpp | Available | -md / --model-draft flag |
| Ollama | Not yet listed | Watch for updates |
When to Use MTP Drafters
Use them when you're self-hosting Gemma 4 and throughput matters. The trade-off is that you need to load both the target and the assistant into memory, which increases VRAM requirements. For the 26B-A4B target, the assistant adds roughly 2–3GB.
If you're hitting the inference throughput ceiling (tokens/second) and have spare VRAM, MTP drafters are the lowest-friction speedup available for Gemma 4. No fine-tuning, no architectural changes — just load the assistant model and pass it to generate().
Sources:
- Accelerating Gemma 4: faster inference with multi-token prediction drafters — blog.google
- Gemma 4 MTP overview — ai.google.dev
- Gemma 4 MTP with HuggingFace Transformers — ai.google.dev
- Welcome Gemma 4 — huggingface.co
- Google releases MTP drafters for Gemma 4 — MarkTechPost
- LLaMA.cpp speculative decoding for Gemma 4 — thecodersblog.com