2026-05-17 · By Effloow Content Factory

Gemma 4 MTP Drafters: 3x Faster Inference via Speculative Decoding

Google released MTP drafters for Gemma 4 on May 5, 2026. This paper-to-code walkthrough explains the architecture and shows the HuggingFace Transformers integration.

On May 5, 2026, Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. The headline claim: up to 3x faster inference, with no degradation in output quality.

This is not a research preview. The drafters are available now on HuggingFace under Apache 2.0, compatible with HuggingFace Transformers, vLLM, MLX, LiteRT-LM, and LLaMA.cpp. This article explains the architecture, why it works better than a naive small-model approach, and how to use it.


What Speculative Decoding Is

Standard autoregressive LLM inference has a fundamental bottleneck: to generate N tokens, you need N sequential forward passes through the full model. Each pass is blocked by the previous one.

Speculative decoding breaks this bottleneck with a two-model setup:

  1. A small, fast draft model generates a batch of N candidate tokens autoregressively
  2. The large target model verifies all N candidates in a single parallel forward pass
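  3. If the target accepts all N draft tokens, you get N tokens for roughly the cost of one target forward pass; if it rejects one, the tokens before the rejection are kept and the target's own token replaces the rejected one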

The effective speedup depends on how often the target accepts the draft tokens — called the acceptance rate. Higher acceptance rate = greater speedup.
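The three steps above can be sketched with toy models. This is a minimal illustration, assuming greedy decoding and exact-match verification; `speculative_generate`, `toy_target`, and `toy_draft` are hypothetical names invented for this example, not part of any library. A real implementation scores all draft positions in one batched forward pass and, when sampling, uses a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function mapping a token context to the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Step 1: the draft model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Step 2: the target checks every draft position. A real system does this
        # in one parallel forward pass, so it is counted as a single pass here.
        target_passes += 1
        accepted: List[int] = []
        for candidate in draft:
            expected = target_next(tokens + accepted)
            if candidate == expected:
                accepted.append(candidate)
            else:
                # First mismatch: keep the target's token, discard the rest of the draft.
                accepted.append(expected)
                break
        else:
            # Every draft token matched: the same pass also yields one extra target token.
            accepted.append(target_next(tokens + accepted))

        # Step 3: commit the accepted tokens.
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target(context: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 7."""
    return 0 if context[-1] == 7 else (context[-1] + 1) % 10


result, passes = speculative_generate(toy_target, toy_draft, [0], max_new_tokens=12, k=4)
print(result, passes)
```

Because the target has the final say on every position, the output is identical to what the target would produce on its own; the drafter only changes how many target passes are needed.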


What Gemma 4's MTP Drafters Do Differently

Most speculative decoding implementations pair an arbitrary small model with the target. This has a known problem: the small model was trained separately, so its next-token distribution differs from the target's. Draft tokens can be plausible yet consistently diverge from what the target would pick, driving the acceptance rate down.

Gemma 4's "assistant" models solve this by being architecturally coupled to the target:

  • The drafter shares the target model's embedding table — no vocabulary mismatch
  • The drafter shares the target's KV cache — context is consistent
  • For edge variants (E2B and E4B), an efficient embedding clustering technique eliminates the final logit calculation bottleneck

The result is a higher acceptance rate than you'd get by pairing Gemma 4 with an off-the-shelf small LM. This is the basis for Google's claim that the speedup extends to code, math, and general conversation, not just the easy, repetitive outputs where speculative decoding typically shines.


Available Model Variants

Model variant     Target ID                    Drafter ID
E2B (edge, 2B)    google/gemma-4-E2B-it        google/gemma-4-E2B-it-assistant
E4B (edge, 4B)    google/gemma-4-E4B-it        google/gemma-4-E4B-it-assistant
26B-A4B (MoE)     google/gemma-4-26B-A4B-it    google/gemma-4-26B-A4B-it-assistant
31B               google/gemma-4-31B-it        google/gemma-4-31B-it-assistant

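The E2B and E4B are designed for on-device and edge inference. The 26B-A4B is a Mixture-of-Experts variant with 26B total parameters but only 4B active per token, which makes it the practical pick for single-GPU inference. All 26B parameters must still be resident in memory (about 52 GB in bfloat16), so fitting a 40GB card requires quantized weights.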
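A back-of-the-envelope estimate helps check whether a variant fits a given card. The sketch below counts weight memory only, assuming the nominal parameter counts from the table; it ignores the KV cache, activations, and the drafter itself. `weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical helpers written for this example.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params_billion * BYTES_PER_PARAM[dtype]


# For a Mixture-of-Experts model, every expert must be loaded even though only
# a few are active per token, so the total parameter count is what matters.
for name, params_b in [("26B-A4B", 26.0), ("31B", 31.0)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name:8s} {dtype:9s} ~{weight_memory_gb(params_b, dtype):.1f} GB")
```

The active parameter count governs compute per token, not memory, which is why an MoE model can be fast yet still demand a large card.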


HuggingFace Transformers Integration

The integration follows the standard HuggingFace assistant_model pattern in generate(). You load both the target and the drafter, then pass the drafter as assistant_model:

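from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

TARGET_MODEL = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL = "google/gemma-4-26B-A4B-it-assistant"

# The drafter shares the target's vocabulary, so one tokenizer serves both.
tokenizer = AutoTokenizer.from_pretrained(TARGET_MODEL)

# Load the target and the drafter; device_map="auto" places weights on the available devices.
target = AutoModelForCausalLM.from_pretrained(
    TARGET_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# A raw string keeps the example short; for instruction-tuned checkpoints,
# tokenizer.apply_chat_template() usually produces a better-formed prompt.
prompt = "Explain the key trade-offs of speculative decoding in three paragraphs."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# Passing assistant_model switches generate() to assisted (speculative) decoding.
# do_sample=False selects greedy decoding.
output = target.generate(
    **inputs,
    assistant_model=assistant,
    max_new_tokens=512,
    do_sample=False
)

# Decode the full sequence (prompt plus completion).
print(tokenizer.decode(output[0], skip_special_tokens=True))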

That is the complete integration. No custom sampling loop, no patching the model internals. The assistant_model parameter in generate() handles the speculative decoding loop internally.
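To check the speedup on your own hardware, it helps to time generation with and without the drafter. The helper below is a minimal sketch, not part of Transformers: `tokens_per_second` is a hypothetical function that times any zero-argument callable returning the number of newly generated tokens. A stand-in callable is used here so the snippet runs without downloading a model.

```python
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Average throughput of generate_fn, which must return the count of new tokens."""
    # Warm-up runs absorb one-time costs such as kernel compilation and cache allocation.
    for _ in range(warmup):
        generate_fn()
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def stand_in_generate() -> int:
    """Placeholder for a real generate() call: pretends to emit 10 tokens in 10 ms."""
    time.sleep(0.01)
    return 10


print(f"{tokens_per_second(stand_in_generate):.0f} tokens/s")

# With the models from the walkthrough loaded, the callable might look like this
# (an assumption about how you wrap the call, not a documented recipe):
#
# def run_with_assistant() -> int:
#     out = target.generate(**inputs, assistant_model=assistant,
#                           max_new_tokens=512, do_sample=False)
#     return out.shape[1] - inputs["input_ids"].shape[1]
```

Timing the same prompt with and without `assistant_model`, using identical `max_new_tokens` and greedy decoding, gives a like-for-like ratio for your workload.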


vLLM Integration

For production serving, vLLM supports speculative decoding natively:

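from vllm import LLM, SamplingParams

# Target model plus its drafter. These keyword arguments follow older vLLM
# releases; newer releases expect a speculative_config dict instead, e.g.
# speculative_config={"model": "<drafter id>", "num_speculative_tokens": 5},
# so check the documentation for your installed version.
llm = LLM(
    model="google/gemma-4-26B-A4B-it",
    speculative_model="google/gemma-4-26B-A4B-it-assistant",
    num_speculative_tokens=5,
)

# temperature=0 gives greedy decoding.
sampling_params = SamplingParams(temperature=0, max_tokens=512)
outputs = llm.generate(["Explain transformer attention in detail."], sampling_params)
print(outputs[0].outputs[0].text)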

num_speculative_tokens controls how many draft tokens the assistant generates per step. Google's benchmarks used 5. Tuning it per workload can help or hurt throughput: longer drafts pay off only when the acceptance rate is high, because every draft token after the first rejection is discarded.
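A simple cost model can guide that tuning before you run real benchmarks. The sketch below assumes each draft token is accepted independently with probability `alpha` and that one drafter step costs a fraction `draft_cost` of a target forward pass; both numbers are illustrative assumptions, not measurements for Gemma 4, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass: accepted drafts plus one token from the target."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens gained per step divided by the step's cost in target-pass units."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


def best_num_speculative_tokens(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k that maximizes the estimated speedup."""
    return max(range(1, max_k + 1), key=lambda k: estimated_speedup(alpha, k, draft_cost))


# Illustrative sweep: the acceptance rates and the 5% drafter cost are assumed values.
for alpha in (0.5, 0.8, 0.9):
    k = best_num_speculative_tokens(alpha, draft_cost=0.05)
    print(f"alpha={alpha}: best k={k}, speedup ~{estimated_speedup(alpha, k, 0.05):.2f}x")
```

Under this model the best draft length grows with the acceptance rate, which matches the intuition that long drafts are wasted when rejections come early. Measured throughput should still decide the final setting.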


MLX (Apple Silicon)

For local inference on Apple Silicon, the mlx-lm library supports the same pattern:

pip install mlx-lm

python -m mlx_lm.generate \
  --model google/gemma-4-E4B-it \
  --draft-model google/gemma-4-E4B-it-assistant \
  --prompt "Summarize the key ideas of speculative decoding."

The E2B and E4B edge variants are the practical choices here: at full precision, the 31B and 26B-A4B exceed the unified memory of most M-series machines.


Why the 3x Figure Holds

The 3x speedup is conditional on the acceptance rate staying high. According to Google's published figures, the acceptance rate for Gemma 4 stays above 80% on standard benchmarks because of the shared embedding space. Accepting about 80% of 5 draft tokens yields roughly 4 tokens per target forward pass instead of 1: a theoretical speedup of about 4x, which benchmarks at ~3x once drafter and verification overhead are counted.
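The "roughly 4 tokens" estimate can be made slightly more precise. The worked calculation below assumes a simplified model in which each draft token is accepted independently with probability α and the first rejection discards the rest of the draft; k is the number of draft tokens. The value α = 0.5 is an illustrative assumption for a harder workload, not a published figure.

```latex
\begin{aligned}
\mathbb{E}[\text{tokens per target pass}]
  &= \sum_{i=0}^{k} \alpha^{i} = \frac{1-\alpha^{k+1}}{1-\alpha} \\[4pt]
\alpha = 0.8,\ k = 5:
  &\quad \frac{1-0.8^{6}}{1-0.8} = \frac{0.737856}{0.2} \approx 3.69 \\[4pt]
\alpha = 0.5,\ k = 5:
  &\quad \frac{1-0.5^{6}}{1-0.5} = \frac{0.984375}{0.5} \approx 1.97
\end{aligned}
```

The sum counts the one token the target always contributes (the i = 0 term) plus the probability that each successive draft token survives. Before any drafter overhead, this model gives about 3.7 tokens per pass at 80% acceptance and about 2 at 50%.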

For tasks with less predictable outputs (creative writing, adversarial prompts, large codebase completions), the acceptance rate drops, and the effective speedup may be closer to 1.5–2x. Still meaningful — but the headline 3x is the ceiling, not the floor.


Framework Support Matrix

Framework                   Status            Notes
HuggingFace Transformers    Available         assistant_model parameter
vLLM                        Available         speculative_model + num_speculative_tokens
LiteRT-LM                   Available         Edge models (E2B, E4B) only
MLX                         Available         Apple Silicon, edge models preferred
LLaMA.cpp                   Available         --model-draft (-md) flag
Ollama                      Not yet listed    Watch for updates

When to Use MTP Drafters

Use them when you're self-hosting Gemma 4 and throughput matters. The trade-off is that you need to load both the target and the assistant into memory, which increases VRAM requirements. For the 26B-A4B target, the assistant adds roughly 2–3GB.

If you're hitting the inference throughput ceiling (tokens/second) and have spare VRAM, MTP drafters are the lowest-friction speedup available for Gemma 4. No fine-tuning, no architectural changes — just load the assistant model and pass it to generate().

