ARTICLES · 2026-04-28 · BY EFFLOOW CONTENT FACTORY

vLLM 0.8: Native Llama 4 MoE Routing Explained

How vLLM 0.8 achieves 40% throughput gains on MoE models via Expert Parallelism Load Balancing. Covers EPLB, Llama 4 deployment, and speculative decoding.

Mixture-of-Experts models have dominated the open-weight frontier in 2026. Llama 4 Scout (17B-16E), Llama 4 Maverick (17B-128E), DeepSeek V4-Pro (1.6T-49B active), and Qwen3.6-Plus all use sparse expert routing to scale parameters without proportionally scaling compute. But serving them efficiently is a different problem from serving dense transformers — and vLLM 0.8 is the first release to address it head-on.

This article explains what changed, why MoE routing was a bottleneck before vLLM 0.8, and how to deploy Llama 4 Maverick today using Expert Parallelism and the new load balancer.

Effloow Lab created a Python PoC to simulate the core routing imbalance problem and validate the EPLB concept — details in data/lab-runs/vllm-08-llama4-moe-routing-performance-2026.md.

Why MoE Models Break Standard Inference Pipelines

In a dense transformer, every token flows through the same set of layers. Parallelism strategy is straightforward: split layers across tensor-parallel GPUs, and each GPU processes a shard of every token's computation. Load is inherently balanced.

MoE breaks this assumption. A MoE layer contains N expert networks (Llama 4 Maverick has 128 of them). For each token, a learned router computes logits over all experts and selects only the top-k — typically top-1 for Llama 4. Only the selected experts receive the token; the rest compute nothing.
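
To make the mechanism concrete, here is a minimal sketch of the routing step in plain NumPy. The dimensions and the random router weights are illustrative only; in the real model the router is a trained linear layer and the whole step runs inside a fused GPU kernel.

import numpy as np

# Toy top-1 router. Dimensions are illustrative, not Llama 4's real config.
rng = np.random.default_rng(0)
hidden_dim, num_experts, num_tokens = 64, 128, 10_000

router_w = rng.normal(size=(hidden_dim, num_experts))   # learned in practice
tokens = rng.normal(size=(num_tokens, hidden_dim))

logits = tokens @ router_w          # [num_tokens, num_experts]
chosen = logits.argmax(axis=-1)     # top-1 selection, as in Llama 4

# Per-expert token counts: the quantity Expert Parallelism must balance.
loads = np.bincount(chosen, minlength=num_experts)
print(f"hottest expert: {loads.max()} tokens, mean: {loads.mean():.1f}")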

This creates expert load imbalance: some experts consistently attract more tokens than others. When you distribute experts across GPUs for Expert Parallelism (EP), the GPU holding the hot expert becomes the bottleneck. Every other GPU finishes its expert computation and then waits — because MoE layers use a synchronization barrier before the next layer begins.

The Math Behind the Bottleneck

Each parallel expert compute step has this structure:

[GPU 0: experts 0-31]  [GPU 1: experts 32-63]  [GPU 2: experts 64-95]  [GPU 3: experts 96-127]
         ↓ tokens_0              ↓ tokens_1              ↓ tokens_2              ↓ tokens_3
         compute                 compute                 compute                 compute
                      ← ─ ─ ─ ─ synchronize ─ ─ ─ ─ →
                           (wait for slowest GPU)

Throughput is determined by total_tokens / max(tokens_per_rank). If one GPU handles 4× the mean token count (our Lab PoC measured 4.36× imbalance on a skewed Llama 4 Maverick workload), the system wastes roughly 75% of its aggregate compute capacity.
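
The 75% figure follows directly from that formula. A quick sketch, using a synthetic load vector in which one rank sits at 4× the mean:

def wasted_fraction(loads):
    # Every rank waits at the barrier for the slowest one, so useful work
    # is total_tokens while paid-for capacity is num_ranks * max_load.
    return 1 - sum(loads) / (len(loads) * max(loads))

loads = [400, 57, 57, 57, 57, 57, 57, 58]   # mean 100, hot rank at 4x mean
print(f"{wasted_fraction(loads):.0%} of aggregate compute sits idle")  # 75%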

Real workloads often exhibit this skew because different natural language domains activate different expert clusters. Code prompts, English prose, and multilingual input all tend to route to different subsets of experts.

Expert Parallelism in vLLM 0.8

Before vLLM 0.8, MoE layers in vLLM used Tensor Parallelism (TP) as the default — the same approach used for dense models. In TP mode, each MoE layer's weight matrices are column/row-sharded across the TP group. Every token hits every GPU on every MoE forward pass, even when that GPU's expert shard does nothing for that token.

vLLM 0.8 (starting at v0.8.3 for Llama 4) introduces native Expert Parallelism, enabled by --enable-expert-parallel. In EP mode:

  • Each EP rank owns a distinct subset of experts
  • Only the experts selected by the router receive tokens — all others are truly skipped
  • Inter-GPU communication shifts from all-reduce (TP) to all-to-all dispatch (EP)

The all-to-all pattern is more expensive per-token in the communication layer, but for MoE models with sparse activation, the overall throughput gain is substantial because compute — not communication — was the bottleneck.

Expert Parallelism Load Balancer (EPLB)

The load balancing challenge is addressed by EPLB, enabled with --enable-eplb. The implementation follows the hierarchical load balancing policy from DeepSeek's large-scale serving work, adapted for vLLM's V1 engine.

How EPLB Works

  1. Load collection: Every forward pass records per-expert token counts across all EP ranks.
  2. Sliding window aggregation: A window of recent batches provides a moving average of expert utilization — smoothing out single-batch spikes.
  3. Expert remapping: Periodically, EPLB recomputes an optimal expert-to-rank assignment. Hot experts (above a load threshold) are replicated onto underutilized ranks. Cold experts may be consolidated.
  4. Live migration: The new mapping takes effect on the next batch with no request interruption.

The core insight is that you can trade memory (by replicating a hot expert's weights to a second GPU) for latency. If expert 17 is receiving 4× the mean load, placing a replica on a second rank halves its per-rank load — at the cost of twice the VRAM for that expert's weights.
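
A minimal sketch of that trade, in the same spirit as our Lab PoC: hand spare expert slots to whichever expert currently has the highest per-copy load. This greedy toy is not vLLM's implementation (which follows DeepSeek's hierarchical policy), but it shows why replication flattens the load curve.

import numpy as np

def rebalance(loads, num_slots):
    # Greedy toy: give each spare physical slot to the expert whose
    # per-replica load is currently highest, then report per-slot loads.
    loads = np.asarray(loads, dtype=float)
    replicas = np.ones(len(loads), dtype=int)
    for _ in range(num_slots - len(loads)):
        replicas[(loads / replicas).argmax()] += 1
    return loads / replicas

loads = np.random.default_rng(1).zipf(1.5, size=128)   # skewed synthetic loads
balanced = rebalance(loads, num_slots=160)             # 32 slots for replicas
print(f"max per-slot load: {loads.max()} -> {balanced.max():.1f}")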

Our Lab PoC simulated this on Llama 4 Maverick's 128-expert topology with 10,000 tokens under skewed workload conditions:

Baseline (no EPLB):
  Max load (hot expert): 341 tokens
  Mean load: 78.1 tokens
  Imbalance ratio: 4.36×

After EPLB (simulated):
  Max load: 93.8 tokens
  Imbalance ratio: 1.20×
  Max load reduction: 72.5%

The theoretical throughput model (throughput ∝ tokens / max_load) shows 3.64× improvement from perfect rebalancing. In practice, vLLM 0.8 reports approximately 40% real-world throughput improvement — lower than the theoretical ceiling because EPLB overhead, memory constraints, and other system bottlenecks cap the gains.
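
The 3.64× ceiling is just the ratio of the two max loads under that model:

# throughput ∝ total_tokens / max_load, so the theoretical speedup is
# the ratio of max loads before and after rebalancing:
print(f"{341 / 93.8:.2f}x")   # 3.64x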

It is also worth noting that Red Hat's benchmarks found EPLB to be a net negative on some H100 + DeepSeek configurations, suggesting its effectiveness is workload-dependent. For bursty, skewed workloads typical of multilingual or multi-domain Llama 4 deployments, EPLB tends to deliver the largest gains.

Speculative Decoding Improvements for MoE

vLLM 0.8 also ships two speculative decoding improvements that interact with MoE routing:

GPU-Accelerated NGram Speculative Decoding

Previously, NGram draft generation ran on CPU, creating a CPU-GPU synchronization overhead that made it impractical for high-throughput serving. In 0.8, NGram spec decode runs on GPU and integrates with the async scheduler, reducing overhead enough to yield net speedup on typical chatbot workloads.

For MoE models, the interaction is subtle: the draft model and target model may activate different expert subsets per speculative token. If most speculative tokens are accepted, the expert routing patterns become more predictable (frequent short-distance continuations), which can slightly improve EPLB's load prediction accuracy.
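
For intuition, here is the n-gram drafting idea in a few lines of Python. This toy matches the recent context against earlier tokens and proposes the continuation that followed last time; the actual 0.8 implementation performs this lookup on the GPU inside the engine, which is what removes the CPU-GPU sync.

def ngram_draft(token_ids, n=3, k=4):
    # Find the most recent earlier occurrence of the last n tokens and
    # propose the k tokens that followed it as draft tokens to verify.
    ctx = token_ids[-n:]
    for i in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[i:i + n] == ctx:
            return token_ids[i + n:i + n + k]
    return []   # no match: fall back to normal decoding

print(ngram_draft([1, 2, 3, 4, 1, 2], n=2, k=2))   # -> [3, 4]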

Async Scheduling with Zero-Bubble Overlap

The async scheduler introduced in earlier vLLM versions is now the default in 0.8. Its key MoE benefit: the all-to-all dispatch communication for batch N can overlap with expert computation for batch N-1, hiding communication latency. This is the "zero-bubble" property — the GPU pipeline has no idle periods between consecutive batches.
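
A toy latency model makes the property visible. Assume each batch needs an all-to-all dispatch followed by expert compute, with illustrative millisecond costs; overlapping dispatch for batch N with compute for batch N-1 hides the dispatch entirely once compute dominates.

def pipeline_time(batches, overlap):
    # Each entry is (dispatch_ms, compute_ms). With overlap, dispatch for
    # batch N runs concurrently with compute for batch N-1, so only the
    # slower of the two paces the pipeline.
    t, prev_compute = 0.0, 0.0
    for dispatch, compute in batches:
        t += max(dispatch, prev_compute) if overlap else prev_compute + dispatch
        prev_compute = compute
    return t + prev_compute

batches = [(1.0, 4.0)] * 16                    # illustrative costs
print(pipeline_time(batches, overlap=False))   # 80.0 ms: dispatch bubbles
print(pipeline_time(batches, overlap=True))    # 65.0 ms: dispatch hidden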

The improvement is most pronounced for large Maverick-style models (128 experts) where dispatch communication is substantial.

Deploying Llama 4 Maverick with vLLM 0.8

Hardware Requirements

Model               Architecture       Min GPU         Recommended   VRAM (BF16)
Llama 4 Scout       17B-16E, top-1     2× H100 80GB    4× H100       ~80GB total
Llama 4 Maverick    17B-128E, top-1    4× H100 80GB    8× H100       ~350GB total
Maverick + EPLB     17B-128E, top-1    8× H100         8-16× H100    ~350GB + replica overhead

Serving Llama 4 Scout with Expert Parallelism

pip install "vllm>=0.8.3"

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --host 0.0.0.0 \
  --port 8000

Scout's 16 experts fit well on 2-4 GPUs. With --tensor-parallel-size 2, each GPU owns 8 experts. EP provides better locality than TP for Scout's sparse forward passes.

Serving Llama 4 Maverick with EPLB

Maverick's 128-expert topology is where EPLB pays the largest dividend. The recommended configuration for 8× H100:

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-eplb \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 \
  --port 8000

EPLB is specifically valuable for Maverick because 128 experts over 8 GPUs means each GPU owns 16 experts by default — but the hot-expert problem is amplified with 128 total experts, since natural language workloads create stronger clustering among a small subset.

Verifying Expert Parallelism Is Active

After launch, the server log should show:

INFO: Expert parallelism enabled (EP size: 8)
INFO: EPLB enabled with window size: 50 batches

Query the OpenAI-compatible endpoint to confirm:

curl http://localhost:8000/v1/models
# Look for "meta-llama/Llama-4-Maverick-17B-128E-Instruct" in the response
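
The same smoke test from Python, using the official openai client against the OpenAI-compatible endpoint (the base URL and placeholder API key match the launch command above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List served models, then send a short completion as a liveness check.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)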

Run a throughput benchmark:

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128

Common Mistakes When Serving MoE Models

Mistake 1: Using tensor parallelism without expert parallelism for large MoE models

For 64+ expert models, TP alone creates unnecessary all-reduce traffic on every MoE forward pass. Always pair --tensor-parallel-size with --enable-expert-parallel for MoE models.

Mistake 2: Enabling EPLB without sufficient VRAM headroom

EPLB replicates hot experts, which increases VRAM usage. If you're running at --gpu-memory-utilization 0.99, EPLB may trigger OOM during rebalancing. Use 0.90-0.95 as the utilization target when EPLB is enabled.

Mistake 3: Expecting EPLB to help on balanced workloads

EPLB adds overhead: load collection on every forward pass, plus periodic remapping. For single-tenant or homogeneous workloads where all requests come from the same domain, routing is already comparatively balanced and EPLB's overhead is a net negative. Enable it only for mixed-workload serving.

Mistake 4: Forgetting to set --max-model-len explicitly for Maverick

Llama 4 Maverick supports up to 10 million tokens of context. The default vLLM max model length may be set based on available KV cache, which could be much smaller. For long-context workloads, set --max-model-len explicitly and size your hardware accordingly.
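
A back-of-envelope sizing helper shows why. The architecture numbers below are placeholders, not Maverick's published config; substitute the layer count, KV head count, and head dimension from the model's config.json.

def kv_cache_gib(seq_len, batch, layers, kv_heads, head_dim, bytes_per_el=2):
    # Per token: 2 tensors (K and V) x layers x kv_heads x head_dim elements.
    # bytes_per_el=2 assumes a BF16 KV cache.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
    return seq_len * batch * per_token / 2**30

# Placeholder values -- read the real ones from the model's config.json.
print(f"{kv_cache_gib(1_048_576, 1, layers=48, kv_heads=8, head_dim=128):.0f} GiB")

Even at the 1M-token --max-model-len used earlier, a single sequence's KV cache can run into the hundreds of GiB under these assumptions, which is why the flag has to be sized against real hardware rather than left at the model's maximum.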

The Research Behind EPLB: Key Papers

The EPLB implementation in vLLM synthesizes several research directions:

DeepSeek EPLB (Dec 2025): DeepSeek's technical report introduced hierarchical and global load balancing policies for expert parallel inference. The hierarchical approach maintains group-level load statistics and performs two-stage balancing: first within a node, then across nodes. This reduces communication overhead compared to a single global rebalance.

ReaLB (arXiv:2604.19503, April 2026): Takes a different angle — instead of remapping experts, it dynamically adjusts the computation precision of experts at runtime on a per-EP-rank basis. For ranks handling fewer tokens, the experts compute at lower precision (FP4 via Tensor Cores), completing faster and reducing the idle wait for other ranks. Result: 1.29× layer-level speedup. The paper is implemented on top of vLLM and evaluated on Kimi-VL and Qwen3-VL.

Efficient MoE Serving in the Memory-Bound Regime (arXiv:2512.09277): Shows that alternatives to static EPLB routing achieve up to 21% higher total token throughput by minimizing the number of distinct experts that must be loaded per batch — a memory bandwidth optimization rather than a compute balance optimization.

These papers collectively point toward adaptive, runtime-aware routing as the next frontier beyond the static EPLB in vLLM 0.8.

Frequently Asked Questions

Q: Does Expert Parallelism work with tensor parallelism at the same time?

Yes. vLLM supports mixed parallelism (TP + EP). In this mode, non-MoE layers use tensor parallelism while MoE layers switch to expert parallelism. This is the recommended configuration for large MoE models where both approaches are needed. The TP group and EP group are configured independently.

Q: Is EPLB stable for production traffic?

As of vLLM 0.8.3, EPLB is production-ready for high-concurrency mixed-domain workloads. However, Red Hat's benchmarks found it neutral-to-negative on some H100 + DeepSeek configurations. Test with your specific workload before enabling in production. Disable EPLB and compare throughput under load.

Q: How does vLLM 0.8 compare to SGLang or TGI for MoE serving?

For MoE models in 2026, vLLM 0.8 and SGLang are the two strongest options. SGLang excels at structured output throughput and has its own expert parallelism implementation, while vLLM 0.8's EPLB gives it an edge on dynamic mixed-workload scenarios. TGI lacks native expert parallelism for sparse MoE models. See our full LLM inference engine comparison for side-by-side benchmarks.

Q: What is the performance difference between vLLM and Ollama for Llama 4 Scout?

vLLM significantly outperforms Ollama for server-side concurrent inference. Ollama's single-request optimization means throughput does not scale with concurrency. SitePoint's 2026 benchmark shows vLLM at approximately 3,000-12,500 tokens/sec (hardware-dependent) for 8B-14B models under concurrent load, versus Ollama's near-linear degradation past 4-8 concurrent requests. For Llama 4 Scout specifically, vLLM is the right choice for any production serving scenario.

Q: Can I run Llama 4 Scout on a single H100 80GB GPU?

Scout's total parameter count is approximately 109B (17B active, spread across 16 experts). In BF16 the weights alone come to roughly 218GB (109B parameters × 2 bytes), so an unquantized Scout does not fit on a single H100 80GB, and KV cache needs VRAM on top of the weights. Single-GPU serving requires weight quantization; for BF16, multiple H100s are the practical minimum for useful context lengths with Scout.

Q: How does vLLM 0.8's async scheduler interact with EPLB?

The async scheduler pipeline-parallelizes the all-to-all dispatch communication with expert computation. EPLB runs its load collection and remapping in the background, outside the critical path. The remapping itself is a brief operation (microseconds) that takes effect at batch boundaries. So EPLB and async scheduling compose cleanly — neither blocks the other.

Key Takeaways

  • MoE routing imbalance is a real throughput bottleneck: In skewed workloads, a single hot expert can receive 4× the mean token load, leaving 75%+ of GPU compute capacity idle while other ranks wait.
  • Expert Parallelism (EP) replaces Tensor Parallelism for MoE layers: Enable with --enable-expert-parallel. Moves from all-reduce (TP) to all-to-all dispatch (EP), which scales better for sparse models.
  • EPLB redistributes hot experts at runtime: --enable-eplb periodically remaps expert assignments to balance load across EP ranks. Most effective for high-concurrency mixed-domain workloads.
  • Leave VRAM headroom when EPLB is on: Expert replication increases peak VRAM usage. Use --gpu-memory-utilization 0.90-0.95, not the default maximum.
  • Research is moving beyond static EPLB: Papers like ReaLB (arXiv:2604.19503) show dynamic precision adjustment as an alternative approach; expect vLLM 0.9+ to incorporate more adaptive strategies.
  • Production deployment guide: For vLLM server configuration, hardware sizing, and monitoring in production, see our vLLM in Production guide.

Bottom Line

vLLM 0.8 is the right foundation for Llama 4 Maverick in production. Expert Parallelism with EPLB is not a silver bullet — it adds overhead and benefits vary by workload — but for mixed-domain, high-concurrency serving, it is the difference between burning GPU budget on idle synchronization barriers and actually saturating your hardware.
