Articles · 2026-04-20 · By Effloow Content Factory

vLLM in Production: Open-Source LLM Inference Engine Guide 2026

2026 guide to vLLM in production: v1 architecture, Model Runner V2, Docker/Kubernetes setup, benchmarks vs SGLang and TGI, and monitoring tips.

There is a quiet consensus forming among AI infrastructure teams in 2026: if you are serving open-weight LLMs at scale, you are probably running vLLM. Amazon uses it to power Rufus, the shopping assistant that handles millions of product queries daily. Roblox deployed it to serve over 4 billion tokens per week for their AI assistant, cutting latency in half in the process. Meta, Mistral AI, IBM, and Stripe all run vLLM in production. It joined the PyTorch Foundation, with compute resources flowing in from Alibaba Cloud, AMD, AWS, Google Cloud, NVIDIA, and Red Hat.

The reason is not marketing. vLLM earned its position through a series of architectural decisions that compound: PagedAttention for memory efficiency, then continuous batching for throughput, and now, in 2026, a completely rewritten execution core called Model Runner V2. This guide explains what changed, why it matters, and exactly how to deploy it — from a single GPU on your workstation to a multi-node Kubernetes cluster processing hundreds of requests per second.

Why This Matters Now

LLM inference has a physics problem. Attention is quadratic in sequence length. The KV cache for a single long-context request can consume gigabytes of GPU memory. Traditional batching strategies that worked for CNNs and BERT-class models break down when requests have wildly different lengths and are being processed simultaneously at different positions in their generation.

vLLM's original breakthrough, PagedAttention, solved this by borrowing a concept from operating systems: virtual memory paging. Instead of pre-allocating a contiguous block of GPU memory for each request's KV cache, PagedAttention divides the KV cache into fixed-size "pages" and maps them on demand — exactly like how an OS maps virtual pages to physical frames. Fragmentation drops from 60–80% in naive implementations to under 4%. You can fit more concurrent requests on the same GPU.

The March 2026 release of vLLM v0.17.1 with Model Runner V2 (MRV2) is the second major inflection point. The execution core was rebuilt from scratch around three design principles: be modular, be GPU-native, and be async-first. The results are measurable: 56% higher throughput on NVIDIA GB200 hardware, and meaningfully lower TTFT across the board.

The V1 Architecture: What Actually Changed

The original vLLM engine scheduled tokens by copying intermediate tensors between GPU and CPU during each scheduling decision. At high concurrency, this CPU-GPU round-trip became the bottleneck. The V1 engine eliminated it.

The V1 architecture splits the process into distinct layers communicating via ZeroMQ sockets: the client-facing OpenAI-compatible API server, the scheduler, the engine core, and GPU workers each run in isolation. Scheduling decisions happen without touching the GPU. Workers receive clean, pre-computed inputs.

Model Runner V2 goes further. Three changes define it:

GPU-native input preparation. Previously, the model runner prepared attention metadata on the CPU and transferred it to the GPU before each forward pass. MRV2 moves this work onto the GPU itself using Triton kernels. The CPU is no longer in the hot path.

Async-first execution. The scheduler and worker now overlap: while the GPU executes step N, the scheduler prepares step N+1. In practice this means CPU overhead — which historically scaled poorly with batch size — becomes nearly invisible.

Triton-native sampler. Token sampling is now a Triton kernel rather than a PyTorch operation, eliminating CPU-GPU synchronization points that previously interrupted the pipeline.

The practical outcome: under heavy load (100+ concurrent users), vLLM v0.17.1 keeps the GPU at 85–92% utilization. The previous architecture would fall to 70–75% as scheduling overhead grew.

Core Concepts You Need to Know

PagedAttention and KV Cache Management

Every generative inference request accumulates a Key-Value cache as it generates tokens. This cache must persist across all decoding steps and grows with sequence length. On a single H100 80GB, a Llama 3.1 70B model with 128K context requires roughly 32GB of KV cache per request at full context — you simply cannot pre-allocate that for many requests simultaneously.

PagedAttention manages this with a block table: a mapping from (request_id, layer, head) to physical memory blocks. Blocks are 16–32 tokens each by default. When a request finishes, its blocks are immediately returned to the pool. New requests pull from the pool on demand. The system behaves like a memory allocator with near-zero fragmentation.
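The block-table mechanics can be sketched in a few lines of Python. This is a toy allocator for intuition only, not vLLM's actual implementation; the `BlockPool` class and its method names are illustrative:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

class BlockPool:
    """Toy PagedAttention-style allocator: maps requests to physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.table = {}   # request_id -> list of physical block ids
        self.length = {}  # request_id -> tokens generated so far

    def append_token(self, rid: str) -> None:
        """Allocate a new block only when the current one fills up."""
        n = self.length.get(rid, 0)
        if n % BLOCK_SIZE == 0:  # first token, or current block is full
            self.table.setdefault(rid, []).append(self.free.pop())
        self.length[rid] = n + 1

    def release(self, rid: str) -> None:
        """Finished requests return their blocks to the pool immediately."""
        self.free.extend(self.table.pop(rid, []))
        self.length.pop(rid, None)

pool = BlockPool(num_blocks=1024)
for _ in range(40):
    pool.append_token("req-1")   # 40 tokens -> ceil(40/16) = 3 blocks
print(len(pool.table["req-1"]))  # 3
pool.release("req-1")
print(len(pool.free))            # 1024 again: zero fragmentation
```

The key property is visible in the last two lines: memory is allocated in small fixed-size units on demand and returned the instant a request completes, which is why fragmentation stays negligible.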

Continuous Batching

Static batching — waiting to fill a batch before starting inference — destroys latency for LLMs because tokens are generated one at a time and requests complete at different steps. Continuous batching (also called iteration-level scheduling) adds new requests to the batch mid-generation as slots free up. A request that finishes at step 47 frees its KV cache blocks immediately; a new request can claim them at step 48. GPU utilization stays high without sacrificing per-request latency.
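A toy simulation makes the scheduling difference concrete. This sketch (illustrative, not vLLM's scheduler) admits waiting requests into freed slots before every decode step:

```python
from collections import deque

def continuous_batch(requests, max_batch: int):
    """Toy iteration-level scheduler: requests are (id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}      # request_id -> tokens remaining
    finished_at = {}  # request_id -> step at which it completed
    step = 0
    while waiting or running:
        # Admit new requests into freed slots before each decode step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        for rid in list(running):  # one decode step for every running request
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot frees mid-generation
                finished_at[rid] = step
    return finished_at

done = continuous_batch([("a", 3), ("b", 10), ("c", 4)], max_batch=2)
print(done)  # {'a': 3, 'c': 7, 'b': 10}
```

Request "c" enters at step 4, the moment "a" finishes, and completes at step 7. Under static batching it would have waited for the entire first batch, finishing no earlier than step 14.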

Speculative Decoding

vLLM's V1 engine supports speculative decoding with zero-bubble async scheduling. A small "draft" model proposes multiple tokens; the main model verifies them in parallel. When draft and target agree, you get multiple tokens from a single forward pass of the target model. For workloads where the draft model prediction accuracy is high — code completion, templates, structured outputs — this can double effective throughput without touching the model or the hardware.
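The acceptance logic for the greedy case can be sketched as follows. This is a simplification: with temperature sampling, real verification uses rejection sampling over the draft and target distributions rather than exact token matching:

```python
def accepted_prefix(draft_tokens, target_tokens):
    """Greedy speculative decoding acceptance (sketch): keep draft tokens
    while they match the target model's own choices, then take the target's
    token at the first mismatch as a free correction."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction token from the target model
            break
    return accepted

# Draft proposes 4 tokens; the target verifies all of them in ONE forward pass.
print(accepted_prefix([5, 9, 2, 7], [5, 9, 3, 1]))  # [5, 9, 3]
```

Three tokens emerge from a single target-model forward pass here. When the draft model agrees often, the average tokens-per-pass rises well above one, which is where the throughput gain comes from.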

Disaggregated Prefill and Decode

Prefill (processing the input prompt) is compute-bound: you want large batch sizes and maximum arithmetic intensity. Decode (generating each new token) is memory-bandwidth-bound: each step reads the entire KV cache but does minimal compute. Running both in the same scheduler creates head-of-line blocking: a long prefill blocks decode steps for in-flight requests, causing latency spikes.

vLLM's V1 engine schedules these phases independently. In multi-node setups, you can route prefill requests to dedicated nodes and decode requests to others, each optimized for their respective bottleneck.

Benchmarks: vLLM vs SGLang vs TGI

Engine         | Throughput (H100, Llama 8B) | GPU Utilization | Status (2026)               | Best For
vLLM v0.17.1   | ~12,500 tok/s               | 85–92%          | Active (PyTorch Foundation) | General production workloads
SGLang         | ~16,200 tok/s (+29%)        | 90–94%          | Active                      | Multi-turn, shared-prefix workloads
TGI            | ~9,800 tok/s                | 68–74%          | Maintenance mode (Dec 2025) | Legacy HuggingFace integrations
TensorRT-LLM   | ~18,000 tok/s               | 95%+            | Active (NVIDIA-specific)    | NVIDIA-only, maximum throughput

A few things this table does not show: SGLang's 29% throughput advantage comes specifically from RadixAttention, which caches shared prefixes across requests in a radix tree. If your requests do not share prefixes — a diverse generation workload, for example — vLLM and SGLang perform similarly. The TGI numbers are included for context, but Hugging Face officially recommends migrating to vLLM or SGLang for new production deployments.

TensorRT-LLM is the fastest on paper but requires NVIDIA hardware, CUDA optimization passes per model, and a more complex deployment pipeline. It is the right choice when you need every last tok/s on a fixed NVIDIA cluster. vLLM is the right choice when you need something that works reliably across hardware, models, and team members with varying infrastructure expertise.

Getting Started: Single-GPU Setup

The fastest path is the official Docker image. Prerequisites: NVIDIA driver ≥ 525, CUDA ≥ 12.1, NVIDIA Container Toolkit ≥ 1.14, Docker Engine ≥ 23.0.

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.1 \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

The --max-model-len flag limits the KV cache allocation. If you set it too high, you risk OOM errors under concurrent load. If you set it too low, long-context requests will be rejected. A conservative starting point is 32768 (32K tokens) on an 80GB GPU for a 70B model; you can increase it as you understand your actual workload distribution.
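The back-of-envelope arithmetic for this tuning is worth internalizing. Per-token KV cache cost is 2 (keys and values) × layers × KV heads × head dim × bytes per element; the model shape below assumes Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128, BF16):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128, BF16.
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes = 128 KiB per token

# With --max-model-len 32768, one worst-case full-length request needs:
per_request_gib = per_token * 32768 / 2**30
print(per_request_gib)  # 4.0 GiB
```

Divide the KV cache budget (GPU memory minus weights, times --gpu-memory-utilization) by this per-request worst case and you get a floor on guaranteed concurrency; real concurrency is higher because most requests are far shorter than the maximum.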

Once running, the server exposes an OpenAI-compatible API:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Explain vLLM PagedAttention in one paragraph:",
    "max_tokens": 256,
    "temperature": 0.7
  }'

Any client that works with OpenAI's API works here — LangChain, LlamaIndex, the official OpenAI Python SDK with a custom base_url.
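For illustration, here is the same request as the curl example built with only the Python standard library (the endpoint and model name come from the Docker example above; in practice you would more likely point the OpenAI SDK's base_url at the server):

```python
import json
from urllib import request

def completion_request(base_url: str, model: str, prompt: str, **params):
    """Build an OpenAI-compatible /v1/completions request for a vLLM server."""
    body = json.dumps({"model": model, "prompt": prompt, **params}).encode()
    return request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = completion_request(
    "http://localhost:8000",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Explain vLLM PagedAttention in one paragraph:",
    max_tokens=256,
    temperature=0.7,
)
print(req.full_url)  # http://localhost:8000/v1/completions

# To actually send it (requires the server from the Docker example):
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```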

Multi-GPU Setup with Tensor Parallelism

For models that do not fit on a single GPU (70B, 405B, and larger), vLLM uses tensor parallelism to shard the model weights across multiple GPUs. The --tensor-parallel-size flag controls the shard count, which must equal the number of GPUs you are using.

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.1 \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.92

For multi-node deployments (models that need more than 8 GPUs), vLLM supports pipeline parallelism via --pipeline-parallel-size. This splits the model's layers across nodes. The V1 engine's piecewise CUDA graphs preserve graph-captured execution across pipeline boundaries, a technical detail that eliminates the synchronization overhead that made pipeline parallelism slow in the V0 engine.

Kubernetes Deployment

A production Kubernetes deployment needs four things: a Deployment with GPU resource requests, a Service for stable DNS, Prometheus scraping for metrics, and horizontal scaling tied to queue depth rather than CPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.17.1
        args:
          - "--model"
          - "meta-llama/Meta-Llama-3.1-8B-Instruct"
          - "--max-model-len"
          - "32768"
          - "--gpu-memory-utilization"
          - "0.90"
          - "--tensor-parallel-size"
          - "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "48Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token

The initialDelaySeconds: 120 for the readiness probe is important. vLLM loads the full model weights before accepting traffic; this can take 90–180 seconds depending on model size and storage speed. If the probe fires too early, Kubernetes kills the pod before it finishes loading.

For autoscaling, KEDA (Kubernetes Event-Driven Autoscaling) lets you scale on the vLLM queue depth metric exposed via Prometheus, which is far more meaningful than CPU utilization for GPU inference workloads.

Monitoring: The Four Metrics That Matter

vLLM exposes a Prometheus endpoint at /metrics. Four signals cover most production alerting needs:

Time to First Token (TTFT): The interval between request arrival and the first generated token. This is the metric users experience as latency. Alert threshold: P99 above 5 seconds under normal load.

Inter-Token Latency (ITL): The average time between successive tokens for streaming responses. High ITL manifests as choppy streaming. Alert threshold: P99 above 200ms.

KV Cache Utilization: The fraction of KV cache blocks in use. When this approaches 100%, new requests start getting rejected or queued. Alert threshold: above 90% sustained.

Queue Depth: The number of requests waiting for a KV cache slot. Rising queue depth is the earliest signal that you need more capacity. Alert threshold: above 10 sustained for more than 60 seconds.

# Quick health check
curl http://localhost:8000/metrics | grep -E "vllm:(kv_cache|request_queue|ttft)"
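The four thresholds above translate directly into alerting rules. A minimal sketch of the evaluation logic, assuming you have already scraped and aggregated the metrics (the metric names here are illustrative placeholders — check your actual /metrics output for the exact names your vLLM version exposes):

```python
def check_thresholds(metrics: dict) -> list:
    """Return the names of the alert conditions that are firing.
    Metric names are illustrative; map them to your real /metrics output."""
    rules = [
        ("ttft_p99_seconds", 5.0),               # TTFT P99 above 5s
        ("inter_token_latency_p99_seconds", 0.2),  # ITL P99 above 200ms
        ("kv_cache_utilization", 0.90),           # cache above 90%
        ("queue_depth", 10),                      # more than 10 waiting
    ]
    return [name for name, limit in rules if metrics.get(name, 0) > limit]

sample = {"ttft_p99_seconds": 6.2, "kv_cache_utilization": 0.55, "queue_depth": 3}
print(check_thresholds(sample))  # ['ttft_p99_seconds']
```

In production you would express these as Prometheus alerting rules rather than application code, but the decision logic is the same.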

Common Mistakes

Setting --gpu-memory-utilization too high. The default is 0.90. Setting it to 0.98 or 1.0 might seem like it gives you more KV cache space, but it leaves no headroom for framework overhead, fragmentation, and unexpected spikes. OOM kills under load are worse than slightly reduced cache capacity.

Not pinning the Docker image tag. vllm/vllm-openai:latest will silently upgrade between minor versions. Pin to a specific tag in production and upgrade deliberately.

Ignoring --max-model-len for long-context models. Models like Llama 4 Maverick support 10 million token context windows. That does not mean you should allocate KV cache for 10M tokens. Set --max-model-len to match your actual workload's 99th-percentile request length, not the model's theoretical maximum.

Using tensor parallelism across slow network links. Tensor parallelism requires high-bandwidth interconnect between GPUs. On a single node with NVLink, it works well. Across nodes over 10GbE Ethernet, the all-reduce communication overhead exceeds the computation savings. Use pipeline parallelism for multi-node setups instead.

Not configuring --block-size for your workload. The default block size (16 tokens) is a good general choice but not optimal for all workloads. Long-context workloads with few requests benefit from larger blocks (32 or 64 tokens); high-concurrency short-context workloads benefit from smaller blocks.

FAQ

Q: How does vLLM compare to running models through the OpenAI API?

vLLM is a self-hosted inference server, not a managed API. You get lower per-token cost at scale (no vendor margin), full control over the model and its configuration, and data privacy. The tradeoff is operational responsibility: you manage the hardware, the deployment, and the on-call rotation. For many teams serving more than ~10M tokens per day, the economics favor self-hosting. Below that threshold, managed APIs are usually more cost-effective when you factor in engineering time.

Q: Does vLLM support quantized models?

Yes. vLLM supports GPTQ, AWQ, and FP8 quantization natively. For production, AWQ (Activation-aware Weight Quantization) with 4-bit weights typically delivers the best accuracy-per-VRAM tradeoff. A 70B model quantized to 4-bit with AWQ fits on two 80GB H100s with room for a substantial KV cache, versus four GPUs for BF16.
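The VRAM arithmetic behind that claim is simple. A rough estimate of weight memory for a dense model (weights only, ignoring KV cache and framework overhead):

```python
def weight_gib(params_billion: float, bits: int) -> float:
    """Approximate weight memory for a dense model, weights only."""
    return params_billion * 1e9 * bits / 8 / 2**30

print(round(weight_gib(70, 16)))  # ~130 GiB in BF16 -> needs 4x 80GB with KV cache
print(round(weight_gib(70, 4)))   # ~33 GiB at 4-bit -> fits on 2x 80GB with headroom
```

The difference between 130 GiB and 33 GiB is not just fewer GPUs; on the two-GPU setup the freed memory becomes KV cache, which directly raises concurrent request capacity.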

Q: Can vLLM serve multiple models simultaneously?

Not natively from a single process — each vLLM instance serves one model. For multi-model serving, you run multiple vLLM instances (one per model) and route requests at the load balancer level, or use a gateway like LiteLLM that sits in front of multiple vLLM instances. This is the pattern used by teams that need to serve a mix of models without paying for a separate managed API per model.
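The gateway's core job reduces to a routing table keyed by the request's model field. A toy sketch (the backend URLs are hypothetical placeholders for your own vLLM deployments, not real endpoints):

```python
# Toy gateway routing table: one vLLM instance per model.
# URLs are illustrative placeholders for your own deployments.
BACKENDS = {
    "meta-llama/Meta-Llama-3.1-8B-Instruct": "http://vllm-llama8b:8000",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://vllm-mistral7b:8000",
}

def route(payload: dict) -> str:
    """Pick the backend for an OpenAI-style request by its 'model' field."""
    model = payload.get("model")
    if model not in BACKENDS:
        raise ValueError(f"unknown model: {model}")
    return BACKENDS[model]

print(route({"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "hi"}))
# http://vllm-mistral7b:8000
```

Gateways like LiteLLM add retries, fallbacks, and usage accounting on top of this, but model-name dispatch is the essential mechanism.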

Q: What hardware does vLLM support?

NVIDIA GPUs (Ampere and newer recommended, Pascal supported with limitations), AMD ROCm GPUs, Intel Gaudi HPUs, AWS Inferentia 2, and Google Cloud TPUs via XLA. The core PagedAttention and continuous batching logic is hardware-agnostic; the low-level kernels have hardware-specific implementations. NVIDIA remains the highest-performance and best-supported target.

Q: How do I handle model warmup for Kubernetes readiness?

vLLM automatically runs warmup inference passes after model loading to capture CUDA graphs. The readiness probe's initialDelaySeconds must be long enough to cover both model loading and warmup. For larger models (70B+), budget 180–300 seconds. You can monitor the startup log for the message INFO: Application startup complete. to calibrate this for your specific model and storage speed.

Key Takeaways

  • vLLM v0.17.1 (Model Runner V2) is the recommended production version as of April 2026, delivering up to 56% higher throughput on GB200 hardware versus v0 architecture
  • PagedAttention + continuous batching are still the core value proposition; MRV2 builds on top of them with GPU-native execution and async scheduling
  • Docker deployment takes under 10 minutes for a single GPU; Kubernetes with Helm adds autoscaling and monitoring for production scale
  • Monitor four metrics: TTFT (P99 < 5s), inter-token latency (P99 < 200ms), KV cache utilization (< 90%), and queue depth (< 10 sustained)
  • TGI entered maintenance mode in December 2025 — if you are running TGI in production, plan a migration to vLLM or SGLang
  • For multi-model serving, combine vLLM instances with a gateway layer (LiteLLM, custom load balancer) — vLLM itself serves one model per process

Bottom Line

vLLM is the de facto standard for production open-weight LLM inference in 2026. The v0.17.1 release with Model Runner V2 delivers 56% better throughput on modern hardware via GPU-native Triton kernels and true async scheduling. If you are evaluating inference engines for a new deployment, start with vLLM — it has the broadest hardware support, the largest community, and the most mature production tooling. Switch to SGLang if your workload is heavily multi-turn with shared prefixes; switch to TensorRT-LLM if you need absolute maximum throughput on NVIDIA hardware and can absorb the operational complexity.
