ARTICLES ·2026-05-21 ·BY EFFLOOW CONTENT FACTORY

Modal Labs: Serverless GPU for vLLM with No YAML

Deploy open-source LLMs via vLLM on Modal Labs with Python decorators, GPU snapshotting, and $30 free credits — no Kubernetes, no YAML.

modal-labs vllm serverless-gpu llm-inference ai-infrastructure open-source-llm python

Modal Labs: Serverless GPU for vLLM with No YAML

Running open-source LLMs in production has a dirty secret: the inference infrastructure is often harder than the model itself. Kubernetes manifests, GPU node pools, autoscaling policies, health checks — by the time you've configured all that, you could have shipped three features instead. Modal Labs takes a different position. Write a Python function, tag it with a decorator, and your model is live on a GPU. No YAML. No Kubernetes. No Dockerfile committed to a repo.

This guide walks through what Modal actually offers for LLM serving in 2026, how it fits with vLLM, how GPU snapshotting changes the cold-start calculus, and when Modal makes more financial sense than a reserved instance.

What Is Modal and Why Does It Matter for LLM Serving

Modal is a serverless compute platform built specifically for Python ML workloads. Unlike general-purpose serverless platforms (AWS Lambda, Google Cloud Functions), Modal was designed from day one to support GPU-accelerated containers. The core DX idea is that your infrastructure lives inside your Python file, not in a separate YAML stack.

Instead of writing a Kubernetes Deployment manifest and a Service object and a HorizontalPodAutoscaler, you write this:

import modal

app = modal.App("my-llm-api")

@app.cls(gpu="A10G", container_idle_timeout=300)
class Inference:
    @modal.enter()
    def load(self):
        from vllm import LLM
        self.engine = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
    
    @modal.method()
    def generate(self, prompt: str) -> str:
        outputs = self.engine.generate([prompt])
        return outputs[0].outputs[0].text

That class is your entire deployment unit. Modal handles the container build, GPU allocation, autoscaling from zero, and teardown when traffic drops. For individual developers or small teams without dedicated infra engineers, this removes weeks of configuration work.

Modal is not new — it launched in 2021 — but its 2025/2026 feature additions (GPU snapshotting, the @modal.web_server pattern for OpenAI-compatible endpoints, and expanded H100 availability) have made it notably more competitive for production LLM serving.

The vLLM Connection

vLLM is the dominant open-source LLM inference engine as of 2026. Its continuous batching, PagedAttention memory management, and OpenAI-compatible REST API make it the default choice for anyone self-hosting models like Llama 3, Qwen 2.5, Mistral, or Gemma 4.

Modal and vLLM fit together naturally. Modal supplies the GPU container and the scaling logic. vLLM runs inside that container and handles the actual inference. The combination gives you:

OpenAI-compatible API — any client already using the OpenAI SDK can point at your Modal endpoint with a base URL change
Autoscaling from zero — Modal scales containers up when traffic arrives and scales back to zero when idle, so you pay only for actual inference time
Latest vLLM builds — you pin the vLLM version in your Modal image definition; no waiting for a managed platform to update their runtime

A basic vLLM + Modal setup looks like this:

import modal

app = modal.App("vllm-openai-server")

vllm_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "vllm==0.8.0",
        "huggingface_hub[hf_transfer]",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = "/models"

@app.cls(
    gpu="A10G",
    image=vllm_image,
    volumes={MODEL_DIR: modal.Volume.from_name("llm-weights", create_if_missing=True)},
    container_idle_timeout=300,
    allow_concurrent_inputs=100,
)
class VLLMServer:
    @modal.enter()
    def load_model(self):
        from vllm import LLM, SamplingParams
        self.llm = LLM(
            model=MODEL_ID,
            download_dir=MODEL_DIR,
            dtype="auto",
            max_model_len=8192,
        )
        self.sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

    @modal.method()
    def complete(self, prompt: str) -> str:
        outputs = self.llm.generate([prompt], self.sampling_params)
        return outputs[0].outputs[0].text

    @modal.web_server(8000, startup_timeout=300)
    def serve(self):
        import subprocess
        subprocess.Popen([
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL_ID,
            "--served-model-name", "llama-3",
            "--host", "0.0.0.0",
            "--port", "8000",
        ])

Deploy with a single command:

modal deploy vllm_server.py

Modal outputs a public HTTPS endpoint. Your existing OpenAI client hits it with base_url="https://your-workspace--vllm-openai-server-vllmserver-serve.modal.run/v1".

GPU Snapshotting: What Changed in 2025–2026

Cold start latency has historically been the main argument against serverless GPU inference. Loading a 7B parameter model into GPU memory can take 60–120 seconds. That's tolerable for batch jobs, but it is a poor user experience for latency-sensitive API calls.

Modal's GPU memory snapshots change this equation. The feature serializes the full GPU memory state to disk the first time a container reaches a ready state. Subsequent container starts deserialize that snapshot instead of reloading weights from disk — which is significantly faster.

Modal's own benchmarks (from their official blog post on Mistral 3 deployment) show:

Ministral 3B (3B parameters): cold start ~118s → ~12s (approximately 10x improvement)
Qwen2.5-0.5B-Instruct with vLLM: cold start ~45s → ~5s

These figures come from Modal's own published benchmarks, not independent third-party testing.

Enabling snapshots requires two additions:

@app.cls(
    gpu="A10G",
    image=vllm_image,
    experimental_options={"enable_gpu_snapshot": True},  # Enable snapshot
)
class VLLMServer:
    @modal.enter(snap=True)  # Mark this method as snapshotable
    def load_model(self):
        from vllm import LLM
        self.llm = LLM(
            model=MODEL_ID,
            enable_sleep_mode=True,  # Required for GPU snapshotting
        )

The enable_sleep_mode=True parameter in vLLM shifts most GPU memory to CPU memory during idle periods, which enables the snapshot mechanism. It is currently in experimental status in both Modal and vLLM, so expect API changes.

Understanding Modal's Pricing Model

Modal bills per second of actual compute usage. There is no minimum reservation or idle charge (if your container scales to zero, you pay nothing). The current pricing for key GPU types as listed on modal.com/pricing:

GPU Type	Per-Second Rate	Hourly Equivalent	Best For
NVIDIA H100 SXM	$0.001097/sec	~$3.95/hr	Large models (70B+), max throughput
A100 80GB	$0.000694/sec	~$2.50/hr	13B–70B models, production serving
A100 40GB	$0.000583/sec	~$2.10/hr	7B–13B models
A10G	$0.000306/sec	~$1.10/hr	3B–7B models, cost-optimized
T4	~$0.000164/sec	~$0.59/hr	Development, small models

Rates are subject to change; verify current prices at modal.com/pricing before making budget decisions.

The Starter plan gives every account $30/month in free compute credits with no monthly fee. At A10G rates, that covers roughly 27 hours of GPU time per month — enough to run a 3B model endpoint for a small production workload at modest traffic levels.

Cost Comparison: Modal vs. Reserved Instance

For teams currently running a reserved g5.xlarge on AWS (which uses an A10G), the math depends entirely on utilization:

A g5.xlarge reserved (1-year, no upfront) runs approximately $0.76/hr in us-east-1 (spot pricing lower but less reliable)
Modal A10G on-demand: ~$1.10/hr when containers are running

At 100% utilization, a reserved instance beats Modal. But most LLM endpoints don't see 100% utilization. At 30–40% utilization, Modal's scale-to-zero behavior makes it cost-competitive or cheaper. The crossover point is typically around 50–60% sustained GPU utilization.

Practical Pattern: Persistent Model Volume

One friction point with serverless GPU is re-downloading model weights on every cold start. Modal Volumes solve this — they are persistent network-attached storage that survives container teardowns.

model_volume = modal.Volume.from_name("llm-weights-v1", create_if_missing=True)
MODEL_CACHE = "/vol/models"

@app.function(
    image=vllm_image,
    volumes={MODEL_CACHE: model_volume},
    gpu="A10G",
    timeout=1800,
)
def download_model():
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="meta-llama/Llama-3.2-3B-Instruct",
        local_dir=MODEL_CACHE,
        ignore_patterns=["*.pt", "original/*"],
    )
    model_volume.commit()

# In your inference class, mount the same volume
@app.cls(
    gpu="A10G",
    volumes={MODEL_CACHE: model_volume},
)
class InferenceWithCache:
    ...

Run modal run download_script.py::download_model once to populate the volume. After that, all inference containers read from the volume instead of pulling from Hugging Face, which brings cold start times down significantly even without GPU snapshotting.

Concurrency and Scaling

Modal's allow_concurrent_inputs parameter controls how many requests a single container instance handles at once. For vLLM, which already handles batching internally, setting this to 100–200 lets vLLM's continuous batching operate efficiently across multiple simultaneous requests:

@app.cls(
    gpu="A10G",
    allow_concurrent_inputs=100,
    concurrency_limit=5,     # Max 5 containers (5 A10Gs) at once
    container_idle_timeout=300,
)
class ScaledInference:
    ...

concurrency_limit acts as your cost ceiling — setting it to 5 means you will never accidentally spin up 50 GPUs during a traffic spike. Useful for development. Remove or raise it for production.

When to Use Modal (and When Not To)

Strengths

Zero infrastructure files — no Dockerfile, no YAML, no Helm chart
Scale to zero billing — ideal for endpoints with variable or bursty traffic
GPU snapshotting cuts repeated cold starts from minutes to seconds
$30/month free tier covers real testing and light production use
Works with any Python ML library without platform-specific constraints
Persistent volumes solve the weight re-download problem

Limitations

First cold start (before snapshot exists) still takes 45–120s depending on model size
GPU snapshotting is experimental; API stability not guaranteed
At sustained high utilization (60%+), reserved instances beat Modal on cost
No multi-GPU tensor parallelism on the free tier; H100 multi-GPU setups require contacting Modal
Vendor dependency — your deployment logic lives in Modal's SDK and platform

Modal fits best when:

You're a solo developer or small team without dedicated infra engineers
Your LLM endpoint sees variable traffic (nights/weekends it mostly sits idle)
You want to prototype and deploy quickly without setting up a Kubernetes cluster
You're serving models up to ~13B parameters on A100 or A10G

Modal is a harder sell when:

You need multi-GPU serving for 70B+ models at production scale (Modal can do this but setup is more involved)
Your traffic is predictable and high — reserved instances or managed inference platforms (Together AI, Fireworks AI) will be cheaper
Your team already manages Kubernetes and has the infra expertise to operate it efficiently

FAQ

Q: Do I need a credit card to try Modal?

Yes, a credit card is required to activate the free Starter plan. The $30 monthly credit is applied automatically, and you won't be charged until you exceed the credit.

Q: Can I use any open-source model from Hugging Face?

Any model your GPU has VRAM to load. Llama 3.2 3B, Qwen 2.5 7B, Mistral 7B, and Gemma 4 all run comfortably on an A10G (24GB VRAM). For larger models (34B+), use A100 or H100.

Q: How does Modal handle Hugging Face gated models?

Pass your HF token as a Modal Secret:

@app.cls(
    gpu="A10G",
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
class GatedModelInference:
    ...

Create the secret with modal secret create huggingface-secret HF_TOKEN=hf_your_token_here.

Q: Is the OpenAI-compatible endpoint HTTPS?

Yes. Modal provisions a TLS-terminated HTTPS endpoint automatically for every @modal.web_server deployment. No cert management needed.

Q: What happens if I exceed the $30 free credit?

Usage beyond the credit is billed to your card at the standard per-second rates. Modal provides a usage dashboard and you can set spending alerts.

Key Takeaways

Modal's value proposition for LLM serving is clear: it trades marginal cost efficiency at high utilization for significant developer time savings and operational simplicity. If your team is spending engineering hours on Kubernetes config that could be going into product features, and your inference workloads aren't running at 70%+ GPU utilization, Modal is worth a serious evaluation.

The GPU snapshotting feature represents a genuine step forward for serverless LLM infrastructure. Cold start latency was the historical barrier; cutting a 118-second restart to 12 seconds removes it for most interactive use cases. Whether that feature stabilizes out of experimental status will be worth watching through the rest of 2026.

For teams running vLLM already — the integration story is clean. Same vLLM API, same OpenAI-compatible endpoints, same Python you already know. Just less YAML.

Bottom Line

Modal is the fastest path from "I want to serve a 7B model" to a live HTTPS endpoint for teams without dedicated infra. GPU snapshotting has significantly reduced the cold-start disadvantage. The $30 free tier makes it easy to evaluate without commitment.

Need content like this
for your blog?

We run AI-powered technical blogs. Start with a free 3-article pilot.

Learn more →