Self-Hosting LLMs vs Cloud APIs: Cost, Performance & Privacy Compared (2026)
Practical comparison of self-hosting LLMs with Ollama, vLLM, and llama.cpp versus cloud APIs from OpenAI, Anthropic, and Google. Covers cost-per-token modeling, hardware requirements, latency, and when each approach wins.
The question used to be simple: can you even run a useful LLM locally? In 2026, the answer is definitively yes. Open-weight models like Llama 3.3, Qwen 3, DeepSeek R1, and Mistral Large rival proprietary models on many benchmarks. Consumer GPUs have enough VRAM to run 70B-parameter models. Tools like Ollama make local inference as easy as pulling a Docker image.
But "can" and "should" are different questions. Cloud APIs from OpenAI, Anthropic, and Google keep getting cheaper, faster, and more capable. The real decision in 2026 is not about possibility — it is about economics, performance requirements, and privacy constraints.
This guide breaks down the actual numbers. No hand-waving, no vendor hype — just a practical cost-per-token comparison, hardware requirements, and a framework for deciding which approach fits your workload.
If you are building with AI coding tools specifically, our comparison of terminal AI coding agents covers which agents use local vs cloud inference under the hood. For a full pricing breakdown, see our AI coding tools pricing comparison.
How Cloud API Pricing Works in 2026
Cloud LLM providers charge per token — typically quoted per million input tokens and per million output tokens. The spread between models is enormous, so "cloud API pricing" is not one number.
Current Pricing Snapshot
Prices below are per 1 million tokens as listed on official pricing pages. These change frequently — always verify before making infrastructure decisions.
OpenAI (as of April 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1 mini | $0.40 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.40 |
| GPT-5 | $1.25 | $10.00 |
| o3 | $2.00 | $8.00 |
| o3-mini | $1.10 | $4.40 |
| o4-mini | $1.10 | $4.40 |
Anthropic (as of April 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
Both providers offer significant discounts through caching and batching. Anthropic's prompt caching reduces input costs by roughly 90% for repeated context. Their Batch API cuts all token costs by 50% for non-real-time workloads. Combined, these optimizations can reduce effective costs by up to 95% for the right use cases.
The takeaway: If you are making occasional API calls or processing under 2 million tokens per day, cloud APIs are almost certainly cheaper than any self-hosted setup. You pay nothing when idle.
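Under standard pay-as-you-go pricing, that estimate is simple arithmetic: tokens times rate, split by direction. A minimal sketch in Python, with two of the prices above hardcoded (verify them against the official pricing pages before relying on the output):

```python
# Rough monthly cloud API cost from daily token volume.
# Prices are per 1M tokens, copied from the tables above -- verify before deciding anything.
PRICES = {
    "gpt-4.1-mini": (0.40, 1.60),     # (input, output) USD per 1M tokens
    "claude-haiku-4.5": (1.00, 5.00),
}

def monthly_cost(model: str, input_tokens_per_day: float, output_tokens_per_day: float) -> float:
    """Estimate monthly spend in USD, assuming 30 days and no caching or batching discounts."""
    inp, out = PRICES[model]
    daily = (input_tokens_per_day / 1e6) * inp + (output_tokens_per_day / 1e6) * out
    return daily * 30

# Example: 1.5M input + 0.5M output tokens per day on GPT-4.1 mini
print(round(monthly_cost("gpt-4.1-mini", 1_500_000, 500_000), 2))  # -> 42.0
```

Caching and batch discounts can cut these figures substantially, so treat the output as an upper bound for steady workloads.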
The True Cost of Self-Hosting
Self-hosting sounds free after you buy the hardware. It is not. The real cost includes hardware amortization, electricity, cooling, maintenance time, and opportunity cost. Here is what the numbers actually look like.
Hardware Requirements by Model Size
| Model Size | Minimum VRAM (Q4 Quantized) | Recommended GPU | Approximate GPU Cost |
|---|---|---|---|
| 7–8B parameters | 6 GB | RTX 4060 Ti (16 GB) | $400–$500 |
| 13B parameters | 10 GB | RTX 4070 Ti (16 GB) | $700–$800 |
| 30–34B parameters | 20 GB | RTX 4090 (24 GB) | $1,600–$2,000 |
| 70B parameters | 40 GB | 2× RTX 4090 or RTX 5090 (32 GB) | $2,000–$4,000 |
| 70B parameters (FP16) | 140 GB | 2× A100 (80 GB) | $20,000+ |
| 405B parameters | 200+ GB | 4× A100 or 8× RTX 4090 | $50,000+ |
The RTX 5090, released in January 2025, adds 32 GB of GDDR7 memory with 1.79 TB/s bandwidth — a meaningful upgrade over the RTX 4090's 24 GB and 1.01 TB/s. It delivers roughly 213 tokens/second on 8B models (67% faster than the RTX 4090's 128 tok/s). But at $2,000–$3,800 with limited availability, it is not always easy to buy.
Running Costs
Electricity is often underestimated. A single RTX 4090 under sustained inference load draws 350–450W. At the US average of $0.16/kWh:
- 24/7 operation: ~$40–$55 per month per GPU
- 8 hours/day operation: ~$13–$18 per month per GPU
Add $5–$15/month for cooling overhead in most setups. For a dual-GPU rig running 70B models around the clock, expect $90–$130/month in electricity alone.
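The conversion from sustained wattage to monthly dollars is worth writing down once. A small sketch, assuming roughly 30 days per month and the $0.16/kWh US average used above:

```python
def monthly_electricity_usd(watts: float, hours_per_day: float, usd_per_kwh: float = 0.16) -> float:
    """Monthly electricity cost: convert watts to kWh over ~30 days, multiply by the rate."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# One RTX 4090 at 450W sustained, running 24/7, at the US average rate
print(round(monthly_electricity_usd(450, 24), 2))  # -> 51.84
```

Plug in your local utility rate; at European prices of $0.30/kWh or more, the electricity line item roughly doubles.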
Amortized Cost Per Token
Here is where the math gets interesting. Assuming a 3-year hardware amortization:
| Setup | Hardware Cost | Monthly Amortization | Monthly Electricity | Throughput | Effective Cost per 1M Tokens |
|---|---|---|---|---|---|
| RTX 4090 + 8B model (Q4) | $1,800 | $50 | $45 | ~330M tokens/day | ~$0.009 |
| RTX 5090 + 30B model (Q4) | $3,000 | $83 | $55 | ~160M tokens/day | ~$0.028 |
| 2× RTX 4090 + 70B model (Q4) | $3,600 | $100 | $90 | ~80M tokens/day | ~$0.079 |
Compare that to GPT-4.1 mini at $0.40/$1.60 per million tokens, or Claude Haiku 4.5 at $1.00/$5.00. On raw per-token cost at scale, self-hosting wins — but only if your GPUs are actually busy.
The hidden cost of idle hardware: If your workload averages 1 million tokens per day on a rig capable of 300 million, you are paying $95/month in fixed costs for a workload that would cost $1–$5/month on a cloud API.
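The effective-cost column above is just fixed monthly costs spread across monthly token volume. A sketch of that arithmetic, reusing the RTX 4090 row's assumptions (3-year amortization, the table's throughput figure):

```python
def cost_per_1m_tokens(hardware_usd: float, electricity_usd_month: float,
                       tokens_per_day: float, amortization_months: int = 36) -> float:
    """Effective self-hosted cost per 1M tokens: (amortization + power) / monthly volume."""
    monthly_fixed = hardware_usd / amortization_months + electricity_usd_month
    monthly_tokens_m = tokens_per_day * 30 / 1e6   # millions of tokens per month
    return monthly_fixed / monthly_tokens_m

# RTX 4090 row: $1,800 hardware, $45/month power, ~330M tokens/day at full utilization
print(round(cost_per_1m_tokens(1_800, 45, 330e6), 4))  # -> 0.0096, in line with the table's ~$0.009
```

The same function exposes the idle-hardware trap: drop `tokens_per_day` to 1M and the effective cost jumps above $3 per million tokens, worse than most cloud rates.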
Self-Hosting Tools: Ollama vs vLLM vs llama.cpp
Three tools dominate the self-hosting landscape in 2026. Each targets a different use case.
Ollama — The Developer's Default
Ollama wraps llama.cpp in a Go-based server with a Docker-like experience. One command pulls and runs models with an OpenAI-compatible API endpoint.
Best for: Development, prototyping, personal use, privacy-focused workflows, air-gapped environments. For a complete setup walkthrough, see our Ollama + Open WebUI self-hosting guide.
Strengths:
- Zero-configuration setup — `ollama run llama3.3` and you are running
- Automatic quantization and GPU detection
- OpenAI-compatible API (drop-in replacement in most SDKs)
- The 0.17 series (early 2026) added cloud model offloading, web search, multimodal support, streaming tool calls, and thinking models
- Native tool calling support for external API integration
Limitations:
- Caps at ~4 concurrent requests by default
- ~62 tok/s on Llama 3.1 8B (Q4_K_M quantization)
- Not designed for production-scale multi-user serving
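Because Ollama serves an OpenAI-compatible endpoint at `http://localhost:11434/v1` by default, you can exercise it with nothing but the standard library. A sketch (the model name and prompt are illustrative, and the network call is guarded so the script degrades gracefully when no local server is running):

```python
import json
import urllib.error
import urllib.request

# A chat completion request in the OpenAI wire format that Ollama accepts.
payload = {
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
except (urllib.error.URLError, OSError):
    print("Ollama is not running locally; start it with `ollama serve` first.")
```

In practice you would use the official OpenAI SDK instead and simply point `base_url` at `http://localhost:11434/v1` (the API key is required by the SDK but ignored by Ollama), which is what makes it a drop-in replacement.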
vLLM — The Production Choice
vLLM's PagedAttention engine manages GPU memory like an operating system manages RAM — paging model weights in and out to maximize throughput under concurrent load.
Best for: Production APIs, multi-user serving, high-throughput batch processing.
Strengths:
- Continuous batching aggregates concurrent requests into unified GPU operations
- ~485 total tok/s across 10 concurrent requests on Llama 3.1 8B — 16× more throughput than Ollama under load
- Over 35× the request throughput (RPS) compared to llama.cpp at peak load
- Speculative decoding support for faster generation
- OpenAI-compatible serving endpoint
Limitations:
- More complex setup than Ollama
- Requires NVIDIA GPUs (no CPU-only mode)
- Higher minimum memory overhead
llama.cpp — The Embedded Option
The pure C/C++ inference engine with no external dependencies. It runs everywhere — from data center GPUs to Android phones.
Best for: Embedded applications, edge deployment, mobile devices, maximum hardware compatibility.
Strengths:
- Runs on CPU, Apple Silicon, NVIDIA, AMD, and mobile devices
- Smallest footprint of the three
- 4-bit quantization enables Llama 3.2 3B on standard Android devices
- Direct library embedding into native applications
Limitations:
- No built-in serving layer (you add your own HTTP server)
- Lower throughput than vLLM under concurrent load
- More manual configuration required
Quick Decision Matrix
| Scenario | Recommended Tool |
|---|---|
| Developer testing locally | Ollama |
| Team staging server | llama.cpp or Ollama |
| Production user-facing API | vLLM |
| Mobile/embedded deployment | llama.cpp |
| Privacy-first personal assistant | Ollama |
For context on how these tools connect to the broader AI tooling ecosystem, see our guide to MCP (Model Context Protocol) — the emerging standard for how AI tools integrate with external services. You may also want to compare Docker Model Runner vs Ollama for container-based local AI deployment.
Performance: Latency, Throughput, and Quality
Latency
Cloud APIs add network round-trip time — typically 50–200ms before the first token, depending on your location and the provider's infrastructure. Self-hosted models on local hardware start generating in 10–50ms.
For interactive applications where time-to-first-token matters (chat interfaces, code completion, real-time suggestions), local inference has a structural advantage. For batch processing or async workloads, the latency difference is irrelevant.
Throughput
Single-user throughput is comparable. A self-hosted 8B model on an RTX 4090 generates ~128 tokens/second — faster than most cloud API streaming responses.
Multi-user throughput is where cloud APIs pull ahead. OpenAI and Anthropic run inference on massive GPU clusters with load balancing, auto-scaling, and request queuing. Replicating that with self-hosted infrastructure requires significant engineering investment in vLLM configuration, load balancing, and GPU fleet management.
Model Quality
This is the elephant in the room. The best open-weight models in 2026 (Llama 3.3 405B, DeepSeek R1, Qwen 3 235B) are competitive with GPT-4.1 and Claude Sonnet on many benchmarks. But frontier models — GPT-5, Claude Opus 4.6, Gemini Ultra — still lead on complex reasoning, long-context tasks, and instruction following.
If your use case requires frontier-level intelligence, cloud APIs are your only option. If a 70B or 8B model handles your workload well, self-hosting becomes viable.
Privacy and Compliance
This is often the strongest argument for self-hosting — and sometimes the only argument that matters.
When Self-Hosting Is Required
- Regulated industries: Healthcare (HIPAA), finance (SOX, PCI-DSS), and government (FedRAMP) may require that patient data, financial records, or classified information never leave your infrastructure.
- Data residency: Some jurisdictions require data processing to occur within national borders. Self-hosting on local infrastructure guarantees compliance.
- Air-gapped environments: Military, critical infrastructure, and some enterprise environments operate without internet access. Cloud APIs are not an option.
- Competitive sensitivity: If your prompts contain proprietary algorithms, trade secrets, or competitive intelligence, sending them to a third-party API introduces risk — even with data processing agreements.
Cloud API Privacy Guarantees
Both OpenAI and Anthropic now offer enterprise tiers with zero-data-retention policies, SOC 2 Type II compliance, and contractual guarantees that API inputs are not used for training. For many organizations, these guarantees are sufficient.
The risk calculus comes down to one question: do you trust a contractual guarantee, or do you need physical control? The answer depends on your threat model, not on technology.
Our article on what vibe coding is explores how AI-first development workflows handle the security implications of AI-generated code — a related concern when deciding where your AI inference runs.
The Hybrid Approach: Best of Both Worlds
The most cost-effective architecture in 2026 for many teams is hybrid: self-host for predictable baseline load, route to cloud APIs for overflow and frontier model access.
How It Works
- Baseline traffic (predictable, high-volume, latency-sensitive) routes to self-hosted models via Ollama or vLLM.
- Overflow traffic (demand spikes beyond local GPU capacity) routes to cloud APIs automatically.
- Frontier model requests (tasks requiring GPT-5 or Claude Opus-level reasoning) always route to cloud APIs.
- Privacy-sensitive requests (containing PII, regulated data, or trade secrets) always route to self-hosted models.
Implementation Pattern
An OpenAI-compatible router sits in front of both your local vLLM instance and cloud API endpoints. Since Ollama, vLLM, and most cloud providers all expose OpenAI-compatible APIs, your application code does not need to change — the router handles model selection based on request metadata, load, and policy rules.
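A toy version of such a policy layer makes the rule ordering concrete. The backend names and capacity threshold here are hypothetical; a real router would also handle retries, streaming, and per-model token accounting:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    privacy_sensitive: bool = False   # contains PII, regulated data, or trade secrets
    needs_frontier: bool = False      # requires GPT-5 / Claude Opus-level reasoning

def pick_backend(req: Request, local_queue_depth: int, local_capacity: int = 8) -> str:
    """Route one request; backend names ("local-vllm", "cloud-api") are illustrative."""
    if req.privacy_sensitive:
        return "local-vllm"           # policy rule: never leaves your infrastructure
    if req.needs_frontier:
        return "cloud-api"            # frontier models are cloud-only
    if local_queue_depth >= local_capacity:
        return "cloud-api"            # overflow: spill demand spikes to the cloud
    return "local-vllm"               # baseline traffic stays on owned hardware

print(pick_backend(Request("redact this contract", privacy_sensitive=True), local_queue_depth=99))
# -> local-vllm: the privacy rule wins even when local capacity is exhausted
```

Note the ordering: privacy constraints are evaluated before capacity, so a traffic spike can never push regulated data to a third-party API.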
One fintech company reported cutting monthly AI spend from $47,000 to $8,000 (83% reduction) by moving predictable workloads to self-hosted infrastructure while keeping frontier model access through cloud APIs.
Decision Framework: When to Self-Host vs Use Cloud APIs
Use Cloud APIs When:
- Your token volume is under 2 million per day
- You need frontier model capabilities (GPT-5, Claude Opus)
- Your team lacks GPU infrastructure expertise
- Your workload is bursty and unpredictable
- You want to iterate on models without hardware commitments
- Time-to-production matters more than per-token cost
Self-Host When:
- Your token volume exceeds 10 million per day consistently
- Privacy or compliance requirements mandate on-premise processing
- Latency to first token is critical (sub-50ms)
- You run an air-gapped or restricted network environment
- A 7B–70B open-weight model meets your quality requirements
- You have (or can hire) infrastructure expertise to maintain the setup
Go Hybrid When:
- You process 3–10 million tokens per day with variable load
- You need both frontier models and privacy guarantees
- You want cost optimization without sacrificing capability
- You are building a product that serves multiple use cases with different requirements
Break-Even Analysis
The break-even point depends heavily on which cloud model you are replacing and which local model you are running. Here are representative scenarios:
| Scenario | Daily Volume | Cloud Cost/Month | Self-Host Cost/Month | Break-Even |
|---|---|---|---|---|
| Replace GPT-4.1 mini with local 8B | 5M tokens/day | ~$300 | ~$95 (RTX 4090) | Month 8 |
| Replace Claude Haiku with local 8B | 5M tokens/day | ~$450 | ~$95 (RTX 4090) | Month 5 |
| Replace GPT-4.1 with local 70B | 10M tokens/day | ~$1,500 | ~$190 (2× RTX 4090) | Month 3 |
| Replace Claude Sonnet with local 70B | 10M tokens/day | ~$2,700 | ~$190 (2× RTX 4090) | Month 2 |
These figures assume sustained daily volumes and standard (non-cached, non-batched) cloud API pricing. Real-world costs vary based on input/output token ratios, caching utilization, and hardware availability.
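The break-even column is simply hardware cost divided by monthly savings. A sketch using the first row's figures:

```python
def break_even_months(hardware_usd: float, cloud_usd_month: float,
                      selfhost_usd_month: float) -> float:
    """Months until cumulative savings cover the hardware purchase."""
    savings = cloud_usd_month - selfhost_usd_month
    if savings <= 0:
        raise ValueError("self-hosting never breaks even at these running costs")
    return hardware_usd / savings

# First row: $1,800 RTX 4090, ~$300/month of GPT-4.1 mini spend, ~$95/month running cost
print(round(break_even_months(1_800, 300, 95), 1))  # -> 8.8, i.e. the table's Month 8
```

Rerun it with your own measured volumes before buying anything; the result is acutely sensitive to the cloud model you are actually replacing.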
For medium-usage teams processing 3–5 million tokens per day, local deployment on consumer hardware typically reaches break-even against cloud API pricing in roughly 3–8 months, depending on which cloud model you are replacing. If you are also considering self-hosting your entire development stack, see our guide to self-hosting your dev stack for under $20/month.
Getting Started: Practical First Steps
If you are exploring self-hosting for the first time:
- Install Ollama (`curl -fsSL https://ollama.com/install.sh | sh`) and run a small model (`ollama run llama3.2`).
- Test it against your actual workload — not benchmarks, your real prompts.
- Compare output quality honestly against the cloud model you are considering replacing.
- If quality is acceptable, measure your actual daily token volume for a week.
- Run the break-even math with your numbers.
If you are scaling to production:
- Deploy vLLM with your chosen model on GPU infrastructure.
- Set up an OpenAI-compatible routing layer.
- Start with cloud-only, then gradually shift predictable traffic to self-hosted.
- Monitor quality, latency, and cost continuously.
If you are using AI tools in your development workflow, our Claude Code advanced workflow guide covers practical workflows for AI-assisted coding — whether the models run locally or in the cloud. For GPU server hosting, see our Hetzner Cloud AI GPU server guide.
The Bottom Line
Self-hosting LLMs in 2026 is mature, practical, and cost-effective — at the right scale. The tools are production-ready. The models are capable. The hardware is accessible.
But cloud APIs are also better and cheaper than ever. The 4.5/4.6 generation of Claude models represents a 67% cost reduction over previous generations. OpenAI's GPT-4.1 nano costs $0.10 per million input tokens. At low to moderate volumes, cloud APIs win on total cost of ownership because you never pay for idle hardware.
The right answer for most teams in 2026 is not "self-host everything" or "use cloud for everything." It is a deliberate hybrid architecture that routes each request to the most cost-effective and appropriate inference backend based on volume, quality requirements, privacy constraints, and latency needs.
Start with cloud APIs. Measure your actual usage. When the numbers justify it — and only then — add self-hosted infrastructure for your predictable, high-volume workloads. Keep cloud access for frontier models and demand spikes. That is the architecture that optimizes for cost, capability, and flexibility simultaneously.