Self-Hosting LLMs vs Cloud APIs: Cost, Performance & Privacy Compared (2026)
Practical comparison of self-hosting LLMs with Ollama, vLLM, and llama.cpp versus cloud APIs from OpenAI, Anthropic, and Google. Covers cost-per-token modeling, hardware requirements, latency, and when each approach wins.
The question used to be simple: can you even run a useful LLM locally? In 2026, the answer is definitively yes. Open-weight models like Llama 3.3, Qwen 3, DeepSeek R1, and Mistral Large rival proprietary models on many benchmarks. Consumer GPUs have enough VRAM to run 70B-parameter models. Tools like Ollama make local inference as easy as pulling a Docker image.
But "can" and "should" are different questions. Cloud APIs from OpenAI, Anthropic, and Google keep getting cheaper, faster, and more capable. The real decision in 2026 is not about possibility — it is about economics, performance requirements, and privacy constraints.
This guide breaks down the actual numbers. No hand-waving, no vendor hype — just a practical cost-per-token comparison, hardware requirements, and a framework for deciding which approach fits your workload.
If you are building with AI coding tools specifically, our comparison of terminal AI coding agents covers which agents use local vs cloud inference under the hood. For a full pricing breakdown, see our AI coding tools pricing comparison.
How Cloud API Pricing Works in 2026
Cloud LLM providers charge per token — typically quoted per million input tokens and per million output tokens. The spread between models is enormous, so "cloud API pricing" is not one number.
Current Pricing Snapshot
Prices below are per 1 million tokens as listed on official pricing pages. These change frequently — always verify before making infrastructure decisions.
OpenAI (as of April 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1 mini | $0.40 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.40 |
| GPT-5 | $1.25 | $10.00 |
| o3 | $2.00 | $8.00 |
| o3-mini | $1.10 | $4.40 |
| o4-mini | $1.10 | $4.40 |
Anthropic (as of April 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
Both providers offer significant discounts through caching and batching. Anthropic's prompt caching reduces input costs by roughly 90% for repeated context. Their Batch API cuts all token costs by 50% for non-real-time workloads. Combined, these optimizations can reduce effective costs by up to 95% for the right use cases.
The takeaway: If you are making occasional API calls or processing under 2 million tokens per day, cloud APIs are almost certainly cheaper than any self-hosted setup. You pay nothing when idle.
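Under standard pay-as-you-go pricing, that estimate is simple arithmetic: tokens times rate, split by direction. A minimal sketch in Python, with two of the prices above hardcoded (verify them against the official pricing pages before relying on the output):

```python
# Rough monthly cloud API cost from daily token volume.
# Prices are per 1M tokens, copied from the tables above -- verify before deciding anything.
PRICES = {
    "gpt-4.1-mini": (0.40, 1.60),     # (input, output) USD per 1M tokens
    "claude-haiku-4.5": (1.00, 5.00),
}

def monthly_cost(model: str, input_tokens_per_day: float, output_tokens_per_day: float) -> float:
    """Estimate monthly spend in USD, assuming 30 days and no caching or batching discounts."""
    inp, out = PRICES[model]
    daily = (input_tokens_per_day / 1e6) * inp + (output_tokens_per_day / 1e6) * out
    return daily * 30

# Example: 1.5M input + 0.5M output tokens per day on GPT-4.1 mini
print(round(monthly_cost("gpt-4.1-mini", 1_500_000, 500_000), 2))  # -> 42.0
```

Caching and batch discounts can cut these figures substantially, so treat the output as an upper bound for steady workloads.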
The True Cost of Self-Hosting
Self-hosting sounds free after you buy the hardware. It is not. The real cost includes hardware amortization, electricity, cooling, maintenance time, and opportunity cost. Here is what the numbers actually look like.
Hardware Requirements by Model Size
| Model Size | Minimum VRAM (Q4 Quantized) | Recommended GPU | Approximate GPU Cost |
|---|---|---|---|
| 7–8B parameters | 6 GB | RTX 4060 Ti (16 GB) | $400–$500 |
| 13B parameters | 10 GB | RTX 4070 Ti (16 GB) | $700–$800 |
| 30–34B parameters | 20 GB | RTX 4090 (24 GB) | $1,600–$2,000 |
| 70B parameters | 40 GB | 2× RTX 4090 or RTX 5090 (32 GB) | $2,000–$4,000 |
| 70B parameters (FP16) | 140 GB | 2× A100 (80 GB) | $20,000+ |
| 405B parameters | 200+ GB | 4× A100 or 8× RTX 4090 | $50,000+ |
The RTX 5090, released in January 2025, adds 32 GB of GDDR7 memory with 1.79 TB/s bandwidth — a meaningful upgrade over the RTX 4090's 24 GB and 1.01 TB/s. It delivers roughly 213 tokens/second on 8B models (67% faster than the RTX 4090's 128 tok/s). But at $2,000–$3,800 with limited availability, it is not always easy to buy.
Running Costs
Electricity is often underestimated. A single RTX 4090 under sustained inference load draws 350–450W. At the US average of $0.16/kWh:
- 24/7 operation: ~$40–$55 per month per GPU
- 8 hours/day operation: ~$13–$18 per month per GPU
Add $5–$15/month for cooling overhead in most setups. For a dual-GPU rig running 70B models around the clock, expect $90–$130/month in electricity alone.
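The conversion from sustained wattage to monthly dollars is worth writing down once. A small sketch, assuming roughly 30 days per month and the $0.16/kWh US average used above:

```python
def monthly_electricity_usd(watts: float, hours_per_day: float, usd_per_kwh: float = 0.16) -> float:
    """Monthly electricity cost: convert watts to kWh over ~30 days, multiply by the rate."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# One RTX 4090 at 450W sustained, running 24/7, at the US average rate
print(round(monthly_electricity_usd(450, 24), 2))  # -> 51.84
```

Plug in your local utility rate; at European prices of $0.30/kWh or more, the electricity line item roughly doubles.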
Amortized Cost Per Token
Here is where the math gets interesting. Assuming a 3-year hardware amortization:
| Setup | Hardware Cost | Monthly Amortization | Monthly Electricity | Throughput | Effective Cost per 1M Tokens |
|---|---|---|---|---|---|
| RTX 4090 + 8B model (Q4) | $1,800 | $50 | $45 | ~330M tokens/day | ~$0.009 |
| RTX 5090 + 30B model (Q4) | $3,000 | $83 | $55 | ~160M tokens/day | ~$0.028 |
| 2× RTX 4090 + 70B model (Q4) | $3,600 | $100 | $90 | ~80M tokens/day | ~$0.079 |
Compare that to GPT-4.1 mini at $0.40/$1.60 per million tokens, or Claude Haiku 4.5 at $1.00/$5.00. On raw per-token cost at scale, self-hosting wins — but only if your GPUs are actually busy.
The hidden cost of idle hardware: If your workload averages 1 million tokens per day on a rig capable of 300 million, you are paying $95/month in fixed costs for a workload that would cost $1–$5/month on a cloud API.
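The effective-cost column above is just fixed monthly costs spread across monthly token volume. A sketch of that arithmetic, reusing the RTX 4090 row's assumptions (3-year amortization, the table's throughput figure):

```python
def cost_per_1m_tokens(hardware_usd: float, electricity_usd_month: float,
                       tokens_per_day: float, amortization_months: int = 36) -> float:
    """Effective self-hosted cost per 1M tokens: (amortization + power) / monthly volume."""
    monthly_fixed = hardware_usd / amortization_months + electricity_usd_month
    monthly_tokens_m = tokens_per_day * 30 / 1e6   # millions of tokens per month
    return monthly_fixed / monthly_tokens_m

# RTX 4090 row: $1,800 hardware, $45/month power, ~330M tokens/day at full utilization
print(round(cost_per_1m_tokens(1_800, 45, 330e6), 4))  # -> 0.0096, in line with the table's ~$0.009
```

The same function exposes the idle-hardware trap: drop `tokens_per_day` to 1M and the effective cost jumps above $3 per million tokens, worse than most cloud rates.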
Self-Hosting Tools: Ollama vs vLLM vs llama.cpp
Three tools dominate the self-hosting landscape in 2026. Each targets a different use case.
Ollama — The Developer's Default
Ollama wraps llama.cpp in a Go-based server with a Docker-like experience. One command pulls and runs models with an OpenAI-compatible API endpoint.
Best for: Development, prototyping, personal use, privacy-focused workflows, air-gapped environments. For a complete setup walkthrough, see our Ollama + Open WebUI self-hosting guide.
Strengths:
- Zero-configuration setup — `ollama run llama3.3` and you are running
- Automatic quantization and GPU detection
- OpenAI-compatible API (drop-in replacement in most SDKs)
- The 0.17 series (early 2026) added cloud model offloading, web search, multimodal support, streaming tool calls, and thinking models
- Native tool calling support for external API integration
Limitations:
- Caps at ~4 concurrent requests by default
- ~62 tok/s on Llama 3.1 8B (Q4_K_M quantization)
- Not designed for production-scale multi-user serving
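Because Ollama serves an OpenAI-compatible endpoint at `http://localhost:11434/v1` by default, you can exercise it with nothing but the standard library. A sketch (the model name and prompt are illustrative, and the network call is guarded so the script degrades gracefully when no local server is running):

```python
import json
import urllib.error
import urllib.request

# A chat completion request in the OpenAI wire format that Ollama accepts.
payload = {
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])
except (urllib.error.URLError, OSError):
    print("Ollama is not running locally; start it with `ollama serve` first.")
```

In practice you would use the official OpenAI SDK instead and simply point `base_url` at `http://localhost:11434/v1` (the API key is required by the SDK but ignored by Ollama), which is what makes it a drop-in replacement.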
vLLM — The Production Choice
vLLM's PagedAttention engine manages GPU memory like an operating system manages RAM — paging model weights in and out to maximize throughput under concurrent load.
Best for: Production APIs, multi-user serving, high-throughput batch processing.
Strengths:
- Continuous batching aggregates concurrent requests into unified GPU operations
- ~485 total tok/s across 10 concurrent requests on Llama 3.1 8B — 16× more throughput than Ollama under load
- Over 35× the request throughput (RPS) compared to llama.cpp at peak load
- Speculative decoding support for faster generation
- OpenAI-compatible serving endpoint
Limitations:
- More complex setup than Ollama
- Requires NVIDIA GPUs (no CPU-only mode)
- Higher minimum memory overhead
llama.cpp — The Embedded Option
The pure C/C++ inference engine with no external dependencies. It runs everywhere — from data center GPUs to Android phones.
Best for: Embedded applications, edge deployment, mobile devices, maximum hardware compatibility.
Strengths:
- Runs on CPU, Apple Silicon, NVIDIA, AMD, and mobile devices
- Smallest footprint of the three
- 4-bit quantization enables Llama 3.2 3B on standard Android devices
- Direct library embedding into native applications
Limitations:
- No built-in serving layer (you add your own HTTP server)
- Lower throughput than vLLM under concurrent load
- More manual configuration required
Quick Decision Matrix
| Scenario | Recommended Tool |
|---|---|
| Developer testing locally | Ollama |
| Team staging server | llama.cpp or Ollama |
| Production user-facing API | vLLM |
| Mobile/embedded deployment | llama.cpp |
| Privacy-first personal assistant | Ollama |
For context on how these tools connect to the broader AI tooling ecosystem, see our guide to MCP (Model Context Protocol) — the emerging standard for how AI tools integrate with external services. You may also want to compare Docker Model Runner vs Ollama for container-based local AI deployment.
Performance: Latency, Throughput, and Quality
Latency
Cloud APIs add network round-trip time — typically 50–200ms before the first token, depending on your location and the provider's infrastructure. Self-hosted models on local hardware start generating in 10–50ms.
For interactive applications where time-to-first-token matters (chat interfaces, code completion, real-time suggestions), local inference has a structural advantage. For batch processing or async workloads, the latency difference is irrelevant.
Throughput
Single-user throughput is comparable. A self-hosted 8B model on an RTX 4090 generates ~128 tokens/second — faster than most cloud API streaming responses.
Multi-user throughput is where cloud APIs pull ahead. OpenAI and Anthropic run inference on massive GPU clusters with load balancing, auto-scaling, and request queuing. Replicating that with self-hosted infrastructure requires significant engineering investment in vLLM configuration, load balancing, and GPU fleet management.
Model Quality
This is the elephant in the room. The best open-weight models in 2026 (Llama 3.3 405B, DeepSeek R1, Qwen 3 235B) are competitive with GPT-4.1 and Claude Sonnet on many benchmarks. But frontier models — GPT-5, Claude Opus 4.6, Gemini Ultra — still lead on complex reasoning, long-context tasks, and instruction following.
If your use case requires frontier-level intelligence, cloud APIs are your only option. If a 70B or 8B model handles your workload well, self-hosting becomes viable.
Privacy and Compliance
This is often the strongest argument for self-hosting — and sometimes the only argument that matters.
When Self-Hosting Is Required
- Regulated industries: Healthcare (HIPAA), finance (SOX, PCI-DSS), and government (FedRAMP) may require that patient data, financial records, or classified information never leave your infrastructure.
- Data residency: Some jurisdictions require data processing to occur within national borders. Self-hosting on local infrastructure guarantees compliance.
- Air-gapped environments: Military, critical infrastructure, and some enterprise environments operate without internet access. Cloud APIs are not an option.
- Competitive sensitivity: If your prompts contain proprietary algorithms, trade secrets, or competitive intelligence, sending them to a third-party API introduces risk — even with data processing agreements.
Cloud API Privacy Guarantees
Both OpenAI and Anthropic now offer enterprise tiers with zero-data-retention policies, SOC 2 Type II compliance, and contractual guarantees that API inputs are not used for training. For many organizations, these guarantees are sufficient.
The risk calculus comes down to one question: do you trust a contractual guarantee, or do you need physical control? The answer depends on your threat model, not on technology.
Our article on what vibe coding is explores how AI-first development workflows handle the security implications of AI-generated code — a related concern when deciding where your AI inference runs.
The Hybrid Approach: Best of Both Worlds
The most cost-effective architecture in 2026 for many teams is hybrid: self-host for predictable baseline load, route to cloud APIs for overflow and frontier model access.
How It Works
- Baseline traffic (predictable, high-volume, latency-sensitive) routes to self-hosted models via Ollama or vLLM.
- Overflow traffic (demand spikes beyond local GPU capacity) routes to cloud APIs automatically.
- Frontier model requests (tasks requiring GPT-5 or Claude Opus-level reasoning) always route to cloud APIs.
- Privacy-sensitive requests (containing PII, regulated data, or trade secrets) always route to self-hosted models.
Implementation Pattern
An OpenAI-compatible router sits in front of both your local vLLM instance and cloud API endpoints. Since Ollama, vLLM, and most cloud providers all expose OpenAI-compatible APIs, your application code does not need to change — the router handles model selection based on request metadata, load, and policy rules.
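A toy version of such a policy layer makes the rule ordering concrete. The backend names and capacity threshold here are hypothetical; a real router would also handle retries, streaming, and per-model token accounting:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    privacy_sensitive: bool = False   # contains PII, regulated data, or trade secrets
    needs_frontier: bool = False      # requires GPT-5 / Claude Opus-level reasoning

def pick_backend(req: Request, local_queue_depth: int, local_capacity: int = 8) -> str:
    """Route one request; backend names ("local-vllm", "cloud-api") are illustrative."""
    if req.privacy_sensitive:
        return "local-vllm"           # policy rule: never leaves your infrastructure
    if req.needs_frontier:
        return "cloud-api"            # frontier models are cloud-only
    if local_queue_depth >= local_capacity:
        return "cloud-api"            # overflow: spill demand spikes to the cloud
    return "local-vllm"               # baseline traffic stays on owned hardware

print(pick_backend(Request("redact this contract", privacy_sensitive=True), local_queue_depth=99))
# -> local-vllm: the privacy rule wins even when local capacity is exhausted
```

Note the ordering: privacy constraints are evaluated before capacity, so a traffic spike can never push regulated data to a third-party API.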
One fintech company reported cutting monthly AI spend from $47,000 to $8,000 (83% reduction) by moving predictable workloads to self-hosted infrastructure while keeping frontier model access through cloud APIs.
Decision Framework: When to Self-Host vs Use Cloud APIs
Use Cloud APIs When:
- Your token volume is under 2 million per day
- You need frontier model capabilities (GPT-5, Claude Opus)
- Your team lacks GPU infrastructure expertise
- Your workload is bursty and unpredictable
- You want to iterate on models without hardware commitments
- Time-to-production matters more than per-token cost
Self-Host When:
- Your token volume exceeds 10 million per day consistently
- Privacy or compliance requirements mandate on-premise processing
- Latency to first token is critical (sub-50ms)
- You run an air-gapped or restricted network environment
- A 7B–70B open-weight model meets your quality requirements
- You have (or can hire) infrastructure expertise to maintain the setup
Go Hybrid When:
- You process 3–10 million tokens per day with variable load
- You need both frontier models and privacy guarantees
- You want cost optimization without sacrificing capability
- You are building a product that serves multiple use cases with different requirements
Break-Even Analysis
The break-even point depends heavily on which cloud model you are replacing and which local model you are running. Here are representative scenarios:
| Scenario | Daily Volume | Cloud Cost/Month | Self-Host Cost/Month | Break-Even |
|---|---|---|---|---|
| Replace GPT-4.1 mini with local 8B | 5M tokens/day | ~$300 | ~$95 (RTX 4090) | Month 8 |
| Replace Claude Haiku with local 8B | 5M tokens/day | ~$450 | ~$95 (RTX 4090) | Month 5 |
| Replace GPT-4.1 with local 70B | 10M tokens/day | ~$1,500 | ~$190 (2× RTX 4090) | Month 3 |
| Replace Claude Sonnet with local 70B | 10M tokens/day | ~$2,700 | ~$190 (2× RTX 4090) | Month 2 |
These figures assume sustained daily volumes and standard (non-cached, non-batched) cloud API pricing. Real-world costs vary based on input/output token ratios, caching utilization, and hardware availability.
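The break-even column is simply hardware cost divided by monthly savings. A sketch using the first row's figures:

```python
def break_even_months(hardware_usd: float, cloud_usd_month: float,
                      selfhost_usd_month: float) -> float:
    """Months until cumulative savings cover the hardware purchase."""
    savings = cloud_usd_month - selfhost_usd_month
    if savings <= 0:
        raise ValueError("self-hosting never breaks even at these running costs")
    return hardware_usd / savings

# First row: $1,800 RTX 4090, ~$300/month of GPT-4.1 mini spend, ~$95/month running cost
print(round(break_even_months(1_800, 300, 95), 1))  # -> 8.8, i.e. the table's Month 8
```

Rerun it with your own measured volumes before buying anything; the result is acutely sensitive to the cloud model you are actually replacing.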
For medium-usage teams processing 3–5 million tokens per day, local deployment on consumer hardware typically reaches break-even against cloud API pricing in roughly 3–8 months, depending on which cloud model you are replacing. If you are also considering self-hosting your entire development stack, see our guide to self-hosting your dev stack for under $20/month.
Getting Started: Practical First Steps
If you are exploring self-hosting for the first time:
- Install Ollama (`curl -fsSL https://ollama.com/install.sh | sh`) and run a small model (`ollama run llama3.2`).
- Test it against your actual workload — not benchmarks, your real prompts.
- Compare output quality honestly against the cloud model you are considering replacing.
- If quality is acceptable, measure your actual daily token volume for a week.
- Run the break-even math with your numbers.
If you are scaling to production:
- Deploy vLLM with your chosen model on GPU infrastructure.
- Set up an OpenAI-compatible routing layer.
- Start with cloud-only, then gradually shift predictable traffic to self-hosted.
- Monitor quality, latency, and cost continuously.
If you are using AI tools in your development workflow, our Claude Code advanced workflow guide covers practical workflows for AI-assisted coding — whether the models run locally or in the cloud. For GPU server hosting, see our Hetzner Cloud AI GPU server guide.
The Bottom Line
Self-hosting LLMs in 2026 is mature, practical, and cost-effective — at the right scale. The tools are production-ready. The models are capable. The hardware is accessible.
But cloud APIs are also better and cheaper than ever. The 4.5/4.6 generation of Claude models represents a 67% cost reduction over previous generations. OpenAI's GPT-4.1 nano costs $0.10 per million input tokens. At low to moderate volumes, cloud APIs win on total cost of ownership because you never pay for idle hardware.
The right answer for most teams in 2026 is not "self-host everything" or "use cloud for everything." It is a deliberate hybrid architecture that routes each request to the most cost-effective and appropriate inference backend based on volume, quality requirements, privacy constraints, and latency needs.
Start with cloud APIs. Measure your actual usage. When the numbers justify it — and only then — add self-hosted infrastructure for your predictable, high-volume workloads. Keep cloud access for frontier models and demand spikes. That is the architecture that optimizes for cost, capability, and flexibility simultaneously.