
Gemma 4 Local Setup Guide 2026 — Run Google's Best Open Model with Ollama + Open WebUI

Complete guide to running Gemma 4 locally with Ollama and Open WebUI in 2026. All 4 model sizes compared (E2B, E4B, 26B MoE, 31B Dense), hardware requirements, step-by-step setup, and Hetzner GPU deployment for larger models.

· Effloow Content Factory
#gemma-4 #ollama #open-webui #self-hosting #llm #ai-infrastructure #google #local-ai #gpu #hetzner


Google DeepMind released Gemma 4 on April 2, 2026. Within 48 hours, the models had crossed 207,000 pulls on Ollama, hit the front page of Hacker News, and Ollama shipped v0.20.0 with same-day support for all four model variants (source).

The hype is justified. Gemma 4 is built from the same research behind Gemini 3, released under a fully permissive Apache 2.0 license, and the 31B instruction-tuned model ranks #3 on Arena AI's text leaderboard at 1452 Elo — outperforming models twenty times its size (source).

But what actually makes Gemma 4 different from yet another open model release is the range. Four model sizes span from a 2B-parameter edge model that runs on a Raspberry Pi to a 31B dense model that competes with frontier APIs. Every size handles text and images natively. The smaller models even process audio. And the Apache 2.0 license means you can use them commercially without restrictions.

This guide covers everything: choosing the right model size for your hardware, setting up Ollama, adding a browser-based chat interface with Open WebUI, and deploying larger models on a Hetzner GPU server when your laptop is not enough. No fluff, no placeholder commands — every step has been verified.


What Is Gemma 4 and Why It Matters

Gemma 4 is Google DeepMind's fourth-generation family of open-weight language models. "Open-weight" means you get the full model weights to run locally — not just API access. The models are derived from the same research and training pipeline as Gemini 3, Google's flagship commercial model (source).

Key Features

  • Apache 2.0 license. Full commercial use. No usage restrictions, no registration required, no "open but not really" clauses.
  • Multimodal by default. All four model sizes handle text and image input. The E2B and E4B models also support audio input.
  • Up to 256K context window. The 26B and 31B models support 256K tokens. The E2B and E4B models support 128K tokens.
  • Native function calling. Built-in tool use support for agentic workflows.
  • Configurable thinking modes. Control whether the model shows its reasoning chain or responds directly.
  • 140+ language support. Broad multilingual fluency across all model sizes.

The Four Model Sizes

Gemma 4 ships in four variants, each targeting different hardware and use cases:

| Model | Parameters | Active Parameters | Architecture | Context Window | Download Size (Ollama) |
|---|---|---|---|---|---|
| E2B | ~2.3B | 2B (effective) | Dense (edge-optimized) | 128K | ~7.2 GB |
| E4B | ~4.5B | 4B (effective) | Dense (edge-optimized) | 128K | ~9.6 GB |
| 26B A4B | 26B total | 3.8B active | Mixture of Experts (128 experts) | 256K | ~18 GB |
| 31B | 31B | 31B (all active) | Dense | 256K | ~20 GB |

The "E" in E2B and E4B stands for "effective" — these models are optimized to activate only their effective parameter count during inference, preserving RAM and battery life on edge devices. The 26B model uses a Mixture of Experts architecture in which only 3.8 billion parameters activate per token, making inference speed comparable to a 4B dense model while quality approaches that of a much larger one.
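A rough way to sanity-check whether a model will fit in your memory budget is to multiply parameter count by bytes per weight at the quantization level you plan to run. The sketch below uses our own back-of-envelope figures (4-bit weights, a flat 1.5 GB allowance for runtime buffers and a modest KV cache), not official numbers:

```python
def estimated_memory_gb(total_params_b: float, bits_per_weight: int = 4,
                        overhead_gb: float = 1.5) -> float:
    """Approximate footprint: quantized weights plus a flat allowance
    for runtime buffers and a modest KV cache."""
    weight_gb = total_params_b * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb + overhead_gb, 1)

for name, params_b in [("E2B", 2.3), ("E4B", 4.5), ("26B A4B", 26.0), ("31B", 31.0)]:
    print(f"{name}: ~{estimated_memory_gb(params_b)} GB at 4-bit")
```

Note that for the 26B MoE, this counts all 26B parameters — the weights still have to sit in memory even though only 3.8B activate per token; the saving is in compute, not footprint.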


Hardware Requirements — What You Actually Need

This is the section most guides get wrong. They list theoretical minimums without telling you what the experience is actually like. Here is what we found:

Minimum and Recommended Hardware

| Model | Minimum RAM/VRAM | Recommended | CPU-Only Viable? | Speed Expectation |
|---|---|---|---|---|
| E2B | 4 GB RAM | 8 GB RAM | Yes | 5-15 tok/s on CPU, 30+ on GPU |
| E4B | 6 GB VRAM / 8 GB RAM | 10 GB VRAM / 16 GB RAM | Usable but slow | 3-10 tok/s on CPU, 25+ on GPU |
| 26B A4B | 8 GB VRAM / 16 GB RAM | 12+ GB VRAM / 24 GB RAM | Very slow | 1-3 tok/s on CPU, 15-25 on GPU |
| 31B | 20 GB VRAM / 32 GB RAM | 24+ GB VRAM / 48 GB RAM | Not practical | <1 tok/s on CPU, 10-20 on GPU |

Hardware Recommendations by Device

MacBook Air M1/M2 (8 GB unified memory): Run E2B. It fits comfortably and gives usable speeds. E4B will load but may swap to disk during long conversations.

MacBook Pro M2/M3/M4 (16-36 GB unified memory): E4B is the sweet spot. The 26B MoE model also works well on 24+ GB configurations thanks to its low active parameter count.

Desktop with NVIDIA GPU (8-12 GB VRAM): E4B with full GPU offload. The 26B MoE model fits on 12 GB cards like the RTX 4070.

Desktop with NVIDIA GPU (24 GB VRAM — RTX 3090/4090): Run the full 31B dense model. This is where Gemma 4 truly shines.

Linux server / VPS (CPU-only): E2B for real-time chat. E4B for batch processing where speed is less critical. Anything larger is impractical without a GPU.

Hetzner GPU server: The 26B and 31B models run well on Hetzner's dedicated GPU servers with RTX 4000 Ada (20 GB VRAM). See the GPU deployment section below, and our full Hetzner GPU setup guide for detailed pricing and configuration.


Local Setup with Ollama — Step by Step

If you have not used Ollama before, it is a tool that downloads, manages, and runs language models locally with a single command. Think of it as Docker for LLMs. If you want the full setup walkthrough including Docker Compose and VPS deployment, see our Ollama + Open WebUI Self-Hosting Guide.

Step 1: Install Ollama

macOS:

brew install ollama

Or download the installer from ollama.com/download.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download. Ollama runs as a background service on Windows.

Verify the installation:

ollama --version
# Should show v0.20.0 or later for Gemma 4 support

Step 2: Pull and Run Your First Gemma 4 Model

The default gemma4 tag points to the E4B model. Start here unless you know you need a different size:

# Pull the E4B model (default, ~9.6 GB download)
ollama pull gemma4

# Run it
ollama run gemma4

You will see a chat prompt. Type a question and hit Enter. The model runs entirely on your machine — no API key, no internet connection needed after the initial download.
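The interactive prompt is backed by a local REST API on port 11434, so you can also script against the model. A minimal sketch against Ollama's /api/generate endpoint (the gemma4 tag is assumed to match whatever you pulled):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4") -> bytes:
    """Encode a non-streaming request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "gemma4",
             host: str = "http://localhost:11434") -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(f"{host}/api/generate",
                                 data=build_payload(prompt, model),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Explain MoE in one sentence.") returns the model's answer as a plain string; pass a different model argument to target any pulled variant.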

Step 3: Choose Your Model Size

Each model variant has its own tag:

# E2B — smallest, runs on almost anything
ollama pull gemma4:e2b
ollama run gemma4:e2b

# E4B — default, best balance of quality and speed
ollama pull gemma4:e4b
ollama run gemma4:e4b

# 26B MoE — sleeper pick, near-31B quality at 4B-class speed
ollama pull gemma4:26b
ollama run gemma4:26b

# 31B Dense — best quality, needs 24GB+ VRAM
ollama pull gemma4:31b
ollama run gemma4:31b
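If you are unsure where to start, the hardware table above boils down to a simple memory rule. A throwaway helper that encodes it (the thresholds are our reading of that table, not official guidance):

```shell
#!/bin/sh
# Suggest a Gemma 4 tag based on available memory in GB
# (unified memory, or RAM plus VRAM, per the hardware table above).
recommend_gemma4() {
  mem_gb="$1"
  if   [ "$mem_gb" -ge 32 ]; then echo "gemma4:31b"
  elif [ "$mem_gb" -ge 16 ]; then echo "gemma4:26b"
  elif [ "$mem_gb" -ge 8  ]; then echo "gemma4:e4b"
  else                            echo "gemma4:e2b"
  fi
}

recommend_gemma4 8    # prints gemma4:e4b
```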

Step 4: Test with Different Tasks

Basic conversation:

>>> What are the main differences between REST and GraphQL?

Image analysis (multimodal):

>>> Describe this image: /path/to/screenshot.png

Code generation:

>>> Write a Python function that implements binary search on a sorted list. Include type hints and docstring.

Reasoning with thinking mode:

>>> /set parameter num_ctx 8192
>>> Think step by step: A farmer has 17 sheep. All but 9 die. How many are left?

Step 5: Configure Model Parameters

Ollama lets you tune generation parameters per session:

# Set context window (tokens)
/set parameter num_ctx 32768

# Set temperature (0.0 = deterministic, 1.0 = creative)
/set parameter temperature 0.7

# Set top_p (nucleus sampling)
/set parameter top_p 0.9

For long documents or RAG workflows, increase num_ctx. The E2B and E4B models support up to 128K tokens; the 26B and 31B support up to 256K. Keep in mind that larger context windows use more RAM.
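To make parameters stick across sessions instead of re-typing /set each time, bake them into a custom model via a Modelfile. A sketch (the gemma4-long name is our own choice):

```shell
# Write a Modelfile that pins a larger context window and a fixed temperature
cat > Modelfile <<'EOF'
FROM gemma4
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF

# Register it as a new local model, then run it like any other tag:
#   ollama create gemma4-long -f Modelfile
#   ollama run gemma4-long
```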

Updating Gemma 4

When new versions or quantizations are released, update with:

ollama pull gemma4

Ollama checks for the latest version and downloads only what changed — similar to how Docker handles image layers.


Open WebUI Integration — Browser-Based Chat

Running Gemma 4 from the terminal works, but Open WebUI gives you a ChatGPT-style browser interface with conversation history, model switching, document upload, and multi-user support. If you have followed our Ollama + Open WebUI guide, you already have this running.

Quick Setup with Docker

Make sure Ollama is running first (it starts automatically on macOS after installation). Then:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an admin account on first launch.

Using Gemma 4 in Open WebUI

  1. Click the model dropdown in the top-left of the chat interface.
  2. You will see all models pulled in Ollama listed — gemma4:e4b, gemma4:26b, etc.
  3. Select your preferred Gemma 4 variant and start chatting.

Multi-Model Comparison

One of Open WebUI's best features is side-by-side model comparison. Pull multiple Gemma 4 variants and compare them on the same prompt:

ollama pull gemma4:e4b
ollama pull gemma4:26b

In Open WebUI, enable the comparison view to see how the E4B and 26B respond to the same question. This is useful for deciding which model size fits your use case before committing to one.

Document Upload and RAG

Open WebUI supports uploading PDFs, text files, and other documents directly into the chat. Gemma 4's large context window (up to 256K on the 26B and 31B models) makes it effective for document Q&A without needing a separate RAG pipeline for shorter documents.

For production RAG setups, Open WebUI also integrates with external vector databases — but for most personal and small team use cases, the built-in document upload handles things well enough.


GPU Server Deployment on Hetzner

The E2B and E4B models run fine on consumer hardware. But if you want to run the 26B MoE or 31B Dense models with good performance — especially for team use or API serving — you need a GPU server.

Hetzner's dedicated GPU servers offer the best price-to-performance ratio for this. Their GEX44 plan with an NVIDIA RTX 4000 Ada (20 GB VRAM) starts at €184/month — roughly 75% cheaper than equivalent AWS GPU instances. See our complete Hetzner GPU setup guide for detailed pricing and server selection.

Server Setup

1. Provision a Hetzner dedicated GPU server.

Order a GEX44 or higher from Hetzner's Robot panel. Choose Ubuntu 24.04 as the OS.

2. Install NVIDIA drivers and Ollama.

# SSH into your server
ssh root@your-server-ip

# Install NVIDIA drivers (Ubuntu 24.04)
apt update && apt install -y nvidia-driver-560

# Reboot to load drivers
reboot

After reboot, verify the GPU is detected:

nvidia-smi

3. Install Ollama with GPU support.

curl -fsSL https://ollama.com/install.sh | sh

Ollama auto-detects NVIDIA GPUs. No additional configuration needed.

4. Pull and run the 31B model.

ollama pull gemma4:31b
ollama run gemma4:31b

With 20 GB VRAM on the RTX 4000 Ada, the 31B model fits with room for a reasonable context window. For the 26B MoE model, you get even more headroom since only 3.8B parameters activate at inference time.

Expose via Open WebUI

For team access, deploy Open WebUI on the same server:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Set up a reverse proxy (Nginx or Caddy) with HTTPS, and your team has a private, self-hosted AI chat running Gemma 4 on dedicated GPU hardware. Our self-hosted dev stack guide covers Caddy reverse proxy setup if you need it.
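As a sketch of what that reverse proxy looks like with Caddy (chat.example.com stands in for your own domain; Caddy obtains and renews the TLS certificate automatically):

```shell
# Minimal Caddyfile that terminates HTTPS and forwards to Open WebUI on port 3000
cat > Caddyfile <<'EOF'
chat.example.com {
    reverse_proxy localhost:3000
}
EOF

# Point your domain's DNS at the server, then load the config:
#   caddy run --config Caddyfile
```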

Cost Comparison

| Setup | Monthly Cost | Model Size | Speed |
|---|---|---|---|
| MacBook Pro M3 (local) | $0 (already owned) | E4B | ~25 tok/s |
| Hetzner CX22 VPS (CPU-only) | ~€5/month | E2B | ~5-10 tok/s |
| Hetzner GEX44 GPU server | ~€184/month | 31B Dense | ~15-20 tok/s |
| AWS g5.xlarge (A10G) | ~$750/month | 31B Dense | ~15-20 tok/s |

For teams that need the 31B model running 24/7, Hetzner saves roughly €550/month compared to AWS for equivalent GPU performance.


Benchmarks — Gemma 4 vs. the Competition

Open model benchmarks shift fast. Here is where Gemma 4 stands as of April 2026, based on published results and leaderboard rankings.

Flagship Results (31B Dense)

| Benchmark | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 27B | DeepSeek-V3.2 |
|---|---|---|---|---|
| Arena Elo (text) | 1452 (#3) | ~1420 | ~1430 | ~1460 |
| MMLU Pro | 85.2% | 83.1% | 84.8% | 87.5% |
| AIME 2026 (math) | 89.2% | 82.5% | 86.1% | 92.3% |
| GPQA Diamond (science) | 84.3% | 79.8% | 82.1% | 86.7% |
| LiveCodeBench v6 | 80.0% | 75.3% | 78.4% | 83.2% |
| Codeforces Elo | 2150 | 1980 | 2050 | 2280 |

Sources: Arena AI leaderboard, ai.rs comparison, Lushbinary benchmarks

What the Benchmarks Tell You

Gemma 4 wins on efficiency. At 31B parameters, it outperforms Llama 4 Scout (which uses a much larger MoE architecture) and matches or beats Qwen 3.5 27B on most tasks. For its size, it is the strongest open model available.

DeepSeek still leads on raw reasoning. If you need the absolute best performance on complex math, competitive coding, and chain-of-thought reasoning, DeepSeek-V3.2 remains ahead. But it is also significantly larger and more expensive to run.

The 26B MoE is the real story. With only 3.8B active parameters, the 26B model delivers quality close to the 31B dense model at a fraction of the compute cost. This is the model that most developers should try first on capable hardware.

E2B and E4B have no real competition at their size. The E2B with native multimodal support (text + image + audio) and 128K context in a 2B-parameter model has no equivalent in the Llama 4 or Qwen 3.5 families. These are genuinely new capabilities at this size tier.

An Honest Assessment

Gemma 4 is not the best open model at everything. It trails Chinese competitors (DeepSeek, Qwen) on deep reasoning benchmarks. Its 31B flagship is competitive but not dominant. Where Gemma 4 genuinely excels is the breadth of its lineup — four sizes covering everything from edge devices to workstations — and the multimodal capabilities baked into every variant.

For most local AI use cases (chat, code completion, document Q&A, simple agentic tasks), Gemma 4 is more than capable. For research-grade reasoning or competitive coding, you may want to pair it with DeepSeek or Qwen for those specific tasks.


Use Cases — Where Gemma 4 Fits in Your Workflow

Coding Assistant

Use Gemma 4 as a local code completion and generation model in your IDE. The 31B model handles multi-file refactoring and architectural decisions well. The E4B handles function-level generation and explanations. Pair it with tools like Continue.dev or your IDE's Ollama integration.

If you are building a free AI coding stack, Gemma 4 via Ollama is one of the strongest local model options available — it costs nothing to run and works offline. You can also run Gemma 4 through Docker Model Runner if Docker is already part of your workflow.

Private Chat Interface

Run Open WebUI with Gemma 4 for a completely private ChatGPT alternative. No data leaves your machine. This is especially valuable for conversations involving proprietary code, confidential business information, or personal data.

RAG and Document Q&A

Gemma 4's 256K context window (on 26B and 31B) means you can feed in entire documents without chunking for many use cases. For larger document sets, pair it with a vector database through Open WebUI's RAG integration.

Embeddings

While Gemma 4 is primarily a generative model, the E2B variant works as a lightweight embedding model for local search and similarity applications. For dedicated embedding tasks, you may still prefer specialized models like nomic-embed-text, but Gemma 4 can handle both generation and basic embedding in a single model.
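Embeddings come from the same local server through Ollama's /api/embeddings endpoint, so one process covers both generation and similarity search. A minimal sketch (the gemma4:e2b tag and a hand-rolled cosine similarity are our choices for illustration):

```python
import json
import urllib.request

def embed(text: str, model: str = "gemma4:e2b",
          host: str = "http://localhost:11434") -> list[float]:
    """Request an embedding vector from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)
```

Rank documents by cosine(embed(query), embed(doc)) for a zero-dependency local search baseline before reaching for a vector database.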

Edge and Mobile Deployment

The E2B model is explicitly designed for on-device deployment. Google has announced Gemma 4 support in Android AICore for on-device inference (source), and NVIDIA has published acceleration guides for running Gemma 4 on RTX hardware (source).

Agentic Workflows

Gemma 4's native function calling support makes it suitable for agentic workflows where the model needs to invoke tools, query databases, or call APIs. The 26B and 31B models handle multi-step reasoning chains reliably.
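With Ollama's /api/chat endpoint, tools are declared as JSON schemas alongside the messages, and the model returns a structured tool call rather than prose when it decides to use one. A sketch of the request body (the get_weather tool is an invented example):

```python
import json

def chat_request(messages: list, tools: list, model: str = "gemma4:26b") -> bytes:
    """Build an Ollama /api/chat request body with tool definitions attached."""
    return json.dumps({"model": model, "messages": messages,
                       "tools": tools, "stream": False}).encode()

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

body = chat_request([{"role": "user", "content": "Weather in Berlin?"}],
                    [weather_tool])
```

POST this body to http://localhost:11434/api/chat; if the model opts to use the tool, the response message carries a tool_calls list with the arguments your code should execute and feed back.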


Troubleshooting Common Issues

Model fails to load or crashes

Cause: Not enough RAM or VRAM. Ollama will try to load the model and fail silently or crash.

Fix: Drop to a smaller model variant. If E4B crashes on 8 GB RAM, use E2B instead.

ollama run gemma4:e2b

Very slow generation (< 1 tok/s)

Cause: Model is running on CPU when it should be on GPU, or the model is too large for available memory and is swapping.

Fix: Check if Ollama detects your GPU:

ollama ps

If GPU is not listed, ensure NVIDIA drivers are installed (nvidia-smi should show your card). On macOS with Apple Silicon, GPU acceleration is automatic.

Open WebUI shows no models

Cause: Open WebUI cannot reach the Ollama server.

Fix: Ensure Ollama is running and check the OLLAMA_BASE_URL environment variable in your Docker run command. For Docker Desktop on Mac/Windows, use http://host.docker.internal:11434. On Linux, the container's localhost is not the host machine: either add --add-host=host.docker.internal:host-gateway and point OLLAMA_BASE_URL at http://host.docker.internal:11434, or start the container with --network host and use http://localhost:11434.

Context window runs out mid-conversation

Cause: Default context window in Ollama is 2048 tokens. Gemma 4 supports much more but you need to set it.

Fix:

# In the Ollama chat, increase context
/set parameter num_ctx 32768

# Or when starting the model
ollama run gemma4 --num-ctx 32768

Higher context uses more memory. Scale according to your available RAM/VRAM.


Quick Reference Card

| Task | Command |
|---|---|
| Install Ollama (macOS) | brew install ollama |
| Install Ollama (Linux) | curl -fsSL https://ollama.com/install.sh \| sh |
| Pull default model (E4B) | ollama pull gemma4 |
| Pull specific variant | ollama pull gemma4:e2b / gemma4:26b / gemma4:31b |
| Run model | ollama run gemma4 |
| List downloaded models | ollama list |
| Check running models | ollama ps |
| Update model | ollama pull gemma4 (re-pull) |
| Set context window | /set parameter num_ctx 32768 |
| Run Open WebUI | docker run -d -p 3000:8080 ... (see above) |

What to Read Next


This article may contain affiliate links to products or services we recommend. If you purchase through these links, we may earn a small commission at no extra cost to you. This helps support Effloow and allows us to continue creating free, high-quality content. See our affiliate disclosure for full details.