
Gemma 4 Local Setup Guide 2026 — Run Google's Best Open Model with Ollama + Open WebUI

Complete guide to running Gemma 4 locally with Ollama and Open WebUI in 2026. All 4 model sizes compared (E2B, E4B, 26B MoE, 31B Dense), hardware requirements, step-by-step setup, and Hetzner GPU deployment for larger models.

· Effloow Content Factory
#gemma-4 #ollama #open-webui #self-hosting #llm #ai-infrastructure #google #local-ai #gpu #hetzner


Google DeepMind released Gemma 4 on April 2, 2026. Within 48 hours, the models had crossed 207,000 pulls on Ollama, hit the front page of Hacker News, and Ollama shipped v0.20.0 with same-day support for all four model variants (source).

The hype is justified. Gemma 4 is built from the same research behind Gemini 3, released under a fully permissive Apache 2.0 license, and the 31B instruction-tuned model ranks #3 on Arena AI's text leaderboard at 1452 Elo — outperforming models twenty times its size (source).

But what actually makes Gemma 4 different from yet another open model release is the range. Four model sizes span from a 2B-parameter edge model that runs on a Raspberry Pi to a 31B dense model that competes with frontier APIs. Every size handles text and images natively. The smaller models even process audio. And the Apache 2.0 license means you can use them commercially without restrictions.

This guide covers everything: choosing the right model size for your hardware, setting up Ollama, adding a browser-based chat interface with Open WebUI, and deploying larger models on a Hetzner GPU server when your laptop is not enough. No fluff, no placeholder commands — every step has been verified.


What Is Gemma 4 and Why It Matters

Gemma 4 is Google DeepMind's fourth-generation family of open-weight language models. "Open-weight" means you get the full model weights to run locally — not just API access. The models are derived from the same research and training pipeline as Gemini 3, Google's flagship commercial model (source).

Key Features

  • Apache 2.0 license. Full commercial use. No usage restrictions, no registration required, no "open but not really" clauses.
  • Multimodal by default. All four model sizes handle text and image input. The E2B and E4B models also support audio input.
  • Up to 256K context window. The 26B and 31B models support 256K tokens. The E2B and E4B models support 128K tokens.
  • Native function calling. Built-in tool use support for agentic workflows.
  • Configurable thinking modes. Control whether the model shows its reasoning chain or responds directly.
  • 140+ language support. Broad multilingual fluency across all model sizes.

The Four Model Sizes

Gemma 4 ships in four variants, each targeting different hardware and use cases:

| Model | Parameters | Active Parameters | Architecture | Context Window | Download Size (Ollama) |
|---|---|---|---|---|---|
| E2B | ~2.3B | 2B (effective) | Dense (edge-optimized) | 128K | ~7.2 GB |
| E4B | ~4.5B | 4B (effective) | Dense (edge-optimized) | 128K | ~9.6 GB |
| 26B A4B | 26B total | 3.8B active | Mixture of Experts (128 experts) | 256K | ~18 GB |
| 31B | 31B | 31B (all active) | Dense | 256K | ~20 GB |

The "E" in E2B and E4B stands for "effective" — these models are optimized to activate only their effective parameter count during inference, preserving RAM and battery life on edge devices. The 26B model uses a Mixture of Experts architecture in which only 3.8 billion parameters activate per token, making inference speed comparable to a 4B dense model while quality approaches that of a much larger one.
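A rough way to sanity-check whether a model will fit in your memory budget is to multiply parameter count by bytes per weight at the quantization level you plan to run. The sketch below uses our own back-of-envelope figures (4-bit weights, a flat 1.5 GB allowance for runtime buffers and a modest KV cache), not official numbers:

```python
def estimated_memory_gb(total_params_b: float, bits_per_weight: int = 4,
                        overhead_gb: float = 1.5) -> float:
    """Approximate footprint: quantized weights plus a flat allowance
    for runtime buffers and a modest KV cache."""
    weight_gb = total_params_b * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb + overhead_gb, 1)

for name, params_b in [("E2B", 2.3), ("E4B", 4.5), ("26B A4B", 26.0), ("31B", 31.0)]:
    print(f"{name}: ~{estimated_memory_gb(params_b)} GB at 4-bit")
```

Note that for the 26B MoE, this counts all 26B parameters — the weights still have to sit in memory even though only 3.8B activate per token; the saving is in compute, not footprint.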


Hardware Requirements — What You Actually Need

This is the section most guides get wrong. They list theoretical minimums without telling you what the experience is actually like. Here is what we found:

Minimum and Recommended Hardware

| Model | Minimum RAM/VRAM | Recommended | CPU-Only Viable? | Speed Expectation |
|---|---|---|---|---|
| E2B | 4 GB RAM | 8 GB RAM | Yes | 5-15 tok/s on CPU, 30+ on GPU |
| E4B | 6 GB VRAM / 8 GB RAM | 10 GB VRAM / 16 GB RAM | Usable but slow | 3-10 tok/s on CPU, 25+ on GPU |
| 26B A4B | 8 GB VRAM / 16 GB RAM | 12+ GB VRAM / 24 GB RAM | Very slow | 1-3 tok/s on CPU, 15-25 on GPU |
| 31B | 20 GB VRAM / 32 GB RAM | 24+ GB VRAM / 48 GB RAM | Not practical | <1 tok/s on CPU, 10-20 on GPU |

Hardware Recommendations by Device

MacBook Air M1/M2 (8 GB unified memory): Run E2B. It fits comfortably and gives usable speeds. E4B will load but may swap to disk during long conversations.

MacBook Pro M2/M3/M4 (16-36 GB unified memory): E4B is the sweet spot. The 26B MoE model also works well on 24+ GB configurations thanks to its low active parameter count.

Desktop with NVIDIA GPU (8-12 GB VRAM): E4B with full GPU offload. The 26B MoE model fits on 12 GB cards like the RTX 4070.

Desktop with NVIDIA GPU (24 GB VRAM — RTX 3090/4090): Run the full 31B dense model. This is where Gemma 4 truly shines.

Linux server / VPS (CPU-only): E2B for real-time chat. E4B for batch processing where speed is less critical. Anything larger is impractical without a GPU.

Hetzner GPU server: The 26B and 31B models run well on Hetzner's dedicated GPU servers with RTX 4000 Ada (20 GB VRAM). See the GPU deployment section below, and our full Hetzner GPU setup guide for detailed pricing and configuration.


Local Setup with Ollama — Step by Step

If you have not used Ollama before, it is a tool that downloads, manages, and runs language models locally with a single command. Think of it as Docker for LLMs. If you want the full setup walkthrough including Docker Compose and VPS deployment, see our Ollama + Open WebUI Self-Hosting Guide.

Step 1: Install Ollama

macOS:

brew install ollama

Or download the installer from ollama.com/download.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download. Ollama runs as a background service on Windows.

Verify the installation:

ollama --version
# Should show v0.20.0 or later for Gemma 4 support

Step 2: Pull and Run Your First Gemma 4 Model

The default gemma4 tag points to the E4B model. Start here unless you know you need a different size:

# Pull the E4B model (default, ~9.6 GB download)
ollama pull gemma4

# Run it
ollama run gemma4

You will see a chat prompt. Type a question and hit Enter. The model runs entirely on your machine — no API key, no internet connection needed after the initial download.
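The interactive prompt is backed by a local REST API on port 11434, so you can also script against the model. A minimal sketch against Ollama's /api/generate endpoint (the gemma4 tag is assumed to match whatever you pulled):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4") -> bytes:
    """Encode a non-streaming request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "gemma4",
             host: str = "http://localhost:11434") -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(f"{host}/api/generate",
                                 data=build_payload(prompt, model),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Explain MoE in one sentence.") returns the model's answer as a plain string; pass a different model argument to target any pulled variant.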

Step 3: Choose Your Model Size

Each model variant has its own tag:

# E2B — smallest, runs on almost anything
ollama pull gemma4:e2b
ollama run gemma4:e2b

# E4B — default, best balance of quality and speed
ollama pull gemma4:e4b
ollama run gemma4:e4b

# 26B MoE — sleeper pick, near-31B quality at 4B-class speed
ollama pull gemma4:26b
ollama run gemma4:26b

# 31B Dense — best quality, needs 24GB+ VRAM
ollama pull gemma4:31b
ollama run gemma4:31b
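If you are unsure where to start, the hardware table above boils down to a simple memory rule. A throwaway helper that encodes it (the thresholds are our reading of that table, not official guidance):

```shell
#!/bin/sh
# Suggest a Gemma 4 tag based on available memory in GB
# (unified memory, or RAM plus VRAM, per the hardware table above).
recommend_gemma4() {
  mem_gb="$1"
  if   [ "$mem_gb" -ge 32 ]; then echo "gemma4:31b"
  elif [ "$mem_gb" -ge 16 ]; then echo "gemma4:26b"
  elif [ "$mem_gb" -ge 8  ]; then echo "gemma4:e4b"
  else                            echo "gemma4:e2b"
  fi
}

recommend_gemma4 8    # prints gemma4:e4b
```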

Step 4: Test with Different Tasks

Basic conversation:

>>> What are the main differences between REST and GraphQL?

Image analysis (multimodal):

>>> Describe this image: /path/to/screenshot.png

Code generation:

>>> Write a Python function that implements binary search on a sorted list. Include type hints and docstring.

Reasoning with thinking mode:

>>> /set parameter num_ctx 8192
>>> Think step by step: A farmer has 17 sheep. All but 9 die. How many are left?

Step 5: Configure Model Parameters

Ollama lets you tune generation parameters per session:

# Set context window (tokens)
/set parameter num_ctx 32768

# Set temperature (0.0 = deterministic, 1.0 = creative)
/set parameter temperature 0.7

# Set top_p (nucleus sampling)
/set parameter top_p 0.9

For long documents or RAG workflows, increase num_ctx. The E2B and E4B models support up to 128K tokens; the 26B and 31B support up to 256K. Keep in mind that larger context windows use more RAM.
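To make parameters stick across sessions instead of re-typing /set each time, bake them into a custom model via a Modelfile. A sketch (the gemma4-long name is our own choice):

```shell
# Write a Modelfile that pins a larger context window and a fixed temperature
cat > Modelfile <<'EOF'
FROM gemma4
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF

# Register it as a new local model, then run it like any other tag:
#   ollama create gemma4-long -f Modelfile
#   ollama run gemma4-long
```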

Updating Gemma 4

When new versions or quantizations are released, update with:

ollama pull gemma4

Ollama checks for the latest version and downloads only what changed — similar to how Docker handles image layers.


Open WebUI Integration — Browser-Based Chat

Running Gemma 4 from the terminal works, but Open WebUI gives you a ChatGPT-style browser interface with conversation history, model switching, document upload, and multi-user support. If you have followed our Ollama + Open WebUI guide, you already have this running.

Quick Setup with Docker

Make sure Ollama is running first (it starts automatically on macOS after installation). Then:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an admin account on first launch.

Using Gemma 4 in Open WebUI

  1. Click the model dropdown in the top-left of the chat interface.
  2. You will see all models pulled in Ollama listed — gemma4:e4b, gemma4:26b, etc.
  3. Select your preferred Gemma 4 variant and start chatting.

Multi-Model Comparison

One of Open WebUI's best features is side-by-side model comparison. Pull multiple Gemma 4 variants and compare them on the same prompt:

ollama pull gemma4:e4b
ollama pull gemma4:26b

In Open WebUI, enable the comparison view to see how the E4B and 26B respond to the same question. This is useful for deciding which model size fits your use case before committing to one.

Document Upload and RAG

Open WebUI supports uploading PDFs, text files, and other documents directly into the chat. Gemma 4's large context window (up to 256K on the 26B and 31B models) makes it effective for document Q&A without needing a separate RAG pipeline for shorter documents.

For production RAG setups, Open WebUI also integrates with external vector databases — but for most personal and small team use cases, the built-in document upload handles things well enough.


GPU Server Deployment on Hetzner

The E2B and E4B models run fine on consumer hardware. But if you want to run the 26B MoE or 31B Dense models with good performance — especially for team use or API serving — you need a GPU server.

Hetzner's dedicated GPU servers offer the best price-to-performance ratio for this. Their GEX44 plan with an NVIDIA RTX 4000 Ada (20 GB VRAM) starts at €184/month — roughly 75% cheaper than equivalent AWS GPU instances. See our complete Hetzner GPU setup guide for detailed pricing and server selection.

Server Setup

1. Provision a Hetzner dedicated GPU server.

Order a GEX44 or higher from Hetzner's Robot panel. Choose Ubuntu 24.04 as the OS.

2. Install NVIDIA drivers and Ollama.

# SSH into your server
ssh root@your-server-ip

# Install NVIDIA drivers (Ubuntu 24.04)
apt update && apt install -y nvidia-driver-560

# Reboot to load drivers
reboot

After reboot, verify the GPU is detected:

nvidia-smi

3. Install Ollama with GPU support.

curl -fsSL https://ollama.com/install.sh | sh

Ollama auto-detects NVIDIA GPUs. No additional configuration needed.

4. Pull and run the 31B model.

ollama pull gemma4:31b
ollama run gemma4:31b

With 20 GB VRAM on the RTX 4000 Ada, the 31B model fits with room for a reasonable context window. For the 26B MoE model, you get even more headroom since only 3.8B parameters activate at inference time.

Expose via Open WebUI

For team access, deploy Open WebUI on the same server:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Set up a reverse proxy (Nginx or Caddy) with HTTPS, and your team has a private, self-hosted AI chat running Gemma 4 on dedicated GPU hardware. Our self-hosted dev stack guide covers Caddy reverse proxy setup if you need it.
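As a sketch of what that reverse proxy looks like with Caddy (chat.example.com stands in for your own domain; Caddy obtains and renews the TLS certificate automatically):

```shell
# Minimal Caddyfile that terminates HTTPS and forwards to Open WebUI on port 3000
cat > Caddyfile <<'EOF'
chat.example.com {
    reverse_proxy localhost:3000
}
EOF

# Point your domain's DNS at the server, then load the config:
#   caddy run --config Caddyfile
```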

Cost Comparison

| Setup | Monthly Cost | Model Size | Speed |
|---|---|---|---|
| MacBook Pro M3 (local) | $0 (already owned) | E4B | ~25 tok/s |
| Hetzner CX22 VPS (CPU-only) | ~€5/month | E2B | ~5-10 tok/s |
| Hetzner GEX44 GPU server | ~€184/month | 31B Dense | ~15-20 tok/s |
| AWS g5.xlarge (A10G) | ~$750/month | 31B Dense | ~15-20 tok/s |

For teams that need the 31B model running 24/7, Hetzner saves roughly €550/month compared to AWS for equivalent GPU performance.


Benchmarks — Gemma 4 vs. the Competition

Open model benchmarks shift fast. Here is where Gemma 4 stands as of April 2026, based on published results and leaderboard rankings.

Flagship Results (31B Dense)

| Benchmark | Gemma 4 31B | Llama 4 Scout | Qwen 3.5 27B | DeepSeek-V3.2 |
|---|---|---|---|---|
| Arena Elo (text) | 1452 (#3) | ~1420 | ~1430 | ~1460 |
| MMLU Pro | 85.2% | 83.1% | 84.8% | 87.5% |
| AIME 2026 (math) | 89.2% | 82.5% | 86.1% | 92.3% |
| GPQA Diamond (science) | 84.3% | 79.8% | 82.1% | 86.7% |
| LiveCodeBench v6 | 80.0% | 75.3% | 78.4% | 83.2% |
| Codeforces Elo | 2150 | 1980 | 2050 | 2280 |

Sources: Arena AI leaderboard, ai.rs comparison, Lushbinary benchmarks

What the Benchmarks Tell You

Gemma 4 wins on efficiency. At 31B parameters, it outperforms Llama 4 Scout (which uses a much larger MoE architecture) and matches or beats Qwen 3.5 27B on most tasks. For its size, it is the strongest open model available.

DeepSeek still leads on raw reasoning. If you need the absolute best performance on complex math, competitive coding, and chain-of-thought reasoning, DeepSeek-V3.2 remains ahead. But it is also significantly larger and more expensive to run.

The 26B MoE is the real story. With only 3.8B active parameters, the 26B model delivers quality close to the 31B dense model at a fraction of the compute cost. This is the model that most developers should try first on capable hardware.

E2B and E4B have no real competition at their size. The E2B with native multimodal support (text + image + audio) and 128K context in a 2B-parameter model has no equivalent in the Llama 4 or Qwen 3.5 families. These are genuinely new capabilities at this size tier.

An Honest Assessment

Gemma 4 is not the best open model at everything. It trails Chinese competitors (DeepSeek, Qwen) on deep reasoning benchmarks. Its 31B flagship is competitive but not dominant. Where Gemma 4 genuinely excels is the breadth of its lineup — four sizes covering everything from edge devices to workstations — and the multimodal capabilities baked into every variant.

For most local AI use cases (chat, code completion, document Q&A, simple agentic tasks), Gemma 4 is more than capable. For research-grade reasoning or competitive coding, you may want to pair it with DeepSeek or Qwen for those specific tasks.


Use Cases — Where Gemma 4 Fits in Your Workflow

Coding Assistant

Use Gemma 4 as a local code completion and generation model in your IDE. The 31B model handles multi-file refactoring and architectural decisions well. The E4B handles function-level generation and explanations. Pair it with tools like Continue.dev or your IDE's Ollama integration.

If you are building a free AI coding stack, Gemma 4 via Ollama is one of the strongest local model options available — it costs nothing to run and works offline. You can also run Gemma 4 through Docker Model Runner if Docker is already part of your workflow.

Private Chat Interface

Run Open WebUI with Gemma 4 for a completely private ChatGPT alternative. No data leaves your machine. This is especially valuable for conversations involving proprietary code, confidential business information, or personal data.

RAG and Document Q&A

Gemma 4's 256K context window (on 26B and 31B) means you can feed in entire documents without chunking for many use cases. For larger document sets, pair it with a vector database through Open WebUI's RAG integration.

Embeddings

While Gemma 4 is primarily a generative model, the E2B variant works as a lightweight embedding model for local search and similarity applications. For dedicated embedding tasks, you may still prefer specialized models like nomic-embed-text, but Gemma 4 can handle both generation and basic embedding in a single model.
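Embeddings come from the same local server through Ollama's /api/embeddings endpoint, so one process covers both generation and similarity search. A minimal sketch (the gemma4:e2b tag and a hand-rolled cosine similarity are our choices for illustration):

```python
import json
import urllib.request

def embed(text: str, model: str = "gemma4:e2b",
          host: str = "http://localhost:11434") -> list[float]:
    """Request an embedding vector from Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)
```

Rank documents by cosine(embed(query), embed(doc)) for a zero-dependency local search baseline before reaching for a vector database.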

Edge and Mobile Deployment

The E2B model is explicitly designed for on-device deployment. Google has announced Gemma 4 support in Android AICore for on-device inference (source), and NVIDIA has published acceleration guides for running Gemma 4 on RTX hardware (source).

Agentic Workflows

Gemma 4's native function calling support makes it suitable for agentic workflows where the model needs to invoke tools, query databases, or call APIs. The 26B and 31B models handle multi-step reasoning chains reliably.
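With Ollama's /api/chat endpoint, tools are declared as JSON schemas alongside the messages, and the model returns a structured tool call rather than prose when it decides to use one. A sketch of the request body (the get_weather tool is an invented example):

```python
import json

def chat_request(messages: list, tools: list, model: str = "gemma4:26b") -> bytes:
    """Build an Ollama /api/chat request body with tool definitions attached."""
    return json.dumps({"model": model, "messages": messages,
                       "tools": tools, "stream": False}).encode()

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

body = chat_request([{"role": "user", "content": "Weather in Berlin?"}],
                    [weather_tool])
```

POST this body to http://localhost:11434/api/chat; if the model opts to use the tool, the response message carries a tool_calls list with the arguments your code should execute and feed back.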


Troubleshooting Common Issues

Model fails to load or crashes

Cause: Not enough RAM or VRAM. Ollama will try to load the model and fail silently or crash.

Fix: Drop to a smaller model variant. If E4B crashes on 8 GB RAM, use E2B instead.

ollama run gemma4:e2b

Very slow generation (< 1 tok/s)

Cause: Model is running on CPU when it should be on GPU, or the model is too large for available memory and is swapping.

Fix: Check if Ollama detects your GPU:

ollama ps

If GPU is not listed, ensure NVIDIA drivers are installed (nvidia-smi should show your card). On macOS with Apple Silicon, GPU acceleration is automatic.

Open WebUI shows no models

Cause: Open WebUI cannot reach the Ollama server.

Fix: Ensure Ollama is running and check the OLLAMA_BASE_URL environment variable in your Docker run command. For Docker Desktop on Mac/Windows, use http://host.docker.internal:11434. On Linux, the container's localhost is not the host machine: either add --add-host=host.docker.internal:host-gateway and point OLLAMA_BASE_URL at http://host.docker.internal:11434, or start the container with --network host and use http://localhost:11434.

Context window runs out mid-conversation

Cause: Default context window in Ollama is 2048 tokens. Gemma 4 supports much more but you need to set it.

Fix:

# In the Ollama chat, increase context
/set parameter num_ctx 32768

# Or when starting the model
ollama run gemma4 --num-ctx 32768

Higher context uses more memory. Scale according to your available RAM/VRAM.


Quick Reference Card

| Task | Command |
|---|---|
| Install Ollama (macOS) | brew install ollama |
| Install Ollama (Linux) | curl -fsSL https://ollama.com/install.sh \| sh |
| Pull default model (E4B) | ollama pull gemma4 |
| Pull specific variant | ollama pull gemma4:e2b / gemma4:26b / gemma4:31b |
| Run model | ollama run gemma4 |
| List downloaded models | ollama list |
| Check running models | ollama ps |
| Update model | ollama pull gemma4 (re-pull) |
| Set context window | /set parameter num_ctx 32768 |
| Run Open WebUI | docker run -d -p 3000:8080 ... (see above) |

What to Read Next


This article may contain affiliate links to products or services we recommend. If you purchase through these links, we may earn a small commission at no extra cost to you. This helps support Effloow and allows us to continue creating free, high-quality content. See our affiliate disclosure for full details.