Articles · 2026-05-01 · By Effloow Content Factory

Gemma 4 26B vs 31B: Which Model to Run Locally with Ollama

Gemma 4 26B MoE vs 31B Dense: VRAM requirements, quantization guide, thinking mode setup, and how to choose the right model for your hardware.

If you followed our Gemma 4 Local Setup Guide, you already have Ollama running and know that four model sizes exist. But once you're past the basics, one real question remains: between the 26B MoE and the 31B Dense, which one actually belongs on your machine?

They're not just different sizes. They're fundamentally different architectures with different speed profiles, different memory footprints, and meaningfully different strengths. This guide cuts through the surface-level comparison and gives you a clear framework for making the choice.


Why the 26B and 31B Are Different in a Non-Obvious Way

Most model comparisons treat parameter count as the main variable. More parameters equals better quality, slower inference, more VRAM. That's true for dense models, where every parameter is used during every forward pass.

Gemma 4 26B breaks that assumption. It's a Mixture of Experts (MoE) model — a design where the network learns to route each token through a subset of specialized "expert" layers rather than the full set. The 26B model has 26 billion total parameters, but during any single inference pass, only 3.8 billion are active (source).

The practical consequence: the 26B MoE runs at roughly the same speed as a 4B dense model while delivering quality that competes with 13B+ models. On the Arena AI text leaderboard, the 26B MoE holds the #6 spot among all open models. The 31B Dense holds #3.

Both are large-model slots. Neither is a shortcut.

Property | Gemma 4 26B MoE | Gemma 4 31B Dense
Total parameters | 26B | 31B
Active params per inference | ~3.8B | 31B
Architecture | Mixture of Experts | Dense
Arena AI leaderboard rank | #6 | #3
Context window | 256K tokens | 256K tokens
Multimodal (vision) | Yes | Yes
Thinking mode | Yes | Yes
License | Apache 2.0 | Apache 2.0
Best for | Speed-sensitive workflows | Max quality, fine-tuning
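
The active-parameters row is the one that drives speed. Token generation is largely bound by how many bytes of weights the GPU has to read per token, and the MoE only reads the experts routed for that token. A quick illustrative calculation, assuming ~4.5 effective bits per weight for Q4_K_M (this is memory traffic, not a throughput prediction; observed speeds also depend on compute, routing overhead, and the runtime):

# Illustrative only: bytes of weights read per generated token at Q4_K_M.
# The 4.5 effective bits per weight figure is an assumption.
BITS_PER_WEIGHT = 4.5

def weight_gb_per_token(active_params_billions: float) -> float:
    # billions of params * (bits / 8) bytes per param ≈ GB of weight traffic
    return active_params_billions * BITS_PER_WEIGHT / 8

moe = weight_gb_per_token(3.8)   # 26B MoE: only the routed experts are read
dense = weight_gb_per_token(31)  # 31B Dense: every weight, every token

print(f"26B MoE:   ~{moe:.1f} GB of weights per token")
print(f"31B Dense: ~{dense:.1f} GB of weights per token (~{dense / moe:.0f}x more traffic)")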

VRAM Requirements and Quantization

This is where the decision actually gets made. Both models run on consumer hardware when quantized — but the window is tighter than it looks.

Quantization Options in Ollama

Ollama ships GGUF quantizations for both models. The two you'll care about most are:

Q4_K_M — The default when you run ollama run gemma4:26b or ollama run gemma4:31b. A 4-bit k-quant at the medium quality tier. About 3–4x faster than FP16, with a small quality hit on complex math and multi-step reasoning. For chat, code generation, summarization, and general Q&A, the gap is barely measurable (source).

Q8_0 — 8-bit quantization, essentially lossless compared to FP16. Twice as slow as Q4 and requires roughly double the VRAM. Worth it when you can afford the hardware and the task genuinely demands precision — long-form reasoning, competitive coding benchmarks, structured tool use chains (source).

Model | Quantization | VRAM Required | Speed (RTX 4090) | Quality
26B MoE | Q4_K_M (default) | ~14–18 GB | ~40–45 tok/s | Strong
26B MoE | Q8_0 | ~26–28 GB | ~20–22 tok/s | Near-lossless
31B Dense | Q4_K_M (default) | ~20–22 GB | ~35–38 tok/s | Excellent
31B Dense | Q8_0 | ~34–36 GB | ~18–20 tok/s | Near-lossless

Speed figures sourced from community benchmarks on RTX 4090 (source).
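
If your card isn't in the table, a back-of-envelope estimate gets you close: weights take roughly parameters × bits ÷ 8 bytes, plus headroom for the KV cache and runtime overhead. A minimal sketch; the ~10% margin is an assumption, and real KV-cache usage grows with context length:

# Rough VRAM estimate: weight bytes at the quantization bit-width plus an
# assumed ~10% margin for KV cache and runtime overhead at modest context.
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.10) -> float:
    return params_billions * bits_per_weight / 8 * overhead

for name, params in [("26B MoE", 26), ("31B Dense", 31)]:
    for quant, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5)]:
        print(f"{name} {quant}: ~{estimate_vram_gb(params, bits):.0f} GB")

The estimates land within a couple of GB of the table above; trust the table where they disagree.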

Hardware Fit at a Glance

RTX 3090 (24 GB VRAM): 26B Q4_K_M fits comfortably. 31B Q4_K_M is tight — it fits, but you may see OOM with very long context. Neither Q8_0 fits.

RTX 4090 (24 GB VRAM): Same story as 3090, but with faster throughput. 26B Q4_K_M runs at 40–45 tok/s. 31B Q4_K_M runs at 35–38 tok/s.

Apple M2/M3 Pro (24 GB unified memory): 26B Q4_K_M fits. 31B Q4_K_M does not. If you have a 24GB Mac, the 26B is your ceiling.

Apple M2/M3 Max (32–36 GB): Both models fit at Q4_K_M. 26B Q8_0 also fits. 31B Q8_0 does not.

Apple M2/M3 Ultra (64–192 GB) or A100 (80 GB): All configurations fit. This is where 31B Q8_0 becomes practical.
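
On an NVIDIA box you can check which bucket you fall into programmatically. A minimal sketch using nvidia-smi, with thresholds taken from the table and list above (Apple Silicon users should check unified memory in About This Mac instead):

# Read total VRAM via nvidia-smi and map it to the configurations above.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
vram_gb = int(out.stdout.strip().splitlines()[0]) / 1024  # nvidia-smi reports MiB

if vram_gb >= 34:
    print(f"{vram_gb:.0f} GB: everything fits, including 31B Q8_0")
elif vram_gb >= 28:
    print(f"{vram_gb:.0f} GB: both Q4_K_M builds plus 26B Q8_0")
elif vram_gb >= 22:
    print(f"{vram_gb:.0f} GB: both Q4_K_M builds; 31B is tight with long context")
elif vram_gb >= 16:
    print(f"{vram_gb:.0f} GB: 26B Q4_K_M only")
else:
    print(f"{vram_gb:.0f} GB: look at the smaller Gemma 4 sizes instead")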


Installing Both Models with Ollama

If you don't have Ollama yet, install it from ollama.com. It runs as a background service and exposes a local API at http://localhost:11434.

Pull the Models

# 26B MoE — default Q4_K_M (~14 GB download)
ollama pull gemma4:26b

# 26B MoE — Q8_0 for higher quality (~26 GB download)
ollama pull gemma4:26b-a4b-it-q8_0

# 31B Dense — default Q4_K_M (~20 GB download)
ollama pull gemma4:31b

# 31B Dense — Q8_0 (~36 GB download)
ollama pull gemma4:31b-it-q8_0
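
To confirm the pulls landed (and see how much disk they actually took), you can list local models through the same API the CLI uses, assuming the default port:

# List locally available models and their on-disk size via the Ollama API.
import requests

models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")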

Run an Interactive Session

# Interactive chat with 26B MoE
ollama run gemma4:26b

# Interactive chat with 31B Dense
ollama run gemma4:31b

Query via API

Both models expose the same Ollama API endpoint:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma4:26b",
    "prompt": "Explain the difference between MoE and Dense transformer architectures.",
    "stream": false
  }'

Switch gemma4:26b to gemma4:31b to use the Dense model — the interface is identical.
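
The same call from Python, for when the model is wired into a script rather than tested from the shell. This mirrors the curl payload above; requests is the only dependency:

# Non-streaming generate call against the local Ollama API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:26b",  # or "gemma4:31b", same interface
        "prompt": "Explain the difference between MoE and Dense transformer architectures.",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])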


Enabling Thinking Mode

Both the 26B and 31B models support Gemma 4's configurable "thinking mode" — a chain-of-thought reasoning layer that the model shows before producing its final answer. It works well for tasks where intermediate reasoning matters: complex analysis, structured plans, multi-step math.

Turn Thinking On

Add <|think|> at the beginning of your system prompt:

import ollama

response = ollama.chat(
    model="gemma4:31b",
    messages=[
        {
            "role": "system",
            "content": "<|think|>\nYou are a senior software architect. Analyze problems step by step before giving a recommendation."
        },
        {
            "role": "user",
            "content": "Should we use a message queue or direct HTTP calls between these two services?"
        }
    ]
)
print(response['message']['content'])

Turn Thinking Off

Simply remove the <|think|> token from the system prompt. The model responds directly without showing its reasoning chain — faster and cleaner for production responses where the intermediate steps don't matter.
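
If you flip between the two modes often, a small wrapper keeps the <|think|> prefix from drifting out of sync across call sites. A sketch using the same ollama.chat interface as the example above:

import ollama

def ask(model: str, system: str, user: str, thinking: bool = False) -> str:
    # Prepend the <|think|> token only when a visible reasoning chain is wanted.
    if thinking:
        system = "<|think|>\n" + system
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response["message"]["content"]

# Reasoning-heavy question: thinking on. Quick classification: thinking off.
print(ask("gemma4:31b", "You are a senior software architect.",
          "Message queue or direct HTTP calls between these two services?",
          thinking=True))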

Edge Case on Larger Models

The 26B and 31B models sometimes generate thinking output even when the token is absent. If you see stray <think> blocks in responses when thinking is disabled, stabilize the behavior by adding an empty thought channel to your prompt (source):

# Add this to your system prompt when thinking mode is off but the model misbehaves
"content": "You are a helpful assistant.\n<|channel>thought<channel|>"

This signals to the model that the thinking channel is present but intentionally empty, preventing spurious thought generation.


Choosing the Right Model for Your Use Case

The architecture difference creates a clear split in where each model wins.

Use 26B MoE When

Interactive development assistant. You're using the model in a REPL, IDE extension, or chat interface where response latency matters. The 40–45 tok/s on a 4090 means answers arrive in 2–4 seconds. The 31B takes 30–50% longer for the same output length (source).

Hardware is the constraint. You have a 24 GB GPU or a Mac with ≤24 GB unified memory. The 26B Q4_K_M is the best model that fits. There's no point comparing it to the 31B — the 31B won't run on your hardware.

Agentic pipelines with many calls. When a workflow makes 20–50 model calls per task (tool selection, intermediate reasoning, response formatting), the 2.5x speed advantage compounds. A pipeline that takes 3 minutes with 26B takes 7–8 minutes with 31B.

Summarization, classification, extraction. Tasks where quality is already saturated at lower capability levels. There's no perceptible difference between 26B and 31B for extracting structured data from a paragraph.

Use 31B Dense When

Maximum quality on reasoning tasks. The 31B's advantages are clearest on AIME math (89.2%), LiveCodeBench competitive coding (80%), and GPQA scientific reasoning (84.3%). If your task requires sustained multi-step reasoning, the Dense architecture's full parameter activation helps (source).

Fine-tuning base. If you plan to fine-tune for a domain-specific task, the Dense architecture is simpler to work with. MoE fine-tuning requires routing-aware optimization and is meaningfully more complex. Use the 31B Dense as your starting point.

You have the hardware. If you have a 32 GB+ Mac or a GPU setup with 24+ GB VRAM, the quality edge is worth having. The 31B running at 35 tok/s is still fast enough for most interactive workflows.

Highest-stakes single-shot outputs. Cover letters, legal summaries, architecture reviews — cases where you run the prompt once and act on the result. The Dense model's extra quality margin costs you a little latency, and latency barely matters when the prompt only runs once.

Which One?

For most developers on consumer hardware, the 26B MoE is the right call. It's 2–2.5x faster, fits in 24 GB VRAM, and closes 90–95% of the quality gap with the 31B Dense. Save the 31B for tasks where reasoning quality is genuinely the bottleneck, or as a fine-tuning base when you have the hardware for it.


Using Vision Input with Both Models

Both the 26B and 31B support multimodal image input. In Ollama, pass images as base64 strings or file paths:

import ollama
import base64

with open("architecture-diagram.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma4:26b",
    messages=[
        {
            "role": "user",
            "content": "Identify the bottlenecks in this system architecture.",
            "images": [image_data]
        }
    ]
)
print(response['message']['content'])

Gemma 4 supports variable image resolution through a configurable visual token budget (70, 140, 280, 560, or 1120 tokens). Ollama uses the model defaults — higher resolution images consume more tokens from your context window but preserve more visual detail (source).
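
The budget mostly matters when a prompt carries several images, since every visual token comes out of the same 256K window as your text. Some quick arithmetic over the budgets listed above (illustrative only; the effective per-image cost is set by the model configuration, not by this script):

# Context left for text after attaching images at each visual token budget.
CONTEXT_WINDOW = 256_000

for budget in (70, 140, 280, 560, 1120):
    for n_images in (1, 8):
        used = budget * n_images
        print(f"{n_images} image(s) at {budget} tokens each: "
              f"{CONTEXT_WINDOW - used:,} tokens left for text")

Even eight images at the highest budget cost under 9K tokens, so the budget is mostly a detail-versus-latency trade rather than a context problem.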


Common Mistakes

Pulling both models without checking available VRAM first. The 26B Q4_K_M is 14 GB on disk and needs ~16–18 GB of VRAM including KV cache. The 31B Q4_K_M is ~20 GB and needs ~22 GB. Know your ceiling before pulling 30+ GB of model weights.

Using Q8_0 when Q4_K_M is sufficient. For the majority of developer tasks, Q4_K_M and Q8_0 produce indistinguishable results. Q8_0 doubles memory requirements for a quality difference that matters only on hard math and competitive coding. Start with Q4_K_M and upgrade to Q8_0 only if you observe quality issues on your specific workload.

Leaving thinking mode on for all tasks. Thinking mode adds tokens before the final answer, which slows output and consumes context. Turn it off for fast-turnaround tasks (chat, extraction, classification) and reserve it for deliberate reasoning chains.

Mixing up model tags. gemma4:26b is the same as gemma4:26b-a4b-it-q4_K_M — both are Q4_K_M. If you explicitly need Q8_0, pull gemma4:26b-a4b-it-q8_0. The default tag does not pick the highest quality; it picks the most compatible option for consumer hardware.

Assuming speed parity on Apple Silicon. On M-series Macs, the speed advantage of the 26B MoE over the 31B Dense narrows compared to NVIDIA GPUs. The unified memory architecture handles the Dense model's full parameter load more efficiently. Expect 20–25% speed difference on Apple Silicon vs 150–200% on NVIDIA.


Q: Can I run both models at the same time?

Not on a single GPU without enough VRAM for both. The 26B Q4_K_M takes ~16–18 GB and the 31B Q4_K_M takes ~22 GB — that's 38+ GB combined, which exceeds most consumer setups. If you need to switch models frequently, just use ollama run with the specific tag — Ollama keeps the active model in VRAM and swaps on demand. Swapping takes 5–15 seconds.
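
If you're scripting that back-and-forth, a sketch like the one below makes the swap explicit. It assumes your Ollama version supports the keep_alive request parameter, which tells the server how long to keep a model loaded after the call (0 means unload immediately, freeing VRAM for the next model); the prompts are placeholders:

import ollama

def one_shot(model: str, prompt: str) -> str:
    # keep_alive=0 asks Ollama to unload this model right after it responds.
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        keep_alive=0,
    )
    return response["message"]["content"]

# Placeholder prompts: draft with the fast MoE, review with the Dense model.
draft = one_shot("gemma4:26b", "Draft a summary of the design notes: ...")
review = one_shot("gemma4:31b", "Critically review this summary:\n" + draft)
print(review)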

Q: Is the 26B MoE harder to prompt than the 31B Dense?

Not meaningfully. Both models respond to the same instruction formats and system prompt patterns. The main behavioral difference is that the 26B MoE can occasionally be more sensitive to ambiguous prompts — the routing mechanism may amplify uncertainty. On clear, specific prompts, behavior is nearly identical.

Q: Does thinking mode work for vision tasks?

Yes, but with caveats. In multi-turn conversations that include tool calls, the model requires that thinking output is not included between function calls. Keep thinking-mode output to final text turns only when building tool-use pipelines (source).

Q: How does 31B Dense compare to GPT-4o-mini for code generation?

On LiveCodeBench v6, Gemma 4 31B scores 80%. GPT-4o-mini's published score on the same benchmark is lower. For local self-hosted inference with no API costs and Apache 2.0 licensing, the 31B Dense is a strong alternative. Use case and specific language matter — the benchmark is competitive coding; production code generation results may differ.

Q: Which model should I use for RAG pipelines?

The 26B MoE for throughput-heavy pipelines where you're making many retrieval-augmented calls per minute. The 31B Dense if accuracy on the synthesis step is the bottleneck. Either model handles the 256K context window required for large document processing.
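
As a sketch of where each model slots into such a pipeline: use the faster MoE for the per-chunk calls and, if synthesis quality is the bottleneck, hand the final answer to the Dense model. Retrieval itself is out of scope here; retrieved_chunks and question are placeholders for whatever your retriever and application supply:

import ollama

retrieved_chunks = ["Chunk 1: ...", "Chunk 2: ..."]  # placeholder retriever output
question = "..."                                     # placeholder user question

context = "\n\n".join(retrieved_chunks)
response = ollama.chat(
    model="gemma4:26b",  # swap to gemma4:31b if the synthesis step needs more quality
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response["message"]["content"])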


Key Takeaways

  • Gemma 4 26B is a Mixture of Experts model — only 3.8B parameters active per inference, giving 4B-class speed with 13B-class quality. The 31B Dense runs all 31B parameters every time.
  • Hardware determines the decision as much as quality does. If you have ≤24 GB VRAM, the 26B Q4_K_M is your answer. If you have 32 GB+, the 31B opens up.
  • Q4_K_M is the practical default for most workflows. Q8_0 is for fine-tuning pipelines, competitive benchmarks, and setups with abundant VRAM.
  • Thinking mode is per-prompt — add <|think|> to system prompts where reasoning chains matter, remove it for fast-turnaround tasks.
  • Both models are Apache 2.0. No usage restrictions, no registration, no API rate limits — just local inference at whatever throughput your hardware supports.

The Gemma 4 family is the strongest collection of local open models available as of mid-2026, and the 26B and 31B are where the real production decisions live. Choose the right architecture for your hardware and your task type, and you won't find yourself missing the frontier APIs or their per-token costs.
