You've probably seen the queries flying around - "gemma4 ollama tags 2026", "gemma 4 ollama available models", "how much VRAM does gemma 4 need". Google dropped Gemma 4 quietly on April 2nd and it's already one of the most downloaded model families on Ollama. The problem? The model naming is confusing (what even is E2B vs E4B?), and nobody's written a clean breakdown of what runs on what hardware. This is that guide.

I'll cover every available Ollama tag, the exact VRAM you need for each, real benchmark numbers, and how to get up and running in under five minutes.

TL;DR - Gemma 4 Ollama Quick Reference
  • 6GB VRAM / 8GB RAM: ollama run gemma4:e2b - blazing fast, multimodal
  • 8GB VRAM / 16GB RAM: ollama run gemma4:e4b - best laptop pick
  • 16GB VRAM / 24GB RAM: ollama run gemma4:26b - MoE, surprisingly fast
  • 24GB VRAM / 32GB RAM: ollama run gemma4:31b - flagship, near-GPT-4 class
  • Apple Silicon 16GB+: gemma4:e4b works great; 32GB+ try gemma4:26b
4
Model sizes available on Ollama
89.2%
Gemma 4 31B on AIME 2026 Math
256K
Context window (26B & 31B)
Apache 2.0
License - free for commercial use

What is Gemma 4?

Gemma 4 is Google DeepMind's fourth generation of open-weight models, released April 2, 2026. It's built on the same research foundation as Gemini 3 - which means it punches well above its parameter count. The entire family is multimodal (text + image input), supports thinking/reasoning modes, and ships under Apache 2.0, so you can use it commercially without restrictions.

What makes it a big deal for local AI users specifically: the smaller models (E2B and E4B) are purpose-built for on-device use, with 128K context windows and extremely low VRAM requirements. The larger 26B MoE and 31B Dense models bring frontier-level intelligence to consumer hardware.

Gemma 4 vs Gemma 3: This isn't a small update. Gemma 4 31B scores 89.2% on AIME 2026 math. Gemma 3 27B scored 20.8% on the same test. On LiveCodeBench, it's 80% vs 29.1%. Different category of model entirely.

Loading products...

Gemma 4 model lineup explained

Before getting into Ollama tags, let's clear up the naming. The "E" in E2B and E4B stands for Effective parameters - these are edge-optimized models where the number reflects what's actually doing useful compute, not total parameter count. Here's what each model actually is:

Gemma 4 E2B Edge / Mobile 128K Context
VRAM: ~3GB (Q4) Download: ~3.5GB Fastest inference Text + Image + Audio

The smallest Gemma 4 model. 2.3B effective parameters (5.1B with embeddings). Designed for laptops with integrated GPUs, phones, and Raspberry Pi-class hardware. Surprisingly capable for its size - handles Q&A, summarization, basic coding, and image captioning. Native audio support is exclusive to this size and E4B.

ollama run gemma4:e2b
Gemma 4 E4B Best Laptop Pick 128K Context
VRAM: ~5.5GB (Q4) Download: ~5.5GB Very fast inference Text + Image + Audio

4.5B effective parameters (8B with embeddings). The sweet spot for everyday laptop use. Fits comfortably in 8GB VRAM or 16GB unified memory on Apple Silicon. Handles coding assistance, writing, image analysis, and extended reasoning sessions well. This is the model most Ollama users with a mainstream GPU should start with.

ollama run gemma4:e4b
Gemma 4 26B MoE Mixture of Experts 256K Context
VRAM: ~16GB (Q4) Download: ~16GB Fast for its size Text + Image

26B total parameters with only ~4B active per inference - that's the MoE advantage. The full 26B loads into memory, but each forward pass only activates a small subset of experts, making it faster than a comparable dense model. Ranked #6 open model globally on Arena AI. Excellent for complex reasoning, long documents, and coding agent workflows. Requires 24GB+ RAM (unified or system) to run comfortably.

ollama run gemma4:26b
Gemma 4 31B Dense Flagship 256K Context
VRAM: ~19GB (Q4) Download: ~19GB Moderate speed Text + Image

The full dense 31B model. Currently ranked #3 open model in the world on Arena AI's text leaderboard - above models 20x its size. Best choice when you need maximum reasoning quality and have a 24GB VRAM GPU (RTX 4090, RTX 5090) or 32GB+ Apple Silicon. AIME 2026 math: 89.2%. LiveCodeBench: 80%. GPQA Diamond: 84.3%.

ollama run gemma4:31b

All Gemma 4 Ollama tags

Here's the complete list of available Ollama tags as of late April 2026:

Ollama Tag Model Size on Disk Context Modalities
gemma4:e2b E2B (Edge 2B) ~3.5 GB 128K Text, Image, Audio
gemma4:e4b E4B (Edge 4B) ~5.5 GB 128K Text, Image, Audio
gemma4:26b 26B MoE ~16 GB 256K Text, Image
gemma4:31b 31B Dense ~19 GB 256K Text, Image
gemma4:latest Default (E4B) ~5.5 GB 128K Text, Image, Audio

Use official tags only. In the days after Gemma 4's release, some community GGUF builds had broken quantizations and tool-call failures. As of mid-April these are patched, but always pull from the official ollama.com/library/gemma4 tags or verified Unsloth builds - not random re-uploads.

VRAM requirements by GPU

Here's what you can actually run based on your hardware. VRAM figures assume Q4 quantization (Ollama's default for most GPUs) and 8K context. For longer contexts, add 1-3GB depending on the window size you're using.

GPU VRAM Best Gemma 4 Model Ollama Command Tok/sec (est.)
GTX 1660 / RTX 3050 6GB E2B OK gemma4:e2b ~45-55
RTX 4060 / RTX 3070 8GB E4B OK gemma4:e4b ~35-45
RTX 4070 / RTX 3080 12GB E4B OK gemma4:e4b ~50-60
RTX 4080 / RTX 5080 16GB 26B MoE OK gemma4:26b ~25-32
RTX 4090 / RTX 5090 24GB 31B Dense OK gemma4:31b ~20-28
Apple M1/M2/M3/M4 (16GB) 16GB unified E4B OK gemma4:e4b ~30-40
Apple M2/M3/M4 Pro (24GB+) 24GB+ unified 26B MoE OK gemma4:26b ~18-25
Apple M4 Max (64-128GB) 64-128GB unified 31B Dense OK gemma4:31b ~22-30

Apple Silicon tip: Ollama v0.19+ automatically uses Apple's MLX framework for faster inference on M-series chips. If you're seeing slower-than-expected speeds, run ollama --version and update if you're below 0.19.

Benchmark scores: Gemma 4 vs Gemma 3

The jump from Gemma 3 to Gemma 4 is one of the biggest generational leaps in the open-model space. Here are the official Google DeepMind benchmark numbers:

Benchmark Gemma 3 27B Gemma 4 31B Improvement
AIME 2026 (Math) 20.8% 89.2% +68.4 pts
LiveCodeBench v6 (Coding) 29.1% 80.0% +50.9 pts
GPQA Diamond (Science) 42.4% 84.3% +41.9 pts
Arena AI Leaderboard Rank Not ranked #3 open model -

The 26B MoE also holds its own - it ranks #6 globally on Arena AI, which is extraordinary given its hardware requirements are lower than most models ranked above it.

How to run Gemma 4 on Ollama: step-by-step

This takes about five minutes if you already have Ollama installed.

Step 1: Install Ollama

# Linux / macOS curl -fsSL https://ollama.com/install.sh | sh # Windows - download installer from ollama.com/download

Step 2: Pull your Gemma 4 model

# For most laptops (8GB VRAM / 16GB RAM) ollama pull gemma4:e4b # For budget GPUs / integrated graphics (6GB VRAM) ollama pull gemma4:e2b # For high-end GPUs (16GB+ VRAM) ollama pull gemma4:26b # For RTX 4090 / 5090 or Mac with 32GB+ unified memory ollama pull gemma4:31b

Step 3: Run and test it

# Basic text chat ollama run gemma4:e4b # Test coding ability ollama run gemma4:e4b "Write a Python function to parse JSON from a URL" # Check VRAM usage and speed ollama run gemma4:e4b --verbose

Step 4: Check it's running on GPU

# Linux/Windows - check GPU utilization nvidia-smi # macOS - check in Activity Monitor -> GPU tab # Or use: ollama ps ollama ps

VS Code integration: Install the Continue.dev extension, set provider to Ollama, and use gemma4:e4b as your local coding model. It handles multimodal input too - you can paste screenshots of errors directly into the chat.

Key Gemma 4 features worth knowing

Thinking mode (configurable reasoning)

All Gemma 4 models support configurable thinking modes - essentially a built-in chain-of-thought that you can turn on or off. For complex math or coding tasks, thinking mode is worth enabling. For simple Q&A where speed matters more, disable it.

Multimodal image input

All four models handle image input, with variable aspect ratio and resolution support. You control the visual token budget - lower budgets (70-140 tokens) for fast classification and captioning, higher budgets (560-1120 tokens) when you need fine-grained image understanding. Ollama handles this automatically, but you can configure it via the API.

Native function calling

Gemma 4 supports native function calling and structured JSON output - critical for building local AI agents. Combined with the 256K context window on the 26B and 31B models, this makes it a serious option for repository-level coding agents and autonomous workflows.

Native system prompt support

Unlike earlier Gemma generations that required workarounds, Gemma 4 uses standard system, assistant, and user roles natively. Ollama handles the chat template automatically - you don't need to configure anything.

Gemma 4 vs Qwen2.5 vs Llama 3: which should you run?

Model Best For VRAM (8B tier) Multimodal License
Gemma 4 E4B General use, image Q&A, coding ~5.5GB Text + Image + Audio Apache 2.0
Qwen2.5-Coder:7B Pure coding, Python/JS ~4.6GB Text only Apache 2.0
Llama 3.1:8B General chat, writing ~5.2GB Text only Meta License
Gemma 4 26B MoE Complex reasoning, long docs ~16GB Text + Image Apache 2.0

If your primary use case is coding, Qwen2.5-Coder still has an edge at the 7B tier - it was trained almost entirely on code. But if you want a single model that handles coding, image analysis, reasoning, and general tasks, Gemma 4 E4B is the better all-rounder. For the 26B+ tier, Gemma 4 is comfortably ahead of everything at equivalent VRAM.

Known issues and fixes

Tool calling + reasoning mode conflict: If you're using Gemma 4 with a coding agent like OpenClaw and tool calls are failing, set "reasoning": false in your model config. Reasoning mode can cause formatting issues with expected tool-call output.

Context window pressure on 16GB machines: Running gemma4:26b with a 128K+ context on a 16GB unified memory Mac can cause quality degradation as the system starts swapping. Set contextWindow: 32768 in your config if you notice slower generation or inconsistent output.

Older Ollama versions: If you're on Ollama below v0.19, Apple Silicon won't use MLX acceleration. Update with curl -fsSL https://ollama.com/install.sh | sh - it handles upgrades cleanly.


Frequently asked questions

The official tags are: gemma4:e2b, gemma4:e4b, gemma4:26b, and gemma4:31b. Use gemma4:latest to get the recommended default (currently E4B). Always pull from the official Ollama library - community re-uploads had instability issues in the first week after release.
E2B needs ~3GB VRAM, E4B ~5.5GB, 26B MoE ~16GB, and 31B Dense ~19GB - all with Q4 quantization. Add 1-3GB if you're using long context windows (32K+). On Apple Silicon, unified memory fills the VRAM role, so a 16GB M-series Mac comfortably runs E4B.
Dramatically so. Gemma 4 31B scores 80% on LiveCodeBench v6 vs Gemma 3 27B's 29.1%. Even the smaller E4B model shows significant improvements in code generation and debugging over Gemma 3's equivalent. If you've been running Gemma 3 for coding, Gemma 4 is worth the upgrade.
Yes, Ollama will fall back to CPU if no GPU is detected. The E2B model is usable on CPU (modern 8-core machine: ~3-6 tokens/sec). E4B on CPU drops to ~1-3 tokens/sec - workable for occasional queries but not for a real-time coding assistant. A GPU is strongly recommended for anything above E2B.
The "E" stands for Effective parameters - a measure of compute-efficient parameters rather than total parameter count. E2B has 2.3B effective parameters (5.1B total with embeddings); E4B has 4.5B effective (8B total). The 26B and 31B models use their actual parameter count. The E-series models are specifically optimized for edge/local deployment with lower latency and VRAM use.
Yes - all four Gemma 4 models handle image input. E2B and E4B also support audio natively. To send an image via Ollama CLI: ollama run gemma4:e4b "Describe this image" --images /path/to/image.jpg. Via the API, pass the image as base64 in the messages array.
Yes. Gemma 4 ships under the Apache 2.0 license, which allows commercial use, modification, and redistribution without paying Google. This is a notable upgrade from earlier Gemma versions and makes it one of the most permissively licensed frontier-class open models available.