You've probably seen the queries flying around - "gemma4 ollama tags 2026", "gemma 4 ollama available models", "how much VRAM does gemma 4 need". Google dropped Gemma 4 quietly on April 2nd and it's already one of the most downloaded model families on Ollama. The problem? The model naming is confusing (what even is E2B vs E4B?), and nobody's written a clean breakdown of what runs on what hardware. This is that guide.
I'll cover every available Ollama tag, the exact VRAM you need for each, real benchmark numbers, and how to get up and running in under five minutes.
- 6GB VRAM / 8GB RAM:
ollama run gemma4:e2b- blazing fast, multimodal - 8GB VRAM / 16GB RAM:
ollama run gemma4:e4b- best laptop pick - 16GB VRAM / 24GB RAM:
ollama run gemma4:26b- MoE, surprisingly fast - 24GB VRAM / 32GB RAM:
ollama run gemma4:31b- flagship, near-GPT-4 class - Apple Silicon 16GB+:
gemma4:e4bworks great; 32GB+ trygemma4:26b
What is Gemma 4?
Gemma 4 is Google DeepMind's fourth generation of open-weight models, released April 2, 2026. It's built on the same research foundation as Gemini 3 - which means it punches well above its parameter count. The entire family is multimodal (text + image input), supports thinking/reasoning modes, and ships under Apache 2.0, so you can use it commercially without restrictions.
What makes it a big deal for local AI users specifically: the smaller models (E2B and E4B) are purpose-built for on-device use, with 128K context windows and extremely low VRAM requirements. The larger 26B MoE and 31B Dense models bring frontier-level intelligence to consumer hardware.
Gemma 4 vs Gemma 3: This isn't a small update. Gemma 4 31B scores 89.2% on AIME 2026 math. Gemma 3 27B scored 20.8% on the same test. On LiveCodeBench, it's 80% vs 29.1%. Different category of model entirely.
Gemma 4 model lineup explained
Before getting into Ollama tags, let's clear up the naming. The "E" in E2B and E4B stands for Effective parameters - these are edge-optimized models where the number reflects what's actually doing useful compute, not total parameter count. Here's what each model actually is:
The smallest Gemma 4 model. 2.3B effective parameters (5.1B with embeddings). Designed for laptops with integrated GPUs, phones, and Raspberry Pi-class hardware. Surprisingly capable for its size - handles Q&A, summarization, basic coding, and image captioning. Native audio support is exclusive to this size and E4B.
ollama run gemma4:e2b
4.5B effective parameters (8B with embeddings). The sweet spot for everyday laptop use. Fits comfortably in 8GB VRAM or 16GB unified memory on Apple Silicon. Handles coding assistance, writing, image analysis, and extended reasoning sessions well. This is the model most Ollama users with a mainstream GPU should start with.
ollama run gemma4:e4b
26B total parameters with only ~4B active per inference - that's the MoE advantage. The full 26B loads into memory, but each forward pass only activates a small subset of experts, making it faster than a comparable dense model. Ranked #6 open model globally on Arena AI. Excellent for complex reasoning, long documents, and coding agent workflows. Requires 24GB+ RAM (unified or system) to run comfortably.
ollama run gemma4:26b
The full dense 31B model. Currently ranked #3 open model in the world on Arena AI's text leaderboard - above models 20x its size. Best choice when you need maximum reasoning quality and have a 24GB VRAM GPU (RTX 4090, RTX 5090) or 32GB+ Apple Silicon. AIME 2026 math: 89.2%. LiveCodeBench: 80%. GPQA Diamond: 84.3%.
ollama run gemma4:31b
All Gemma 4 Ollama tags
Here's the complete list of available Ollama tags as of late April 2026:
| Ollama Tag | Model | Size on Disk | Context | Modalities |
|---|---|---|---|---|
gemma4:e2b |
E2B (Edge 2B) | ~3.5 GB | 128K | Text, Image, Audio |
gemma4:e4b |
E4B (Edge 4B) | ~5.5 GB | 128K | Text, Image, Audio |
gemma4:26b |
26B MoE | ~16 GB | 256K | Text, Image |
gemma4:31b |
31B Dense | ~19 GB | 256K | Text, Image |
gemma4:latest |
Default (E4B) | ~5.5 GB | 128K | Text, Image, Audio |
Use official tags only. In the days after Gemma 4's release, some community GGUF builds had broken quantizations and tool-call failures. As of mid-April these are patched, but always pull from the official ollama.com/library/gemma4 tags or verified Unsloth builds - not random re-uploads.
VRAM requirements by GPU
Here's what you can actually run based on your hardware. VRAM figures assume Q4 quantization (Ollama's default for most GPUs) and 8K context. For longer contexts, add 1-3GB depending on the window size you're using.
| GPU | VRAM | Best Gemma 4 Model | Ollama Command | Tok/sec (est.) |
|---|---|---|---|---|
| GTX 1660 / RTX 3050 | 6GB | E2B OK | gemma4:e2b |
~45-55 |
| RTX 4060 / RTX 3070 | 8GB | E4B OK | gemma4:e4b |
~35-45 |
| RTX 4070 / RTX 3080 | 12GB | E4B OK | gemma4:e4b |
~50-60 |
| RTX 4080 / RTX 5080 | 16GB | 26B MoE OK | gemma4:26b |
~25-32 |
| RTX 4090 / RTX 5090 | 24GB | 31B Dense OK | gemma4:31b |
~20-28 |
| Apple M1/M2/M3/M4 (16GB) | 16GB unified | E4B OK | gemma4:e4b |
~30-40 |
| Apple M2/M3/M4 Pro (24GB+) | 24GB+ unified | 26B MoE OK | gemma4:26b |
~18-25 |
| Apple M4 Max (64-128GB) | 64-128GB unified | 31B Dense OK | gemma4:31b |
~22-30 |
Apple Silicon tip: Ollama v0.19+ automatically uses Apple's MLX framework for faster inference on M-series chips. If you're seeing slower-than-expected speeds, run ollama --version and update if you're below 0.19.
Benchmark scores: Gemma 4 vs Gemma 3
The jump from Gemma 3 to Gemma 4 is one of the biggest generational leaps in the open-model space. Here are the official Google DeepMind benchmark numbers:
| Benchmark | Gemma 3 27B | Gemma 4 31B | Improvement |
|---|---|---|---|
| AIME 2026 (Math) | 20.8% | 89.2% | +68.4 pts |
| LiveCodeBench v6 (Coding) | 29.1% | 80.0% | +50.9 pts |
| GPQA Diamond (Science) | 42.4% | 84.3% | +41.9 pts |
| Arena AI Leaderboard Rank | Not ranked | #3 open model | - |
The 26B MoE also holds its own - it ranks #6 globally on Arena AI, which is extraordinary given its hardware requirements are lower than most models ranked above it.
How to run Gemma 4 on Ollama: step-by-step
This takes about five minutes if you already have Ollama installed.
Step 1: Install Ollama
Step 2: Pull your Gemma 4 model
Step 3: Run and test it
Step 4: Check it's running on GPU
VS Code integration: Install the Continue.dev extension, set provider to Ollama, and use gemma4:e4b as your local coding model. It handles multimodal input too - you can paste screenshots of errors directly into the chat.
Key Gemma 4 features worth knowing
Thinking mode (configurable reasoning)
All Gemma 4 models support configurable thinking modes - essentially a built-in chain-of-thought that you can turn on or off. For complex math or coding tasks, thinking mode is worth enabling. For simple Q&A where speed matters more, disable it.
Multimodal image input
All four models handle image input, with variable aspect ratio and resolution support. You control the visual token budget - lower budgets (70-140 tokens) for fast classification and captioning, higher budgets (560-1120 tokens) when you need fine-grained image understanding. Ollama handles this automatically, but you can configure it via the API.
Native function calling
Gemma 4 supports native function calling and structured JSON output - critical for building local AI agents. Combined with the 256K context window on the 26B and 31B models, this makes it a serious option for repository-level coding agents and autonomous workflows.
Native system prompt support
Unlike earlier Gemma generations that required workarounds, Gemma 4 uses standard system, assistant, and user roles natively. Ollama handles the chat template automatically - you don't need to configure anything.
Gemma 4 vs Qwen2.5 vs Llama 3: which should you run?
| Model | Best For | VRAM (8B tier) | Multimodal | License |
|---|---|---|---|---|
| Gemma 4 E4B | General use, image Q&A, coding | ~5.5GB | Text + Image + Audio | Apache 2.0 |
| Qwen2.5-Coder:7B | Pure coding, Python/JS | ~4.6GB | Text only | Apache 2.0 |
| Llama 3.1:8B | General chat, writing | ~5.2GB | Text only | Meta License |
| Gemma 4 26B MoE | Complex reasoning, long docs | ~16GB | Text + Image | Apache 2.0 |
If your primary use case is coding, Qwen2.5-Coder still has an edge at the 7B tier - it was trained almost entirely on code. But if you want a single model that handles coding, image analysis, reasoning, and general tasks, Gemma 4 E4B is the better all-rounder. For the 26B+ tier, Gemma 4 is comfortably ahead of everything at equivalent VRAM.
Known issues and fixes
Tool calling + reasoning mode conflict: If you're using Gemma 4 with a coding agent like OpenClaw and tool calls are failing, set "reasoning": false in your model config. Reasoning mode can cause formatting issues with expected tool-call output.
Context window pressure on 16GB machines: Running gemma4:26b with a 128K+ context on a 16GB unified memory Mac can cause quality degradation as the system starts swapping. Set contextWindow: 32768 in your config if you notice slower generation or inconsistent output.
Older Ollama versions: If you're on Ollama below v0.19, Apple Silicon won't use MLX acceleration. Update with curl -fsSL https://ollama.com/install.sh | sh - it handles upgrades cleanly.
Frequently asked questions
gemma4:e2b, gemma4:e4b, gemma4:26b, and gemma4:31b. Use gemma4:latest to get the recommended default (currently E4B). Always pull from the official Ollama library - community re-uploads had instability issues in the first week after release.
ollama run gemma4:e4b "Describe this image" --images /path/to/image.jpg. Via the API, pass the image as base64 in the messages array.