If you've been searching for Qwen3-Coder Ollama tags, you're not alone. The query "qwen3-coder ollama tags 2026" has been one of the fastest-growing searches in the Ollama ecosystem this spring, and for good reason: Alibaba shipped Qwen3-Coder with little fanfare, and the Ollama library page doesn't exactly hold your hand. There are 10 available tags across two model sizes, 30B and 480B, plus the newer Next variant in its own library entry, and picking the wrong one can leave you staring at an OOM error or wasting hours on a 290GB download you didn't need. This guide covers every available tag, exact pull commands, approximate VRAM requirements per quantization, and a direct benchmark comparison against Qwen2.5-Coder so you know whether it's actually worth switching.

Quick Answer: Best Qwen3-Coder tag for Ollama (June 2026)

  • 16–20GB VRAM: ollama pull qwen3-coder:30b (19GB, 256K context, Q4_K_M default)
  • 32GB+ VRAM: ollama pull qwen3-coder:30b-a3b-q8_0 (32GB, higher accuracy)
  • Multi-GPU / datacenter: ollama pull qwen3-coder:480b (290GB, 480B total params)
  • 8GB VRAM (budget): Stick to qwen2.5-coder:7b — Qwen3-Coder doesn't have a sub-10GB tag yet.
TL;DR: Qwen3-Coder Ollama at a Glance
  • Yes, it's real: Qwen3-Coder is live on Ollama with 4.5M+ downloads. The official library tag is qwen3-coder.
  • Two main sizes: 30B (MoE, 3B active params, 19GB download) and 480B (MoE, 35B active, 290GB). The 30B is the practical local choice.
  • Context window: 256K tokens natively on all tags — a major leap over Qwen2.5-Coder's 128K.
  • vs Qwen2.5-Coder: Qwen3-Coder-Next beats it on SWE-bench Verified (70%+ vs ~69.6%). For pure HumanEval on 8GB hardware, Qwen2.5-Coder:7B is still king.
  • Minimum hardware: RTX 4080 or RTX 5080 (16GB VRAM) for the 30B Q4_K_M tag, with part of the 19GB model offloaded to system RAM. Not suitable for 8GB cards.

VRAM values assume Q4_K_M quantization and 8K context. 256K context adds ~8–12GB VRAM for KV cache.


Does Qwen3-Coder actually exist on Ollama?

Yes, and it's been there since late 2025. The confusion comes from the naming gap: people search for qwen3-coder ollama and land on the Qwen2.5-Coder page, or find unofficial community uploads, and wonder if the real thing exists. It does. The official library entry is ollama.com/library/qwen3-coder and it has over 4.5 million pulls as of June 2026, which makes it a very active model.

Qwen3-Coder is Alibaba's dedicated agentic coding model family, built on top of the Qwen3 architecture. The "3" in the name refers to the third generation of Qwen models, not the parameter count. The key architectural features that separate it from Qwen2.5-Coder are the Mixture-of-Experts (MoE) design — which activates only a fraction of parameters per token — a native 256K token context window, and training data that skews heavily toward multi-turn agentic workflows and tool-calling rather than pure code completion.

Why people miss it: Searching "qwen3 coder" (with a space) in Ollama returns different results than "qwen3-coder" (hyphenated). The official tag is hyphenated. Also note that qwen3-coder-next is a separate, slightly newer variant — more on that below.

  • 4.5M+ Ollama pulls (June 2026)
  • 256K native context window (tokens)
  • 10 available tags on the Ollama library
  • 19GB download size for the 30B model (Q4_K_M)

All Available Qwen3-Coder Ollama Tags (Full List, June 2026)

There are currently 10 official tags on the Qwen3-Coder library page. They fall into two size families — 30B and 480B — each available in multiple quantizations. Here's the complete picture:

Ollama Tag | Model | Download Size | Context | Notes
qwen3-coder:latest, qwen3-coder:30b | 30B-A3B (MoE) | 19GB | 256K | Default. Q4_K_M. Same file: both tags point to 06c1097efce0. ← Start here
qwen3-coder:30b-a3b-q4_K_M | 30B-A3B (MoE) | 19GB | 256K | Explicit Q4_K_M. Identical to :30b tag. Use this if you want to be precise.
qwen3-coder:30b-a3b-q8_0 | 30B-A3B (MoE) | 32GB | 256K | Higher accuracy, ~2% quality gain over Q4. Needs 32GB+ VRAM or 32GB RAM + GPU offload.
qwen3-coder:30b-a3b-fp16 | 30B-A3B (MoE) | 61GB | 256K | Full precision. For researchers. Requires multi-GPU or Apple M3/M4 Max with 64GB+ unified memory.
qwen3-coder:480b, qwen3-coder:480b-a35b-q4_K_M | 480B-A35B (MoE) | 290GB | 256K | Datacenter-class. 35B active params. Not realistic on consumer hardware; requires NVLink multi-GPU.
qwen3-coder:480b-a35b-q8_0 | 480B-A35B (MoE) | 510GB | 256K | Maximum quality 480B. Practically only for cloud inference or >512GB RAM server setups.
qwen3-coder:480b-a35b-fp16 | 480B-A35B (MoE) | 960GB | 256K | Full precision 480B. ~1TB download. H100 cluster territory.
qwen3-coder:480b-cloud | 480B-A35B (cloud) | No local size | 256K | Cloud-routed tag. Runs via Ollama's hosted inference, no local GPU required. Rate limits apply.

Tag data sourced directly from ollama.com/library/qwen3-coder/tags, June 2, 2026. The 30B and latest tags share the same underlying file hash (06c1097efce0).
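If you'd rather filter the tag list programmatically, here is a minimal sketch. The sizes are copied from the table above; the helper names (tags_that_fit, largest_fit) and the 1GB headroom default are my own assumptions for illustration, not anything Ollama ships.

```python
# Download sizes in GB, copied from the tag table in this article.
# The cloud-routed tag is omitted because it has no local size.
TAG_SIZES_GB = {
    "qwen3-coder:latest": 19,
    "qwen3-coder:30b": 19,
    "qwen3-coder:30b-a3b-q4_K_M": 19,
    "qwen3-coder:30b-a3b-q8_0": 32,
    "qwen3-coder:30b-a3b-fp16": 61,
    "qwen3-coder:480b": 290,
    "qwen3-coder:480b-a35b-q4_K_M": 290,
    "qwen3-coder:480b-a35b-q8_0": 510,
    "qwen3-coder:480b-a35b-fp16": 960,
}


def tags_that_fit(budget_gb, headroom_gb=1.0):
    """Tags whose weights fit in budget_gb with headroom_gb left for context.

    headroom_gb=1.0 is an assumption based on the ~1GB the article
    attributes to an 8K context window.
    """
    return [tag for tag, size in TAG_SIZES_GB.items()
            if size + headroom_gb <= budget_gb]


def largest_fit(budget_gb, headroom_gb=1.0):
    """Largest tag that fits the budget, or None if nothing fits."""
    fits = tags_that_fit(budget_gb, headroom_gb)
    return max(fits, key=TAG_SIZES_GB.get) if fits else None


print(largest_fit(48))  # memory budget of a 48GB setup
```

The budget can be VRAM alone or VRAM plus the system RAM you are willing to use for offload; the sketch only compares sizes and says nothing about speed.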

Short answer to "what is the best Qwen3-Coder tag for Ollama?": For consumer GPUs, use qwen3-coder:30b. It's Q4_K_M by default, downloads 19GB, and runs on RTX 4080/5080 (16GB VRAM). Everything above 30B is datacenter territory unless you have an M4 Max or multi-GPU rig.

VRAM Requirements per Tag

The MoE architecture of Qwen3-Coder means the numbers here behave differently from dense models like Qwen2.5-Coder. All 30B parameters are loaded into memory (because MoE still loads all expert weights), but only 3B activate per token during inference. This gives you faster generation speeds than a typical 30B dense model, but the full model weight still needs to fit somewhere — VRAM or system RAM via CPU offloading.
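To make the "all weights loaded, few weights active" point concrete, here is a rough back-of-the-envelope sketch. It uses only the download sizes quoted in this guide and treats the model as exactly 30B parameters, which is an approximation; the function name is hypothetical.

```python
def effective_bits_per_weight(download_gb, total_params_billion):
    """Average bits stored per parameter, treating 1GB as 1e9 bytes."""
    return download_gb * 8 / total_params_billion


# Download sizes for the 30B tags, as listed in this article.
for label, size_gb in [("q4_K_M", 19), ("q8_0", 32), ("fp16", 61)]:
    bits = effective_bits_per_weight(size_gb, 30)
    print(f"{label}: ~{bits:.1f} bits/weight, all 30B parameters resident")

# Memory tracks TOTAL parameters; per-token compute tracks ACTIVE ones.
ACTIVE_FRACTION = 3 / 30  # 3B active out of 30B total, per the article
print(f"Active per token: {ACTIVE_FRACTION:.0%} of the weights")
```

The takeaway is that quantization changes how much memory the full set of experts needs, while the 3B active figure only affects how fast tokens come out.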

Tag | GPU VRAM (Inference) | For 8K Context | For 256K Context | Recommended GPU
:30b / :30b-a3b-q4_K_M | ~19GB | ~20GB total | ~30–32GB (KV cache grows) | RTX 4080 16GB (tight) → RTX 5080 16GB or better
:30b-a3b-q8_0 | ~32GB | ~33GB total | ~43GB+ | RTX 4090 24GB (with offload) or dual RTX 5080
:30b-a3b-fp16 | ~61GB | ~62GB | ~72GB+ | M4 Max 64GB unified / dual RTX 4090 / A100 80GB
:480b / :480b-a35b-q4_K_M | ~290GB | ~292GB | ~302GB+ | 4× H100 80GB or equivalent
:480b-cloud | 0 (cloud-routed) | N/A | N/A | Any machine with internet connection

Context window VRAM overhead assumes FP16 KV cache (default). Use OLLAMA_KV_CACHE_TYPE=q8_0 to roughly halve the context overhead — this is important if you want to use the full 256K window on consumer hardware.
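A quick way to reason about context overhead is to scale it linearly with the number of tokens, since the KV cache stores a fixed amount per token. The sketch below does that using this article's own figures as inputs: the 10GB default is simply the midpoint of the ~8–12GB range quoted for 256K, and the 0.5 factor for q8_0 is the "roughly halve" rule of thumb. Both are assumptions, not measurements.

```python
FULL_CONTEXT_TOKENS = 256 * 1024  # the 256K window

# Rule-of-thumb scale factors relative to an FP16 cache (assumed).
CACHE_SCALE = {"f16": 1.0, "q8_0": 0.5}


def kv_overhead_gb(context_tokens, full_overhead_gb=10.0, cache_type="f16"):
    """Estimate KV-cache memory in GB for a given context length.

    full_overhead_gb is the overhead at 256K with an FP16 cache; the
    default of 10 is the midpoint of the article's ~8-12GB estimate.
    """
    fraction = context_tokens / FULL_CONTEXT_TOKENS
    return full_overhead_gb * CACHE_SCALE[cache_type] * fraction


for ctx in (8 * 1024, 64 * 1024, 128 * 1024, 256 * 1024):
    f16 = kv_overhead_gb(ctx)
    q8 = kv_overhead_gb(ctx, cache_type="q8_0")
    print(f"{ctx // 1024:>3}K context: ~{f16:.1f}GB (f16), ~{q8:.1f}GB (q8_0)")
```

Real numbers depend on the model's layer count and attention layout, so treat the output as a planning estimate and confirm with nvidia-smi once the model is loaded.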

16GB RAM laptops: If you are running a laptop with only 16GB of unified memory or 16GB of system RAM, the Qwen3-Coder 30B tag will be too slow because of aggressive swapping. On that hardware, the best Ollama models for coding are dense 7B–9B models like qwen2.5-coder:7b or gemma2:9b.

RTX 4080 16GB users, read this first: The :30b tag weighs 19GB, which is more than your 16GB of VRAM, so Ollama will automatically offload layers to system RAM. You'll get 3–8 tok/s with 64GB+ DDR5 RAM, which is usable for testing but not great for daily agentic coding. The sweet spot for the 30B tag is 20–24GB of VRAM (RTX 4090, A5000) or an Apple M-series machine with 48GB+ unified memory.
Pro tip — use Q8 KV cache to unlock the 256K context window: Set OLLAMA_KV_CACHE_TYPE=q8_0 before launching Ollama. This compresses the KV cache from FP16 to INT8, cutting context overhead roughly in half. For the 30B tag on an RTX 4090, this means you can push to ~100K context without OOM instead of being limited to ~30–40K. Add it to your .bashrc: export OLLAMA_KV_CACHE_TYPE=q8_0
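To see why a 16GB card ends up offloading, here is a deliberately naive sketch of the split. It assumes everything that fits goes to the GPU and the rest spills to system RAM; Ollama's real scheduler works per layer and reserves some VRAM for itself, so the actual spill will usually be a bit larger. The function name is hypothetical.

```python
def offload_split(model_gb, kv_gb, vram_gb):
    """Return (gb_on_gpu, gb_spilled_to_ram) under a naive fill-the-GPU model."""
    needed = model_gb + kv_gb
    on_gpu = min(needed, vram_gb)
    return on_gpu, needed - on_gpu


# 19GB of weights plus ~1GB for an 8K context, per the figures above.
for card, vram in [("16GB card", 16), ("24GB card", 24)]:
    gpu, spilled = offload_split(19, 1, vram)
    print(f"{card}: ~{gpu}GB on GPU, ~{spilled}GB in system RAM")
```

Anything in the second number is served from system RAM, which is where the single-digit tok/s figures come from.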

Exact Ollama Pull Commands for Every Tag

Here are the copy-paste ready commands. The model name format is always qwen3-coder:<tag>.

For most users (RTX 4080/5080, 16–20GB VRAM)

ollama pull qwen3-coder:30b

Downloads 19GB. This is the :latest tag and the recommended starting point.

Higher accuracy (RTX 4090 or 32GB+ VRAM)

ollama pull qwen3-coder:30b-a3b-q8_0

Downloads 32GB. ~2% quality improvement over Q4_K_M. Worth it if you have the VRAM headroom.

Full precision research (M4 Max 64GB / dual 4090)

ollama pull qwen3-coder:30b-a3b-fp16

Downloads 61GB. Maximum fidelity to the original model weights. For benchmarking or fine-tune prep.

Cloud-routed 480B (no GPU required)

ollama pull qwen3-coder:480b-cloud

Routes to Ollama's hosted 480B inference. Subject to rate limits and internet dependency.

Run after pulling

ollama run qwen3-coder:30b
ollama run qwen3-coder:30b "Write a Python async web scraper with error handling"
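If you want to call the model from a script instead of the CLI, Ollama exposes a local HTTP API. The sketch below targets the /api/generate endpoint on the default port 11434 and sets the context length through the num_ctx option. The helper names and the 8192 default are my own choices for illustration, and it assumes the Ollama server is already running with the model pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(prompt, model="qwen3-coder:30b", num_ctx=8192):
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not chunks
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running server):
# print(generate("Write a Python async web scraper with error handling"))
```

Raising num_ctx increases KV cache memory, so start small and step up while watching VRAM.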

Check VRAM usage after loading

ollama run qwen3-coder:30b --verbose
# Linux GPU stats:
nvidia-smi
# macOS unified memory:
# Activity Monitor → Memory tab → GPU memory
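You can also ask Ollama itself how much of a loaded model landed in VRAM. The sketch below reads the /api/ps endpoint (the API counterpart of ollama ps) and compares the size and size_vram fields it reports. The helper names are hypothetical, and it assumes a server on the default port with at least one model loaded.

```python
import json
import urllib.request


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from the local server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.loads(resp.read()).get("models", [])


def gpu_fraction(entry):
    """Share of a loaded model that is resident in VRAM (1.0 = fully on GPU)."""
    size = entry.get("size", 0)
    return entry.get("size_vram", 0) / size if size else 0.0


def report(models):
    """Print one line per loaded model with its GPU residency."""
    for m in models:
        print(f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM")


# Example (needs a running server with a model loaded):
# report(loaded_models())
```

A value below 100% means part of the model is being served from system RAM, which usually explains slow generation.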

Qwen3-Coder-Next: The Hidden Gem

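Separate from the main qwen3-coder library, there's also qwen3-coder-next on Ollama: a distinct model, not just a tag variant. Qwen3-Coder-Next is built on the Qwen3-Next-80B-A3B-Base architecture with hybrid attention and MoE. Despite having only 3 billion active parameters per token (out of 80B total), it was trained specifically for agentic use, at scale, with reinforcement learning on verifiable coding tasks. As with the 30B model, the 3B figure describes compute per token only: all 80B weights still have to sit in VRAM or system RAM, so check the download size on its library page against your hardware before pulling.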

The headline stat: Qwen3-Coder-Next achieves over 70% on SWE-Bench Verified using the SWE-Agent scaffold, which is genuinely competitive with larger proprietary models. For context, a full 480B Qwen3-Coder reaches similar territory — the Next variant essentially delivers frontier agentic coding performance at a fraction of the active compute cost. On SWE-bench Pro (the harder benchmark using longer multi-turn agentic tasks), it outperforms or matches several models with 10–20× more active parameters.

ollama pull qwen3-coder-next

Qwen3-Coder-Next vs Qwen3-Coder:30b — which should you use? If your primary use case is multi-turn agentic coding (automated bug fixing, repository-level refactoring, tool-calling workflows), Qwen3-Coder-Next is the better pick. Its RL-based agentic training means it handles multi-step planning and tool feedback loops far better than the standard 30B model, which was tuned more for code completion and direct instruction-following.

Qwen3-Coder vs Qwen2.5-Coder: Which to Use in 2026?

The question I see most in search queries: should I switch from Qwen2.5-Coder to Qwen3-Coder on Ollama? The honest answer depends on your VRAM and use case. Here's the direct comparison:

Factor | Qwen2.5-Coder:7B | Qwen2.5-Coder:32B | Qwen3-Coder:30B | Qwen3-Coder-Next
Min VRAM | ~4.6GB | ~19GB | ~19GB | ~20GB
Context window | 128K | 128K | 256K | 256K
HumanEval (official) | 88.4% | ~92.7% | Not separately published | Competitive (focus on SWE)
SWE-bench Verified | ~50% (community) | ~69.6% | Matches 32B range | 70%+
Architecture | Dense | Dense | MoE (3B active) | MoE hybrid (3B active)
Best for | 8GB GPUs, daily completions | Accuracy-first, 20GB VRAM | Long-context (>128K), agentic | Multi-turn agents, SWE tasks
Ollama pull command | qwen2.5-coder:7b | qwen2.5-coder:32b | qwen3-coder:30b | qwen3-coder-next
Bottom line: If you have 8GB VRAM, Qwen2.5-Coder:7B is still the right call — Qwen3-Coder has no sub-10GB tag. If you have 16–20GB VRAM and need >128K context or multi-turn agentic workflows, Qwen3-Coder:30b or Qwen3-Coder-Next is the clear upgrade. For pure HumanEval completion tasks at the 20GB tier, Qwen2.5-Coder:32B (dense) and Qwen3-Coder:30B (MoE) are approximately neck-and-neck on code quality; the Qwen3 advantage shows in agentic tasks and long context.

Best GGUF Coding Model 2026: Qwen vs DeepSeek vs CodeGemma 14B

If you are looking for the best GGUF coding model in 2026 among Qwen, DeepSeek, and CodeGemma, the landscape is competitive. Here is how Qwen3-Coder fits in:

  • Qwen3-Coder (30B MoE): The top choice for context length (256K) and complex agentic tasks. Highly recommended for multi-file generation.
  • DeepSeek-Coder V2: Excellent for logic and math, but often requires more VRAM for its larger mixture-of-experts setups unless heavily quantized.
  • CodeGemma 14B: A highly efficient dense model that is great for standard code completion tasks, but falls short of Qwen's MoE architecture in complex refactoring.

Clarification: Ollama Qwen3.5 Models List 2026

A common search term we see is "ollama qwen3.5 models list 2026". Note that as of June 2026, Qwen3.5 does not exist yet. Users searching for this are likely confusing the older Qwen2.5 series with the latest Qwen3 models detailed on this page. If you want the latest Qwen coding model on Ollama in 2026, stick to the Qwen3-Coder tags listed above.

Benchmarks: Qwen3-Coder vs the Field

Benchmark sources: HumanEval figures from official Qwen model card releases. SWE-bench Verified scores from published technical reports (Qwen3-Coder-Next technical report, arXiv 2603.00729). SWE-bench Pro results from the same source. Community comparisons on third-party benchmarking platforms may vary based on prompt format and scaffold used.

Model (Ollama tag) | HumanEval ↑ | SWE-bench Verified ↑ | Context | VRAM (Q4)
Qwen3-Coder-Next | Strong (see SWE) | 70%+ (SWE-Agent scaffold) | 256K | ~20GB
Qwen2.5-Coder:32B | 92.7% (official) | ~69.6% (community) | 128K | ~19GB
Qwen3-Coder:30B | Matches 32B range | Competitive with 32B | 256K | ~19GB
Qwen2.5-Coder:14B | ~87.3% (official) | ~59% (community) | 128K | ~8.5GB
Qwen2.5-Coder:7B | 88.4% (official) | ~50% (community) | 128K | ~4.6GB
DeepSeek-Coder-V2-Lite:16B | 90.2% (official) | ~56% (community) | 64K | ~9.2GB

Benchmark conditions vary across evaluations. HumanEval pass@1 is 0-shot. SWE-bench Verified uses the SWE-Agent scaffold unless otherwise noted. Community-reported scores may differ from official model card figures based on prompt formatting.

The key takeaway from these numbers: Qwen3-Coder's biggest lead over Qwen2.5-Coder isn't on HumanEval (where the 32B dense model still holds its own), but on the agentic SWE-bench benchmarks that measure real-world code repair across multi-file repositories. If your workflow involves answering "why is this GitHub issue failing?" rather than "complete this function," Qwen3-Coder's training focus pays off.

Setup Guide: Run Qwen3-Coder in 5 Minutes

This guide assumes you already have Ollama installed (version 0.5+ recommended). If you don't, step 1 covers that.

  1. Install Ollama (skip if already installed):
    curl -fsSL https://ollama.com/install.sh | sh
  2. Set the KV cache env var before anything else:
    export OLLAMA_KV_CACHE_TYPE=q8_0

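    This halves context memory overhead, which is critical for 256K context on consumer hardware. Add it to .bashrc or .zshrc to make it permanent. Ollama's FAQ notes that KV cache quantization only applies when flash attention is enabled, so also export OLLAMA_FLASH_ATTENTION=1 if your version does not enable it by default.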

  3. Pull the 30B model (recommended for 16–20GB VRAM):
    ollama pull qwen3-coder:30b

    19GB download. Allow 10–30 minutes depending on your internet speed.

  4. Run a quick test:
    ollama run qwen3-coder:30b "Write a Python async function that retries HTTP requests with exponential backoff"
  5. Connect to VS Code via Continue.dev:

    Install Continue extension → Settings → Provider: Ollama → Model: qwen3-coder:30b → Set context length: 32768 (or higher if VRAM allows).

  6. Verify memory usage:
    nvidia-smi  # Linux
    # macOS: Activity Monitor → Memory → GPU Memory Used
Pro tip for agentic workflows: Pair Qwen3-Coder:30B with Aider for repository-level coding. Qwen3-Coder was explicitly trained on agentic data and handles Aider's multi-file edit format well. Use: aider --model ollama/qwen3-coder:30b

Decision Tree: Which Qwen3-Coder Tag for Your Hardware?

  • You have 8GB VRAM or less: Qwen3-Coder is not for you yet. Use qwen2.5-coder:7b — it's still excellent at 88.4% HumanEval on 4.6GB VRAM.
  • You have 16GB VRAM (RTX 4080, RTX 5080): qwen3-coder:30b with CPU offload. Set OLLAMA_KV_CACHE_TYPE=q8_0 and keep context under 64K for smooth generation.
  • You have 20–24GB VRAM (RTX 4090, A5000): qwen3-coder:30b fits fully in VRAM. With OLLAMA_KV_CACHE_TYPE=q8_0 you can push to roughly 100K context.
  • You have 32GB+ VRAM (dual 3090 / A6000 / M4 Max 64GB): qwen3-coder:30b-a3b-q8_0 — the quality bump is worth it at this tier.
  • You prioritize agentic multi-turn tasks over raw code completion: qwen3-coder-next — same VRAM footprint as 30B, better SWE-bench scores.
  • You need 480B without buying GPUs: qwen3-coder:480b-cloud — cloud-routed, no download, rate-limited.
  • You need maximum HumanEval scores at 20GB VRAM: qwen2.5-coder:32b (dense) is still competitive here — Qwen3-Coder's wins are on agentic/SWE benchmarks, not raw pass@1.
🛠️ Recommended VS Code setup (20GB VRAM): Use qwen3-coder-next as your chat/agentic model via Continue.dev, and keep qwen2.5-coder:7b running in parallel as the autocomplete model. You get the agentic quality of the Next variant for complex tasks while the 7B handles fast inline completions without latency.
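The decision tree above can be sketched as a small function. This is only an illustration of the article's recommendations: the function name and flags are hypothetical, the 16GB, 20GB, and 32GB cut-offs come from the bullets, and treating anything under 16GB like the 8GB case is my own assumption since the tree does not cover that gap.

```python
def pick_tag(vram_gb, agentic=False, cloud_ok=False, max_humaneval=False):
    """Map hardware and priorities to a tag, following the bullets above."""
    if cloud_ok:
        # 480B without buying GPUs: cloud-routed and rate-limited.
        return "qwen3-coder:480b-cloud"
    if vram_gb < 16:
        # Assumption: cards between 8GB and 16GB get the same advice as 8GB.
        return "qwen2.5-coder:7b"
    if max_humaneval and vram_gb >= 20:
        # Dense 32B is still competitive on raw pass@1.
        return "qwen2.5-coder:32b"
    if agentic and vram_gb >= 20:
        # Per the article's claim of a 30B-like footprint for the Next variant.
        return "qwen3-coder-next"
    if vram_gb >= 32:
        return "qwen3-coder:30b-a3b-q8_0"
    # 16-24GB: the default 30B tag (with CPU offload at 16GB).
    return "qwen3-coder:30b"


for vram in (8, 16, 24, 48):
    print(f"{vram}GB VRAM -> ollama pull {pick_tag(vram)}")
```

The order of the checks is a design choice: explicit priorities (cloud, HumanEval, agentic) win over the pure VRAM tiers, which matches how the bullets read.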

Frequently Asked Questions