May 13, 2026Updated July 14, 20268 min read

GGUF Q4 vs Q5 vs Q8: which quant should you pick?

Pick the largest GGUF quant that fits your model, context length, and KV cache with VRAM headroom. Q4 is the fit-first choice, Q5 or Q6 is the balance point, and Q8 is for quality when memory is available.

Compare Q4_K_M vs Q5_K_M vs Q8 GGUF: VRAM fit, quality, speed, and when a smaller quant beats a bigger one that falls back to CPU.

#llama-cpp #gguf #quantization #local-ai #gpu #llm

Share LinkedIn

TL;DR

Q4_K_M is the best balance for 8 GB GPUs: a 7B model fits in about 4.4 GB at roughly 96.5% of baseline quality. Q5_K_M is the sweet spot at 12 GB, Q6_K or Q8_0 at 16 GB and up.
K_M variants spend more bits on the attention layers that matter most for quality. K_M is almost always worth the small size increase over K_S.
A smaller quant that stays fully on GPU usually beats a bigger quant that spills layers to CPU.

Get the weekly 5090 Reports

Related guides

If you're running local LLMs with llama.cpp, Ollama, or LM Studio, you've seen the alphabet soup: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ4_XS. Each one trades model size against output quality, and picking wrong either wastes your VRAM or tanks your results.

This guide cuts through the noise. We benchmarked every common quantization level and measured the actual accuracy tradeoffs so you can pick the right one for your hardware.

What Is GGUF Quantization?

A full-precision LLM stores every weight as a 16-bit floating point number (FP16). A 7B parameter model at FP16 weighs ~14 GB. Most consumer GPUs can't fit that alongside the KV cache needed for inference.

Quantization compresses those weights to lower precision - 8-bit, 5-bit, even 4-bit - dramatically shrinking the model. A Q4_K_M version of that same 7B model fits in ~4.4 GB, making it runnable on an 8 GB GPU with room for context.

The "GGUF" part is just the file format. It packages the model weights, tokenizer, and all metadata into a single file that llama.cpp can load directly. If you need help configuring GPU layers for optimal performance, start with our --n-gpu-layers guide.

The Quantization Levels, Ranked

Quant	Size (7B)	Quality	Speed	Best For
F16	~14 GB	100% (baseline)	Slowest	Research, accuracy-critical
Q8_0	~7.7 GB	~99.5%	Fast	When you have the VRAM for it
Q6_K	~5.9 GB	~99%	Fast	Quality-first with moderate VRAM

| Q5_K_M | ~5.1 GB | ~98% | Faster | Sweet spot for 12+ GB GPUs | | Q5_K_S | ~4.8 GB | ~97.5% | Faster | Slightly smaller Q5 | | Q4_K_M | ~4.4 GB | ~96.5% | Fastest | Best balance for 8 GB GPUs | | Q4_K_S | ~4.1 GB | ~95.5% | Fastest | When every MB counts | | Q3_K_M | ~3.5 GB | ~92% | Fastest | Not recommended for production |

| Q2_K | ~2.8 GB | ~85% | Fastest | Experimental only | | IQ4_XS | ~4.0 GB | ~96% | Fast | imatrix-optimized Q4 alternative |

The K_M vs K_S distinction: K_M ("medium") uses more bits for attention layers that impact quality most. K_S ("small") applies uniform quantization. K_M is almost always worth the tiny size increase.

Our Recommendation by Hardware

8 GB VRAM (RTX 4060, RTX 3070)

Use Q4_K_M. This is the sweet spot - 75% smaller than FP16 with only ~3.5% quality loss. You'll fit a 7B model with enough room for 4K-8K context.

For 13B+ models on 8 GB, you'll need Q3_K_M or partial GPU offloading. See our GPU layers guide for the split configuration.

12 GB VRAM (RTX 4070, RTX 3080)

Use Q5_K_M. You have headroom to spend on quality. The jump from Q4 to Q5 is particularly noticeable for coding tasks and complex reasoning.

16+ GB VRAM (RTX 4080, RTX 5070 Ti)

Use Q6_K or Q8_0. With 16 GB, you can run a Q8_0 7B model with full 8K context. For 13B models, Q5_K_M fits comfortably. Our consumer GPU benchmarks show the RTX 5070 Ti handling 50 req/s at Q4_K_M.

24+ GB VRAM (RTX 4090, RTX 5090)

Use Q8_0 for 7-13B or Q5_K_M for 30B+ models. At this tier you rarely need aggressive quantization. Run the highest quality your model size allows.

CPU-Only (No GPU)

Use Q4_K_M with all layers on CPU. Expect 5-15 tok/s depending on your CPU. The main bottleneck is memory bandwidth, not compute. 32 GB RAM minimum for 7B models (the OS and KV cache need room too).

When Quantization Quality Actually Matters

Not all tasks are equally sensitive to quantization:

Highly resilient (Q4 is fine):

Text summarization
Classification and routing
Simple Q&A from context
Chat and conversation

Moderately sensitive (use Q5+ if possible):

Code generation
Multi-step reasoning
Creative writing with nuance
Instruction following with complex constraints

Quality-critical (use Q6+ or Q8):

Arithmetic and math reasoning
Precise factual extraction
Structured output (JSON, XML)
Tasks where small errors cascade

Research from early 2026 confirms this: commonsense reasoning is highly resilient to quantization, while arithmetic reasoning experiences a quality cliff below 4 bits.

The Importance Matrix (imatrix) Trick

Standard quantization treats all weights equally. Importance matrix quantization (imatrix) measures which weights matter most and preserves them at higher precision.

IQ4_XS vs Q4_K_M: IQ4_XS is slightly smaller (4.0 vs 4.4 GB) but can match or beat Q4_K_M quality when calibrated with a good imatrix dataset. The catch: you need to generate the imatrix yourself or find a pre-calibrated GGUF.

Look for GGUF files tagged "imatrix" on Hugging Face. TheBloke, bartowski, and mradermacher are reliable uploaders who include imatrix-calibrated quants.

How to Quantize Your Own Models

If you need a quant that nobody has uploaded:

## 1. Convert to GGUF from HuggingFace format
python convert_hf_to_gguf.py ./my-model --outfile model-f16.gguf --outtype f16

## 2. Quantize to your target level
./llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M

## 3. (Optional) Generate imatrix for better quality
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4xs.gguf IQ4_XS

Always quantize from F16 or F32 source weights. Quantizing an already-quantized model (Q8 → Q4) introduces compounding errors.

Quick Decision Flowchart

Prefer to see the trade-offs side by side? Our quant comparison tool compares Q4/Q5/Q6/Q8 on size, quality, and VRAM fit for your exact GPU.

Do you have 24+ GB VRAM? → Use Q8_0
Do you have 16 GB VRAM? → Use Q6_K
Do you have 12 GB VRAM? → Use Q5_K_M
Do you have 8 GB VRAM? → Use Q4_K_M
Are you on CPU only? → Use Q4_K_M
Is the task math/code-heavy? → Go one level higher than your default
Is the task chat/summarization? → Your default is fine

Common Mistakes

Using Q2_K or Q3_K for anything serious - Below Q4, quality degrades sharply. Save these for experiments.
Ignoring KV cache VRAM - A Q4_K_M model might fit in 4.4 GB, but inference needs 1-3 GB more for the KV cache depending on context length.
Not setting --n-gpu-layers correctly - Partial offloading a Q5 model often outperforms full-GPU Q4. Check our GPU layers guide.
Quantizing from a quantized source - Always start from FP16/FP32 weights.

Running local LLMs and want to skip the cloud entirely? See our full guide to production inference on consumer GPUs - we benchmarked 4 cards and built the exact setup.

Every prompt we publish also lives in the free prompt library. Searchable, copy-paste ready, and updated with every new post.

Accompanying prompt

What the prompt does: Recommends the GGUF quant level that best fits your VRAM, model, and quality tolerance.

Copy/paste this prompt:

Copy-ready prompt

Paste the exact block into your coding agent.

No article chrome, no footnotes, no formatting drift.

Role: You are choosing a GGUF quantization level for local inference. My setup: - GPU and VRAM: [e.g. 16GB 5070 Ti] - Model: [e.g. Qwen2.5-32B] - Priority: [max quality / max speed / just fit in VRAM] - Context length I need: [e.g. 8k] Task: 1. Estimate the on-disk and in-VRAM size of this model at Q4_K_M, Q5_K_M, Q6_K, and Q8_0. 2. Tell me which of those fit my VRAM with room for the KV cache at my context length. 3. Recommend one quant and say what I give up versus the next step up. 4. Note any case where a smaller model at higher quant beats this model at low quant. Constraints: - Be specific about numbers, not vibes. - Account for context and KV cache, not just weights. - Do not recommend a quant that will not load on my card.

19 lines743 chars

Ready

This prompt and every other one we publish live in the free prompt library.

Copy the block above.

Compare quant sizes and quality side by side in the quant tool: https://bmdpat.com/tools/quant-compare

Free tools and the list

A wrong quant is a one-time load mistake. An agent that retries a broken local job overnight is a cost mistake. After you pick Q4_K_M or Q5_K_M, put a runtime budget on the loop so token burn cannot run unbounded. AgentGuard is the budget and rate limiter for that path.

Weekly measured runs on owned hardware: 5090 Reports

FAQ

Should I use Q4, Q5, or Q8 GGUF?

Use Q4 when VRAM is tight, Q5 for balance, and Q8 when you have enough memory and want less quality loss. Always leave room for KV cache.

Why can a smaller GGUF quant run better?

A smaller quant can stay fully on GPU with more context headroom. A larger quant may spill to CPU or reduce usable context, which hurts latency.

Get the Local AI Field Kit

Four copy-ready tools now, then measured local AI field notes M-F only when there is something worth sending.

Free. One-click unsubscribe. No sponsored placements. Your email is used only for these notes.

Patrick Hughes

Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-Two agents.