
GGUF Quantization Explained: Q4_K_M vs Q5_K_M vs Q8 — Which to Pick (2026)

Q4_K_M cuts model size by roughly 70% with minimal quality loss — but when should you use Q5, Q6, or Q8 instead? We benchmarked every quant level on real hardware and measured the actual accuracy tradeoffs.


If you're running local LLMs with llama.cpp, Ollama, or LM Studio, you've seen the alphabet soup: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ4_XS. Each one trades model size against output quality, and picking wrong either wastes your VRAM or tanks your results.

This guide cuts through the noise. We benchmarked every common quantization level and measured the actual accuracy tradeoffs so you can pick the right one for your hardware.

What Is GGUF Quantization?

A full-precision LLM stores every weight as a 16-bit floating point number (FP16). A 7B parameter model at FP16 weighs ~14 GB. Most consumer GPUs can't fit that alongside the KV cache needed for inference.

Quantization compresses those weights to lower precision — 8-bit, 5-bit, even 4-bit — dramatically shrinking the model. A Q4_K_M version of that same 7B model fits in ~4.4 GB, making it runnable on an 8 GB GPU with room for context.
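
If you want to sanity-check those numbers, the arithmetic is just parameter count times effective bits per weight. Here's a minimal Python sketch; the bits-per-weight figures are approximations (K-quants mix block types, and real files add metadata and per-block scales), so treat the output as a ballpark rather than an exact download size.

# Rough GGUF file size: parameters x effective bits per weight / 8.
# The bits-per-weight values are approximate; actual files vary slightly.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    total_bits = params_billions * 1e9 * APPROX_BPW[quant]
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

for quant in APPROX_BPW:
    print(f"7B @ {quant}: ~{estimated_size_gb(7.0, quant):.1f} GB")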

The "GGUF" part is just the file format. It packages the model weights, tokenizer, and all metadata into a single file that llama.cpp can load directly. If you need help configuring GPU layers for optimal performance, start with our --n-gpu-layers guide.

The Quantization Levels, Ranked

| Quant  | Size (7B) | Quality         | Speed   | Best For                         |
|--------|-----------|-----------------|---------|----------------------------------|
| F16    | ~14 GB    | 100% (baseline) | Slowest | Research, accuracy-critical      |
| Q8_0   | ~7.7 GB   | ~99.5%          | Fast    | When you have the VRAM for it    |
| Q6_K   | ~5.9 GB   | ~99%            | Fast    | Quality-first with moderate VRAM |
| Q5_K_M | ~5.1 GB   | ~98%            | Faster  | Sweet spot for 12+ GB GPUs       |
| Q5_K_S | ~4.8 GB   | ~97.5%          | Faster  | Slightly smaller Q5              |
| Q4_K_M | ~4.4 GB   | ~96.5%          | Fastest | Best balance for 8 GB GPUs       |
| Q4_K_S | ~4.1 GB   | ~95.5%          | Fastest | When every MB counts             |
| Q3_K_M | ~3.5 GB   | ~92%            | Fastest | Not recommended for production   |
| Q2_K   | ~2.8 GB   | ~85%            | Fastest | Experimental only                |
| IQ4_XS | ~4.0 GB   | ~96%            | Fast    | imatrix-optimized Q4 alternative |

The K_M vs K_S distinction: K_M ("medium") uses more bits for attention layers that impact quality most. K_S ("small") applies uniform quantization. K_M is almost always worth the tiny size increase.

Our Recommendation by Hardware

8 GB VRAM (RTX 4060, RTX 3070)

Use Q4_K_M. This is the sweet spot — roughly 70% smaller than FP16 (14 GB down to ~4.4 GB) with only ~3.5% quality loss. You'll fit a 7B model with enough room for 4K–8K context.

For 13B+ models on 8 GB, you'll need Q3_K_M or partial GPU offloading. See our GPU layers guide for the split configuration.
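
If you do take the partial-offload route for a 13B model, the only knob that changes is the layer count. A rough sketch with llama-cpp-python; the value of 24 is a starting guess, not a measured split, so raise or lower it until you stop hitting out-of-memory errors.

from llama_cpp import Llama

# 13B at Q4_K_M won't fit entirely in 8 GB, so put only part of the model on the GPU.
llm = Llama(
    model_path="./llama-2-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,  # starting guess; tune to your card
    n_ctx=4096,
)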

12 GB VRAM (RTX 4070, RTX 3080)

Use Q5_K_M. You have headroom to spend on quality. The jump from Q4 to Q5 is particularly noticeable for coding tasks and complex reasoning.

16+ GB VRAM (RTX 4080, RTX 5070 Ti)

Use Q6_K or Q8_0. With 16 GB, you can run a Q8_0 7B model with full 8K context. For 13B models, Q5_K_M fits comfortably. Our consumer GPU benchmarks show the RTX 5070 Ti handling 50 req/s at Q4_K_M.

24+ GB VRAM (RTX 4090, RTX 5090)

Use Q8_0 for 7–13B or Q5_K_M for 30B+ models. At this tier you rarely need aggressive quantization. Run the highest quality your model size allows.

CPU-Only (No GPU)

Use Q4_K_M with all layers on CPU. Expect 5–15 tok/s depending on your CPU. The main bottleneck is memory bandwidth, not compute. 32 GB RAM minimum for 7B models (the OS and KV cache need room too).
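
The CPU-only setup is the same idea with zero GPU layers and an explicit thread count. A minimal sketch with llama-cpp-python; the thread count is a placeholder, so match it to your physical cores.

from llama_cpp import Llama

# CPU-only: nothing offloaded, threads pinned to physical cores.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,
    n_threads=8,  # set to your physical core count
    n_ctx=4096,
)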

When Quantization Quality Actually Matters

Not all tasks are equally sensitive to quantization:

Highly resilient (Q4 is fine):

  • Text summarization
  • Classification and routing
  • Simple Q&A from context
  • Chat and conversation

Moderately sensitive (use Q5+ if possible):

  • Code generation
  • Multi-step reasoning
  • Creative writing with nuance
  • Instruction following with complex constraints

Quality-critical (use Q6+ or Q8):

  • Arithmetic and math reasoning
  • Precise factual extraction
  • Structured output (JSON, XML)
  • Tasks where small errors cascade

Research from early 2026 confirms this: commonsense reasoning is highly resilient to quantization, while arithmetic reasoning experiences a quality cliff below 4 bits.
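
The practical upshot: spot-check your own workload before committing to a quant. Here's a rough sketch of that kind of check with llama-cpp-python; the prompts, path, and pass criterion are illustrative only, and a handful of cases is nowhere near a real benchmark.

from llama_cpp import Llama

# Tiny spot-check: run a few prompts with known answers through a quant
# and count the hits. Illustrative only, not a benchmark.
CASES = [
    ("What is 17 * 23? Answer with just the number.", "391"),
    ("What is 144 / 12? Answer with just the number.", "12"),
]

llm = Llama(model_path="./model.Q4_K_M.gguf", n_gpu_layers=-1, verbose=False)

correct = 0
for prompt, expected in CASES:
    text = llm(prompt, max_tokens=16, temperature=0.0)["choices"][0]["text"]
    correct += expected in text
print(f"{correct}/{len(CASES)} correct")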

The Importance Matrix (imatrix) Trick

Standard quantization treats all weights equally. Importance matrix quantization (imatrix) measures which weights matter most and preserves them at higher precision.

IQ4_XS vs Q4_K_M: IQ4_XS is slightly smaller (4.0 vs 4.4 GB) but can match or beat Q4_K_M quality when calibrated with a good imatrix dataset. The catch: you need to generate the imatrix yourself or find a pre-calibrated GGUF.

Look for GGUF files tagged "imatrix" on Hugging Face. bartowski and mradermacher are reliable uploaders who regularly publish imatrix-calibrated quants; TheBloke's catalog remains a solid source for standard quants, though most of it predates imatrix support.

How to Quantize Your Own Models

If you need a quant that nobody has uploaded:

# 1. Convert to GGUF from HuggingFace format
python convert_hf_to_gguf.py ./my-model --outfile model-f16.gguf --outtype f16

# 2. Quantize to your target level
./llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M

# 3. (Optional) Generate imatrix for better quality
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4xs.gguf IQ4_XS

Always quantize from F16 or F32 source weights. Quantizing an already-quantized model (Q8 → Q4) introduces compounding errors.

Quick Decision Flowchart

  1. Do you have 24+ GB VRAM? → Use Q8_0
  2. Do you have 16 GB VRAM? → Use Q6_K
  3. Do you have 12 GB VRAM? → Use Q5_K_M
  4. Do you have 8 GB VRAM? → Use Q4_K_M
  5. Are you on CPU only? → Use Q4_K_M
  6. Is the task math/code-heavy? → Go one level higher than your default
  7. Is the task chat/summarization? → Your default is fine
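
If you'd rather encode that flowchart than memorize it, here is the same logic as a small Python function. It only restates the rules above; the thresholds and quant names come straight from this guide, not from measurements of your specific model.

# The flowchart as code: pick a default by VRAM, bump one level for math/code-heavy work.
LADDER = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]

def pick_quant(vram_gb: float, math_or_code_heavy: bool = False) -> str:
    if vram_gb >= 24:
        idx = 3  # Q8_0
    elif vram_gb >= 16:
        idx = 2  # Q6_K
    elif vram_gb >= 12:
        idx = 1  # Q5_K_M
    else:
        idx = 0  # Q4_K_M (also the CPU-only default)
    if math_or_code_heavy:
        idx = min(idx + 1, len(LADDER) - 1)
    return LADDER[idx]

print(pick_quant(8))         # Q4_K_M
print(pick_quant(12, True))  # Q6_K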

Common Mistakes

  • Using Q2_K or Q3_K for anything serious — Below Q4, quality degrades sharply. Save these for experiments.
  • Ignoring KV cache VRAM — A Q4_K_M model might fit in 4.4 GB, but inference needs 1–3 GB more for the KV cache depending on context length (a rough estimator is sketched after this list).
  • Not setting --n-gpu-layers correctly — Partially offloading a Q5 model often outperforms running a Q4 entirely on the GPU. Check our GPU layers guide.
  • Quantizing from a quantized source — Always start from FP16/FP32 weights.
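
For the KV cache, a back-of-the-envelope estimate is enough to avoid surprises. The sketch below uses the standard transformer accounting (two tensors per layer, K and V, each n_kv_heads x head_dim per token); the dimensions assume a Llama-2-7B-shaped model with an FP16 cache, so models with grouped-query attention or a quantized KV cache will need considerably less.

# KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim * context * bytes per element.
# Dimensions below assume a Llama-2-7B-shaped model and an FP16 cache.
def kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                ctx_len=8192, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"~{kv_cache_gb(ctx_len=4096):.1f} GB at 4K context")  # ~2.1 GB
print(f"~{kv_cache_gb(ctx_len=8192):.1f} GB at 8K context")  # ~4.3 GB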

Running local LLMs and want to skip the cloud entirely? See our full guide to production inference on consumer GPUs — we benchmarked 4 cards and built the exact setup.

