What is the best GGUF quantization for local LLMs?

Q4_K_M is the default pick in this tool's reference table: it keeps 97.5% of F16 quality at 40GB for the 70B reference model, and it is marked as a sweet spot. Q5_K_M is the quality-leaning sweet spot at 98.5%, when it fits. Q8_0 is the near-lossless option at 99.5%, for when fidelity matters more than memory.

How much quality do you lose with Q4_K_M?

Q4_K_M scores 97.5% against F16 in the reference table, so the reference loss is 2.5 percentage points. The 70B reference file shrinks from 140GB to 40GB, saving 100GB (71.4%).
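The arithmetic behind those two numbers is simple enough to check by hand. A minimal sketch, using only the 70B reference figures quoted above; the function names are hypothetical and not part of the tool:

```python
def size_saving(f16_gb: float, quant_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent saved) relative to the F16 file size."""
    saved_gb = f16_gb - quant_gb
    saved_pct = round(100.0 * saved_gb / f16_gb, 1)
    return saved_gb, saved_pct


def quality_loss(quality_pct: float) -> float:
    """Return the quality gap to F16 in percentage points."""
    return round(100.0 - quality_pct, 1)


# Q4_K_M row of the 70B reference: 140GB at F16, 40GB quantized, 97.5% quality.
print(size_saving(140.0, 40.0))  # (100.0, 71.4)
print(quality_loss(97.5))        # 2.5
```

The same two functions reproduce the "VRAM saved" column for any other row of the reference table.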

Should I use Q5_K_M or Q8_0?

Use Q5_K_M when you want the higher sweet-spot quality proxy (98.5%) and it fits. Use Q8_0 when you have enough VRAM and value the near-lossless row (99.5%) over a smaller download. For most default local runs, compare both against Q4_K_M.

Local LLM Toolkit

See the GGUF quality tradeoff before you download.

Compare IQ2_XXS, IQ3_XXS, IQ4_XS, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and F16 by size, quality, speed, and GPU fit. Pick the smallest file that still keeps the model useful.

Compare

Model family, GPU, VRAM (GB)

What I would pick

Q4_K_M

Q4_K_M needs CPU offload, but it preserves more quality than the ultra-small quants.

Size: 40GB

Quality: 97.5%

GPU layers: 46/80

Speed: 8.3-16 tok/s
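One way to see where a figure like 46/80 GPU layers can come from is an even per-layer split. This is a rough sketch, not the tool's actual method: it assumes all 80 layers are the same size, and the 1GB `reserve_gb` for context and runtime overhead is a hypothetical value chosen for illustration.

```python
import math


def gpu_layers_that_fit(file_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 1.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Q4_K_M 70B reference (40GB, 80 layers) on a 24GB card.
print(gpu_layers_that_fit(40, 80, 24))  # 46
```

The remaining layers run on the CPU, which is why the speed is given as a range rather than a single number.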

GGUF tradeoff curve

Generic 70B on RTX 4090 24GB

Q4_K_M to Q5_K_M sweet spot

Quant level	Size	Quality vs F16	Speed boost	VRAM saved	Verdict
IQ2_XXS	20GB	87%	2.4x	120GB (85.7%)	Last resort. Small, but the model changes.
IQ3_XXS	28GB	93%	2.3x	112GB (80%)	Quality loss is noticeable. Use for fit.
IQ4_XS	36GB	96.5%	2.1x	104GB (74.3%)	Compact pick when Q4_K_M is too tight.
Q4_0	37GB	96%	2x	103GB (73.6%)	Older Q4. Prefer Q4_K_M when available.
Q4_K_M (sweet spot)	40GB	97.5%	2x	100GB (71.4%)	Default sweet spot for most local runs.
Q5_K_M (sweet spot)	47GB	98.5%	1.8x	93GB (66.4%)	Quality pick. Worth it when it fits.
Q6_K	54GB	99%	1.6x	86GB (61.4%)	High quality. Still a large download.
Q8_0	70GB	99.5%	1.4x	70GB (50%)	Near lossless. Good when memory is abundant.
F16	140GB	100%	1x	0GB (0%)	Native quality. Usually too large for local GPUs.
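The sizes in the table are for the 70B reference only. To get a feel for other model sizes, you can back out an approximate bits-per-weight figure and scale it. A rough sketch: it assumes exactly 70e9 weights and 1 GB = 1e9 bytes, scales linearly, and ignores file metadata. The function names and the 8B example are illustrative and not from the tool.

```python
PARAMS_70B = 70e9  # assumption: the 70B reference has exactly 70e9 weights


def bits_per_weight(file_gb: float, params: float = PARAMS_70B) -> float:
    """Approximate average bits per weight, taking 1 GB = 1e9 bytes."""
    return file_gb * 1e9 * 8 / params


def scaled_size_gb(file_gb: float, target_params: float,
                   params: float = PARAMS_70B) -> float:
    """Scale a reference file size linearly to another parameter count."""
    return file_gb * target_params / params


print(round(bits_per_weight(140), 2))     # 16.0 (F16 row)
print(round(bits_per_weight(40), 2))      # 4.57 (Q4_K_M row)
print(round(scaled_size_gb(40, 8e9), 1))  # 4.6 (hypothetical 8B at Q4_K_M)
```

The F16 row coming out at 16 bits is a quick sanity check that the table's sizes are consistent with a 70e9-weight model.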
