Local LLM Toolkit
See the GGUF quality tradeoff before you download.
Compare Q4, Q5, Q8, and the IQ quants by size, quality, speed, and GPU fit. Pick the smallest file that still keeps the model useful.
What I would pick
Q4_K_M
On a 24GB RTX 4090, the 40GB Q4_K_M file does not fit entirely in VRAM, so 34 of its 80 layers run on the CPU. It still preserves more quality than the ultra-small quants.
Size: 40GB
Quality: 97.5%
GPU layers: 46/80
Speed: 8.3-16 tok/s
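One rough way to arrive at a figure like 46/80 is to assume every layer takes an equal share of the file and to hold back a little VRAM for everything else. The sketch below does that. The function name and the 1GB reserve are assumptions made for illustration, not the tool's documented method, and a real run needs extra room for the KV cache and context.

```python
import math


def estimate_gpu_layers(file_size_gb: float, total_layers: int,
                        vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM.

    Assumes all layers are the same size and that `reserve_gb`
    (a hypothetical allowance) is kept free for buffers.
    """
    per_layer_gb = file_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, math.floor(usable_gb / per_layer_gb))


# Q4_K_M of a 70B model (40GB, 80 layers) on a 24GB card
layers = estimate_gpu_layers(file_size_gb=40, total_layers=80, vram_gb=24)
print(f"{layers}/80 layers on GPU")
```

Raising `reserve_gb` lowers the layer count, which is the adjustment to make for long contexts.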
GGUF tradeoff curve: generic 70B on RTX 4090 24GB (sweet spot: Q4 to Q5)
| Quant level | Size | Quality vs F16 | Speed boost | VRAM saved | Verdict |
|---|---|---|---|---|---|
| IQ2_XXS | 20GB | 87% | 2.4x | 120GB (85.7%) | Last resort. Small, but the model changes. |
| IQ3_XXS | 28GB | 93% | 2.3x | 112GB (80%) | Quality loss is noticeable. Use for fit. |
| IQ4_XS | 36GB | 96.5% | 2.1x | 104GB (74.3%) | Compact pick when Q4_K_M is too tight. |
| Q4_0 | 37GB | 96% | 2x | 103GB (73.6%) | Older Q4. Prefer Q4_K_M when available. |
| Q4_K_M (sweet spot) | 40GB | 97.5% | 2x | 100GB (71.4%) | Default sweet spot for most local runs. |
| Q5_K_M (sweet spot) | 47GB | 98.5% | 1.8x | 93GB (66.4%) | Quality pick. Worth it when it fits. |
| Q6_K | 54GB | 99% | 1.6x | 86GB (61.4%) | High quality. Still a large download. |
| Q8_0 | 70GB | 99.5% | 1.4x | 70GB (50%) | Near lossless. Good when memory is abundant. |
| F16 | 140GB | 100% | 1x | 0GB (0%) | Native quality. Usually too large for local GPUs. |
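The "VRAM saved" column is plain arithmetic against the 140GB F16 file. A minimal sketch that reproduces it is below. The function and constant names are illustrative, and the sizes are the ones listed in the table.

```python
F16_SIZE_GB = 140  # F16 size of the 70B baseline in the table


def vram_saved(quant_size_gb: float, f16_size_gb: float = F16_SIZE_GB):
    """Return (GB saved, percent saved) relative to the F16 file."""
    saved_gb = f16_size_gb - quant_size_gb
    saved_pct = round(100 * saved_gb / f16_size_gb, 1)
    return saved_gb, saved_pct


for name, size_gb in [("IQ2_XXS", 20), ("Q4_K_M", 40), ("Q5_K_M", 47), ("Q8_0", 70)]:
    saved_gb, saved_pct = vram_saved(size_gb)
    print(f"{name}: {saved_gb}GB saved ({saved_pct}%)")
```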
Default model: 70B baseline
Sweet spot: Q4_K_M to Q5_K_M
FAQ
- What is the best GGUF quantization for local LLMs?
- Q4_K_M is the default pick for most local runs. Q5_K_M is better when you have enough VRAM. Q8_0 is useful when quality matters more than size.
- How much quality do you lose with Q4_K_M?
- For many models, Q4_K_M lands around 97% to 98% of the F16 baseline in practical use. It is the main sweet spot because the file shrinks by roughly 71% (140GB to 40GB for a 70B model) while quality usually holds.
- Should I use Q5_K_M or Q8_0?
- Use Q5_K_M when you want strong quality on one prosumer GPU. Use Q8_0 when you have enough VRAM and want a near-lossless local copy.
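The selection advice above can be sketched as two small lookups over the table's 70B numbers: the smallest file that clears a quality floor, and the best quality that fits a memory budget. The function names, the quality floor, and the budget values are hypothetical. The budget here means whatever combined GPU and CPU memory you are willing to give the weights, and it ignores KV cache overhead.

```python
# (name, size in GB, quality vs F16 in %), taken from the 70B table
QUANTS = [
    ("IQ2_XXS", 20, 87.0), ("IQ3_XXS", 28, 93.0), ("IQ4_XS", 36, 96.5),
    ("Q4_0", 37, 96.0), ("Q4_K_M", 40, 97.5), ("Q5_K_M", 47, 98.5),
    ("Q6_K", 54, 99.0), ("Q8_0", 70, 99.5), ("F16", 140, 100.0),
]


def smallest_above(min_quality_pct: float):
    """Smallest file whose quality meets the floor, or None."""
    ok = [q for q in QUANTS if q[2] >= min_quality_pct]
    return min(ok, key=lambda q: q[1])[0] if ok else None


def best_fit(budget_gb: float):
    """Highest-quality file that fits the memory budget, or None."""
    ok = [q for q in QUANTS if q[1] <= budget_gb]
    return max(ok, key=lambda q: q[2])[0] if ok else None


print(smallest_above(97.0))  # smallest quant at or above 97% quality
print(best_fit(48))          # best quality within a 48GB budget
```

With these numbers, a 97% floor selects Q4_K_M and a 48GB budget selects Q5_K_M, which matches the sweet-spot range given on the page.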