[bmdpat]

Local LLM Toolkit

See the GGUF quality tradeoff before you download.

Compare IQ2_XXS, IQ3_XXS, IQ4_XS, Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and F16 by size, quality, speed, and GPU fit. Pick the smallest file that still keeps the model useful.

What I would pick

Q4_K_M

Q4_K_M needs CPU offload on this GPU (46 of 80 layers fit in VRAM), but it preserves more quality than the ultra-small quants.

Size: 40GB
Quality: 97.5% of F16
GPU layers: 46/80
Speed: 8.3-16 tok/s
Fit: Needs more VRAM
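
The GPU-layer figure above can be approximated with simple arithmetic. The sketch below is not the tool's actual formula, which the page does not state. It assumes every layer is the same size and that a fixed amount of VRAM is reserved for context and runtime buffers. The function name `gpu_layers_that_fit` and the 1GB `reserved_gb` default are hypothetical, chosen for illustration because they happen to reproduce the 46/80 shown for a 40GB model on a 24GB card.

```python
import math


def gpu_layers_that_fit(model_gb, n_layers, vram_gb, reserved_gb=1.0):
    """Estimate how many layers fit in VRAM.

    Assumptions (not from the tool): layers are equally sized, and
    reserved_gb of VRAM is held back for KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Q4_K_M for the 70B reference: 40GB, 80 layers, 24GB card
layers = gpu_layers_that_fit(40, 80, 24)
print(f"{layers}/80")  # 46/80
```

Real layer sizes vary and the context length changes how much memory must be reserved, so treat the result as a starting point for the offload setting rather than a guarantee.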

Some links are affiliate links. If you buy or rent through them I may earn a commission at no extra cost to you.

GGUF tradeoff curve

Generic 70B on RTX 4090 24GB

Q4_K_M to Q5_K_M sweet spot
Quant level          Size   Quality vs F16  Speed boost  VRAM saved     Verdict
IQ2_XXS              20GB   87%             2.4x         120GB (85.7%)  Last resort. Small, but the model changes.
IQ3_XXS              28GB   93%             2.3x         112GB (80%)    Quality loss is noticeable. Use for fit.
IQ4_XS               36GB   96.5%           2.1x         104GB (74.3%)  Compact pick when Q4_K_M is too tight.
Q4_0                 37GB   96%             2x           103GB (73.6%)  Older Q4. Prefer Q4_K_M when available.
Q4_K_M (sweet spot)  40GB   97.5%           2x           100GB (71.4%)  Default sweet spot for most local runs.
Q5_K_M (sweet spot)  47GB   98.5%           1.8x         93GB (66.4%)   Quality pick. Worth it when it fits.
Q6_K                 54GB   99%             1.6x         86GB (61.4%)   High quality. Still a large download.
Q8_0                 70GB   99.5%           1.4x         70GB (50%)     Near lossless. Good when memory is abundant.
F16                  140GB  100%            1x           0GB (0%)       Native quality. Usually too large for local GPUs.
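
The "VRAM saved" column follows directly from each file size and the 140GB F16 baseline. A minimal sketch of that arithmetic, using the sizes from the table; the names `F16_GB`, `SIZES_GB`, and `vram_saved` are illustrative and not part of the tool:

```python
F16_GB = 140  # F16 size of the 70B reference model, from the table

# File sizes in GB, copied from the table
SIZES_GB = {
    "IQ2_XXS": 20, "IQ3_XXS": 28, "IQ4_XS": 36, "Q4_0": 37,
    "Q4_K_M": 40, "Q5_K_M": 47, "Q6_K": 54, "Q8_0": 70, "F16": 140,
}


def vram_saved(size_gb, baseline_gb=F16_GB):
    """Return (GB saved, percent saved) relative to the baseline size."""
    saved_gb = baseline_gb - size_gb
    saved_pct = round(100 * saved_gb / baseline_gb, 1)
    return saved_gb, saved_pct


for name, size in SIZES_GB.items():
    saved_gb, saved_pct = vram_saved(size)
    print(f"{name}: {saved_gb}GB ({saved_pct}%)")
```

For Q4_K_M this gives 100GB and 71.4%, matching the table row.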

Default model: Generic 70B

Sweet spot: Q4_K_M to Q5_K_M

FAQ

What is the best GGUF quantization for local LLMs?
Q4_K_M is the default pick in this tool's reference table: 97.5% of F16 quality at 40GB for the 70B reference model, and marked as a sweet spot. Q5_K_M is the quality-leaning sweet spot at 98.5% when it fits. Q8_0 is the near-lossless option at 99.5% when fidelity matters more than memory.

How much quality do you lose with Q4_K_M?
Q4_K_M scores 97.5% of F16 in the reference table, a loss of 2.5 percentage points. The 70B reference size drops from 140GB to 40GB, saving 100GB (71.4%).

Should I use Q5_K_M or Q8_0?
Use Q5_K_M when you want the sweet spot with the higher quality proxy (98.5%) and it fits. Use Q8_0 when you have enough VRAM and value the near-lossless row (99.5%) over a smaller download. For most default local runs, compare both against Q4_K_M.
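
The advice to pick the smallest file that still keeps the model useful can be written as a simple filter over the reference table. This is a sketch, not the tool's selection logic: the quality threshold is something you choose yourself, and the names `QUANTS` and `smallest_above` are hypothetical.

```python
# (name, size in GB, quality as % of F16), copied from the reference table
QUANTS = [
    ("IQ2_XXS", 20, 87.0), ("IQ3_XXS", 28, 93.0), ("IQ4_XS", 36, 96.5),
    ("Q4_0", 37, 96.0), ("Q4_K_M", 40, 97.5), ("Q5_K_M", 47, 98.5),
    ("Q6_K", 54, 99.0), ("Q8_0", 70, 99.5), ("F16", 140, 100.0),
]


def smallest_above(min_quality_pct, quants=QUANTS):
    """Return the name of the smallest quant meeting the quality floor, or None."""
    candidates = [q for q in quants if q[2] >= min_quality_pct]
    if not candidates:
        return None
    return min(candidates, key=lambda q: q[1])[0]


print(smallest_above(97))  # Q4_K_M
print(smallest_above(98))  # Q5_K_M
```

A floor of 97% lands on Q4_K_M and a floor of 98% on Q5_K_M, which lines up with the sweet-spot range the table marks.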

§ 002 / PRICING

Unlimited local LLM decisions with Pro.

The toolkit is free for up to 5 runs per tool per day. Upgrade to Pro to remove the limit and keep your rig history in one place.

Free

$0

  • 5 free runs per tool per day
  • Standard GPU presets

Pro

$7/mo

  • Unlimited calculator runs
  • Save my rig and get new-fit alerts
  • Import custom models from Hugging Face URLs
  • Benchmark history across model and quant choices
  • Early access to new toolkit surfaces
  • No ads

Or $49/year