How to Pick a GGUF Quant Level for Your VRAM Budget
Given your GPU, which GGUF quant do you actually pick? The VRAM math, a card-by-card table, and the quality tradeoff in plain terms.
How to Pick a GGUF Quant Level for Your VRAM Budget
A while back I wrote a breakdown of GGUF quantization: Q4 vs Q5 vs Q8. That post explains what the quant levels mean. This one answers the next question people ask: given my GPU, which one do I actually pick?
The short version. Fit the model in VRAM first. Then climb to the highest quant that still fits with room to spare. Here is the actual math so you can do it yourself.
The one formula you need
Model size in VRAM is roughly:
params (billions) x bits-per-weight / 8 = GB
A 7B model at Q4 (about 4.5 bits per weight in practice) is:
7 x 4.5 / 8 = ~3.9 GB
The same 7B at Q8 (about 8.5 bits) is:
7 x 8.5 / 8 = ~7.4 GB
That is the weights only. You also need room for the KV cache and some overhead. Add 15 to 25 percent on top. So plan for the Q8 7B to want roughly 9 GB total at a normal context length.
Match the quant to the card
Here is the practical table I use. These assume you want the whole model on the GPU, not split to CPU.
8 GB card (3070, 4060): a 7B fits comfortably at Q4 or Q5. Q8 is tight once you add context. Stick to Q5_K_M for the best quality that still fits.
12 GB card (3060 12GB, 4070): 7B at Q8 fits fine. A 13B fits at Q4 or Q5. This is the sweet spot for one mid-size model at high quality.
16 GB card (4070 Ti Super, 4060 Ti 16GB): 13B at Q5 or Q8. A 7B at Q8 leaves tons of room for a long context.
24 GB card (3090, 4090): 13B at Q8 with a big context, or a 34B at Q4. This is where you stop worrying about quants for small models.
32 GB card (5090): a 34B at Q5 or Q8, or two 13B models loaded at once. At this point quant choice is about speed, not fitting.
The quality-vs-size tradeoff in plain terms
People obsess over this more than they should. Here is what actually matters.
Q8 is near lossless. If it fits, take it and stop thinking.
Q5_K_M is the value pick. Quality loss is small enough that most people cannot tell in normal use. It saves a lot of VRAM over Q8.
Q4_K_M is fine for chat and drafting. You will notice it more on hard reasoning and code. Use it when Q5 does not fit.
Below Q4, quality drops fast. Only go there when you have no other way to fit the model. A bigger model at Q4 usually beats a smaller model at Q8.
That last point is the one people miss. A 13B at Q4 generally beats a 7B at Q8. More parameters at lower precision wins over fewer parameters at high precision, up to a point. So fit the biggest model your card allows, then pick the quant.
A quick way to test before you commit
Do not trust a table blindly. Run it. Pull two quants of the same model and compare on your own prompts.
# load Q5 and watch VRAM with nvidia-smi in another terminal ./llama-cli -m mistral-7b-q5_k_m.gguf -p "your real prompt here" -ngl 99 # then the Q8 ./llama-cli -m mistral-7b-q8_0.gguf -p "your real prompt here" -ngl 99
Watch two things. Did it fit fully on the GPU, and does the output quality hold up on your actual work. If Q8 fits and the speed is acceptable, you are done. If it spills to CPU and slows to a crawl, drop to Q5.
When you run this in production
Local quants are cheap to run but not free. You still pay in GPU time, electricity, and the engineer hours when a context overflow crashes a job at 2 AM. If you are wiring a local model into an agent that calls it in a loop, put a budget around the loop so a runaway agent does not peg your GPU all night.
That is what I built AgentGuard for. It is an open-source runtime budget, token, and rate limiter for AI agents, and it works the same whether the model is a cloud API or a local GGUF on your own card. pip install agentguard, wrap your agent loop, set a cap, and you stop the runaway before it costs you a night of compute.
Pick the model first, the quant second, and test on your own prompts. The card decides the ceiling. You decide the rest.
Want more like this?
AI agent builds, real costs, what works. M-F only when there is something worth sending. No fluff.
Patrick Hughes
Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-Two agents.
More writing
- 6 min
How to Tune --n-gpu-layers for Your VRAM Budget
How to actually pick --n-gpu-layers: the offload math, finding the number with nvidia-smi, multi-GPU splits, and the top OOM mistakes.
- 7 min
llama.cpp Multi-GPU: Splitting a Model Across Cards with --tensor-split
Split a 70B model across multiple GPUs with llama.cpp. How --tensor-split, --main-gpu, and --split-mode work on a real consumer rig.
- 6 min
llama.cpp ngl: when -ngl 99 still runs on your CPU
You set -ngl 99 and llama.cpp still runs on your CPU. The flag is fine. Here is the 30-second load-log diagnostic and the five real causes, ranked.