GGUF Quantization and VRAM: How to Pick Q4, Q5, or Q8 for Your GPU (2026)

VRAM decides your GGUF quant, not vibes. How I assign Q4, Q5, Q8 across an 8GB 3070, 16GB 5070 Ti, and 32GB 5090.

If you have read my breakdown of Q4, Q5, and Q8 GGUF quants, you know what each quant level means. This post goes one layer deeper, to the thing that actually decides your quant choice: VRAM. And because most of us are fitting these models so an agent can call them all day, VRAM is only half the bill. The other half is runtime cost, which I will come back to at the end.

I run Llama 3.1 8B on three cards. An RTX 3070 with 8GB. A 5070 Ti with 16GB. A 5090 with 32GB. The same model behaves very differently on each, and the quant level is the lever that makes it fit or fail.

The VRAM math nobody writes down

A GGUF file's size on disk is roughly what the weights will cost you in VRAM. For Llama 3.1 8B:

  • Q4_K_M lands near 4.9GB on disk.
  • Q5_K_M lands near 5.7GB.
  • Q8_0 lands near 8.5GB.

But the file size is not the whole story. You also pay for the KV cache and a compute buffer. The KV cache grows linearly with context length. At 8K context on Llama 3.1 8B with an FP16 cache you are looking at roughly 1GB extra. At 32K context that climbs past 4GB.

So the real VRAM bill is weights plus KV cache plus a few hundred MB of overhead. On my 8GB 3070, Q8 is dead on arrival. The weights alone eat 8.5GB before context. Q4_K_M fits with room for an 8K window. That is the whole reason Q4 exists.
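The bill above fits in a few lines of Python. This is a rough sketch, not a measurement. It assumes Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and a flat 0.5GB overhead that is only a placeholder for the compute buffer. The function names are made up for illustration.

```python
# Assumed Llama 3.1 8B shape (grouped-query attention with 8 KV heads).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # FP16 cache; a quantized KV cache would be smaller


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes for keys plus values across all layers at n_ctx tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * n_ctx


def vram_estimate_gb(weights_gb: float, n_ctx: int, overhead_gb: float = 0.5) -> float:
    """Weights + KV cache + a flat overhead guess, in decimal GB."""
    return weights_gb + kv_cache_bytes(n_ctx) / 1e9 + overhead_gb


if __name__ == "__main__":
    # File sizes are the ones quoted in this post.
    for name, weights_gb in [("Q4_K_M", 4.9), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
        print(name, round(vram_estimate_gb(weights_gb, n_ctx=8192), 2), "GB at 8K")
```

Under these assumptions the cache costs 128 KiB per token, which is where the "roughly 1GB at 8K" figure comes from.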

This matters more for agents than for chat. An agent calls the model in a loop, often with a growing context. The KV cache you sized for a single turn balloons across a multi-step run. If you size your quant with no headroom, the agent crashes mid-task when context grows. Pick the quant that leaves slack for the loop, not just the first prompt.
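One way to size that slack is to invert the math and ask how many tokens of context the card can hold once the weights are loaded. The sketch below assumes 128 KiB of FP16 KV cache per token for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128), a 0.5GB overhead guess, and a 10 percent reserve. The function names and the per-step token figure are hypothetical.

```python
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # keys + values, FP16, all layers


def max_context_tokens(vram_gb: float, weights_gb: float,
                       overhead_gb: float = 0.5, headroom: float = 0.10) -> int:
    """Largest context whose KV cache fits after weights, overhead, and reserve."""
    usable_gb = vram_gb * (1 - headroom) - weights_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    return int(usable_gb * 1e9 // KV_BYTES_PER_TOKEN)


def steps_before_overflow(limit_tokens: int, prompt_tokens: int, tokens_per_step: int) -> int:
    """How many loop steps fit if each step appends tokens_per_step to the context."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return max(0, (limit_tokens - prompt_tokens) // tokens_per_step)


if __name__ == "__main__":
    limit = max_context_tokens(vram_gb=8, weights_gb=4.9)  # Q4_K_M on an 8GB card
    print("context ceiling:", limit)
    print("agent steps:", steps_before_overflow(limit, prompt_tokens=2000, tokens_per_step=1500))
```

With these numbers an 8GB card holds somewhere around 13K tokens at Q4_K_M, under 8K at Q5_K_M, and nothing at Q8_0. Treat the output as a planning estimate and confirm against what your runtime actually reports.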

Where each quant actually makes sense

Here is how I assign them across my cards.

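8GB card (3070): Q4_K_M only. It fits Llama 3.1 8B with an 8K context and leaves headroom for an agent loop. Push to Q5_K_M and the 5.7GB of weights plus roughly 1GB of KV cache and overhead leave almost no slack, so a growing context starts spilling layers to CPU, which tanks tokens per second. Stay at Q4.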

16GB card (5070 Ti): Q5_K_M or Q6_K. Now you have slack. Q5_K_M gives you better output quality than Q4 without any CPU-offload penalty, because everything still lives on the GPU. You can also run a 32K context here, which is what an agent with tool history actually needs.

32GB card (5090): Q8_0, or a bigger model. At 32GB the 8B model at Q8 is trivial. The smarter move is to step up to a 13B or 14B model at Q5 and use the extra brains for harder agent reasoning. Quant level stops being the constraint and model size takes over.
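These assignments can be sanity-checked by computing the VRAM left over for each card and quant. The sketch below uses the file sizes quoted in this post, an assumed 128 KiB of FP16 KV cache per token for Llama 3.1 8B, and a 0.5GB overhead guess. The 32K context for the 5090 is my assumption, and Q6_K is left out because no size is given for it here.

```python
QUANT_GB = {"Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q8_0": 8.5}

# (VRAM in GB, context length in tokens) per card; the 5090 context is assumed.
CARDS = {
    "RTX 3070": (8, 8192),
    "RTX 5070 Ti": (16, 32768),
    "RTX 5090": (32, 32768),
}

KV_BYTES_PER_TOKEN = 131072  # 2 * 32 layers * 8 KV heads * 128 dims * 2 bytes
OVERHEAD_GB = 0.5            # placeholder for compute buffer and runtime overhead


def slack_gb(vram_gb: float, weights_gb: float, n_ctx: int) -> float:
    """VRAM left after weights, KV cache, and overhead. Negative means it does not fit."""
    kv_gb = KV_BYTES_PER_TOKEN * n_ctx / 1e9
    return vram_gb - (weights_gb + kv_gb + OVERHEAD_GB)


def slack_table() -> dict:
    """Leftover GB for every card and quant, rounded to two decimals."""
    return {
        card: {q: round(slack_gb(vram, w, ctx), 2) for q, w in QUANT_GB.items()}
        for card, (vram, ctx) in CARDS.items()
    }


if __name__ == "__main__":
    for card, row in slack_table().items():
        print(card, row)
```

Under these assumptions Q5_K_M on the 3070 technically loads at 8K, but with well under 1GB to spare before the desktop or a growing context claims it, and Q8_0 comes out negative.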

The quality cliff is not where you think

People assume Q4 is a big quality drop. It is not, for most work. On 8B models the perplexity gap between Q4_K_M and Q8_0 is small. You feel it on hard reasoning and long code generation, not on summaries or chat.

The real cliff is below Q4. Q3 and Q2 quants on an 8B model start to wander. Repetition. Dropped instructions. In an agent loop that is fatal, because one dropped instruction cascades into a broken multi-step run. If you are tempted by Q2 to fit a model, get a bigger card or a smaller model instead.

A quick test I run on every new quant

Before I trust a quant for agent work, I run the same three prompts:

  1. A 40-line Python refactor. Checks code coherence.
  2. A multi-step instruction with a format constraint. Checks instruction following, which is what agents live or die on.
  3. A long summary of a 3K-word doc. Checks context handling.

If a quant passes all three, it is good enough for my agent workloads. Q4_K_M on Llama 3.1 8B passes all three on the 3070. That is my daily driver.
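The three prompts can be wired into a small pass/fail harness. This is a minimal sketch, not my exact test suite. `generate` is a hypothetical callable that sends a prompt to whatever runtime you use and returns the reply text, and the specific checks (parses as Python, matches a JSON shape, stays under a word cap) are stand-ins for reading the output yourself.

```python
import ast
import json
from typing import Callable


def check_refactor(reply: str) -> bool:
    """Code coherence: the reply must be non-empty and parse as Python."""
    if not reply.strip():
        return False
    try:
        ast.parse(reply)
    except SyntaxError:
        return False
    return True


def check_format(reply: str) -> bool:
    """Instruction following: a JSON object with exactly 'steps' (3 items) and 'done' (bool)."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and set(data) == {"steps", "done"}
        and isinstance(data["steps"], list)
        and len(data["steps"]) == 3
        and isinstance(data["done"], bool)
    )


def check_summary(reply: str, max_words: int = 200) -> bool:
    """Context handling proxy: a non-empty summary that respects the word cap."""
    return 0 < len(reply.split()) <= max_words


def run_suite(generate: Callable[[str], str], prompts: dict) -> dict:
    """Run each prompt through generate() and apply the matching check."""
    checks = {"refactor": check_refactor, "format": check_format, "summary": check_summary}
    return {name: check(generate(prompts[name])) for name, check in checks.items()}
```

A reply wrapped in markdown fences will fail the parse checks as written, so strip fences first if your model adds them.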

The takeaway, and the half of the bill VRAM does not cover

Pick your quant from your VRAM, not from a vibe. Measure the weights, add the KV cache for your context length, leave 10 percent headroom for the loop. If it fits on the GPU, you keep full speed. If it spills, you lose more from CPU offload than you ever gain from a higher quant.

But once the model fits and the agent is running all day, the cost moves from VRAM to runtime. A local model is cheap per token, yet an agent that loops without limits can still burn hours of GPU time and thousands of tokens on a task that should have stopped. Fitting the model is step one. Bounding the loop is step two.
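The idea of bounding the loop can be sketched generically. To be clear, this is not AgentGuard's API. `LoopBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names for illustration, and `step` stands in for one model call that reports how many tokens it used.

```python
import time
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the loop crosses any hard ceiling."""


class LoopBudget:
    def __init__(self, max_steps: int, max_tokens: int, max_seconds: float):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.steps = 0
        self.tokens = 0
        self.started = time.monotonic()

    def charge(self, tokens_used: int) -> None:
        """Record one completed model call and raise if any ceiling is now exceeded."""
        self.steps += 1
        self.tokens += tokens_used
        elapsed = time.monotonic() - self.started
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"time limit {self.max_seconds}s exceeded")


def run_agent(step: Callable[[], Tuple[int, bool]], budget: LoopBudget) -> int:
    """Call step() until it reports done; returns the number of steps taken.

    The charge happens after each call, because token usage is only known then,
    so the call that crosses a ceiling has already run when the loop stops.
    """
    while True:
        tokens_used, done = step()
        budget.charge(tokens_used)
        if done:
            return budget.steps
```

The design choice worth copying is that the ceiling lives outside the agent's own logic, so a model that never decides it is done still gets stopped.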

That second step is exactly what I built AgentGuard for. It is a runtime budget and rate limiter that stops an agent before it runs away on tokens, time, or spend, whether you are on a local GGUF model or a hosted API. One line to install, one hard ceiling to set. Once your quant fits the card, put a ceiling on the loop. Check it out at https://bmdpat.com/tools/agentguard.

Patrick Hughes

Building BMD HODL, a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.
