[bmdpat]

Local LLM Toolkit

VRAM calculator for local LLMs.

Pick a model, quant, context window, and GPU. Get a rough `--n-gpu-layers` value before you waste an evening on trial and error.

Calculator


CPU offload

53/80 layers on GPU

35.7GB required, 24GB available

Full GPU load needs about 11.7GB more VRAM.

GPU VRAM used: 23.8GB

Model weights: 35GB

KV cache: 0.2GB

Speed estimate: 9.2-17.6 tok/s

llama.cpp flag: `--n-gpu-layers 53`

The model can run with CPU offload. Expect lower throughput and more system RAM pressure.
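The 53-layer figure above is consistent with a simple proportional split of the total requirement across layers. The calculator's actual formula is not shown on this page, so the following is only a minimal sketch under that assumption; `estimate_gpu_layers` is a hypothetical helper for illustration, not part of llama.cpp.

```python
import math


def estimate_gpu_layers(total_layers: int, required_gb: float, available_gb: float) -> int:
    """Rough layer count for --n-gpu-layers.

    Assumption: every layer costs an equal share of required_gb.
    Real layers, KV cache, and runtime overhead do not split this evenly.
    """
    if total_layers <= 0 or required_gb <= 0:
        raise ValueError("total_layers and required_gb must be positive")
    if available_gb >= required_gb:
        # Everything fits, so all layers can stay on the GPU.
        return total_layers
    # Round down so the estimate never exceeds the available VRAM.
    return max(0, math.floor(total_layers * available_gb / required_gb))


# Numbers taken from the calculator result above.
layers = estimate_gpu_layers(total_layers=80, required_gb=35.7, available_gb=24.0)
shortfall_gb = round(35.7 - 24.0, 1)

print(f"--n-gpu-layers {layers}")
print(f"Full GPU load needs about {shortfall_gb}GB more VRAM")
```

Treat the result as a starting point: if the model fails to load or runs out of memory, lower the value by a few layers and try again.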

Default target: llama.cpp

Best use: GGUF planning


FAQ

How much VRAM do I need for Llama 3 70B?
A 70B model usually needs about 35GB for Q4 weights, before runtime overhead and KV cache are added. A 24GB card can run it with partial GPU offload, but full GPU residency usually needs a 48GB-class card or multiple GPUs.
What does --n-gpu-layers do in llama.cpp?
`--n-gpu-layers` controls how many transformer layers llama.cpp keeps on the GPU. Higher values are faster when they fit in VRAM. Lower values spill more work to CPU and system RAM.
Can I run a 70B model on 24GB VRAM?
Yes, but usually not fully in VRAM. Use a Q4 or smaller quantization, offload as many layers as fit, and expect CPU offload to reduce tokens per second.
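
The 35GB figure in the first answer follows from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below assumes a nominal 4 bits per weight for Q4; real GGUF Q4 variants store some extra scale data and come out somewhat larger. The KV cache helper uses the standard keys-plus-values formula. The layer count, KV head count, head size, and 512-token context in the example are assumptions for illustration, not values taken from this page. Both function names are hypothetical.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB.

    The leading factor 2 covers keys and values; bytes_per_elem=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


# 70B parameters at a nominal 4 bits per weight (assumption for Q4).
q4_weights = weights_gb(70e9, 4)

# Assumed shape for a 70B Llama-style model with grouped-query attention,
# and a hypothetical 512-token context window.
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=512)

print(f"Q4 weights: about {q4_weights:.0f}GB")
print(f"KV cache:   about {kv:.2f}GB")
```

The KV cache grows linearly with context length, so a long context window can take a noticeable share of VRAM away from model layers.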
