Local LLM Toolkit
VRAM calculator for local LLMs.
Pick a model, quant, context window, and GPU. Get the rough `--n-gpu-layers` value before you waste an evening on trial and error.
Calculator
CPU offload
53/80 layers on GPU
35.7GB required, 24GB available
Full GPU load needs about 11.7GB more VRAM.
GPU VRAM used
23.8GB
Model weights
35GB
KV cache
0.2GB
Speed estimate
9.2-17.6 tok/s
llama.cpp flag
`--n-gpu-layers 53`
The model can run with CPU offload. Expect lower throughput and more system RAM pressure.
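The readout above can be roughly reproduced by hand. The sketch below is not the calculator's actual code. It assumes weights are spread evenly across layers and that about 0.5GB is reserved for runtime overhead, a figure inferred from the gap between 35.7GB required and 35GB weights plus 0.2GB KV cache. The function name `gpu_layers_that_fit` and its parameters are hypothetical.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb,
                        kv_cache_gb=0.0, overhead_gb=0.5):
    """Rough count of layers that fit on the GPU.

    Assumptions (not confirmed by the calculator):
    - every layer holds an equal share of the weights
    - overhead_gb is a flat reserve for runtime buffers
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    # Round down: a partially resident layer does not count.
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Values from the readout: 35GB weights, 80 layers, 24GB card, 0.2GB KV cache.
layers = gpu_layers_that_fit(35, 80, 24, kv_cache_gb=0.2)
print(f"--n-gpu-layers {layers}")
```

With these inputs the sketch lands on the same 53 layers as the readout. Treat the result as a starting point and step it down if llama.cpp reports an out-of-memory error at load time.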
Default target
llama.cpp
Best use
GGUF planning
FAQ
- How much VRAM do I need for Llama 3 70B?
- A 70B model usually needs about 35GB for Q4 weights, before runtime overhead and KV cache. A 24GB card can run it with partial GPU offload, but keeping the whole model in VRAM usually takes a 48GB-class card or multiple GPUs.
- What does --n-gpu-layers do in llama.cpp?
- `--n-gpu-layers` controls how many transformer layers llama.cpp keeps on the GPU. Higher values are faster when they fit in VRAM. Lower values spill more work to CPU and system RAM.
- Can I run a 70B model on 24GB VRAM?
- Yes, but usually not fully in VRAM. Use a Q4 or smaller quantization, offload as many layers as fit, and expect CPU offload to reduce tokens per second.
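The 35GB figure in the answers above follows from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. Here is a minimal sketch, assuming exactly 4 bits per weight for Q4. Real GGUF Q4 variants store somewhat more than 4 bits per weight because of quantization scale data, so treat the result as a lower bound. The helper names and the 0.5GB overhead default are hypothetical, not llama.cpp values.

```python
def weights_gb(n_params_billion, bits_per_weight):
    """Approximate weight size in GB (decimal): params * bits / 8."""
    return n_params_billion * bits_per_weight / 8


def fits_fully(weights, vram_gb, kv_cache_gb=0.0, overhead_gb=0.5):
    """True if weights, KV cache, and an assumed overhead fit in VRAM."""
    return weights + kv_cache_gb + overhead_gb <= vram_gb


q4_70b = weights_gb(70, 4)  # 70B parameters at an assumed 4 bits each
print(q4_70b)                                    # 35.0
print(fits_fully(q4_70b, 24, kv_cache_gb=0.2))   # False: needs CPU offload
print(fits_fully(q4_70b, 48, kv_cache_gb=0.2))   # True: 48GB-class card
```

The same two helpers cover other sizes and quants: pass a different parameter count or bits-per-weight value to see where the full-residency threshold falls.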