Local LLM Toolkit
VRAM calculator for local LLMs.
Pick a model, quant, context window, and GPU. The default GGUF check estimates that 53 of 80 layers fit on the GPU for Llama 3.3 70B at Q4_K_M on an RTX 4090 24GB, then gives you --n-gpu-layers 53 so you can skip the trial and error.
Calculator
CPU offload
53/80 layers on GPU
Full GPU load needs about 11.7GB more VRAM.
GPU VRAM used
23.8GB
Model weights
35GB
KV cache
0.2GB
Speed estimate
9.2-17.6 tok/s
--n-gpu-layers 53
The model can run with CPU offload. Expect lower throughput and more system RAM pressure.
Needs more VRAM
Some links are affiliate links. If you buy or rent through them I may earn a commission at no extra cost to you.
Get the local-LLM cost digest
New GGUF quant benchmarks, VRAM math, and what actually runs on consumer GPUs. M-F, only when there is something worth sending.
Single opt-in for the local-LLM newsletter. Unsubscribe anytime. Privacy.
Default target
llama.cpp
Scope
7 model presets / 11 GPUs / 9 quant levels
Recent usage
0 tracked runs / 30d
FAQ
- How much VRAM do I need for Llama 3 70B?
- Llama 3.3 70B at Q4_K_M weighs about 35GB before runtime overhead and KV cache. At the calculator's default 4K-token context, the full estimate is 35.7GB. A 24GB card can run it with partial GPU offload, but full GPU residency usually needs a 48GB-class card, a unified-memory machine, or multiple GPUs.
- What does --n-gpu-layers do in llama.cpp?
- `--n-gpu-layers` controls how many transformer layers llama.cpp keeps on the GPU. Higher values are faster when they fit in VRAM. Lower values spill more work to CPU and system RAM.
- Can I run a 70B model on 24GB VRAM?
- Yes, but usually not fully in VRAM. On an RTX 4090 24GB, the calculator currently estimates that 53 of 80 layers fit on the GPU and emits `--n-gpu-layers 53` at the default 4K-token context; the remaining 27 layers run on the CPU. Use Q4_K_M or smaller, keep as many layers on the GPU as fit, and expect CPU offload to reduce tokens per second.
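
The arithmetic behind these answers can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual formula: it assumes all 80 layers are the same size, and it assumes a 0.5GB runtime overhead, which is simply the gap between the 35.7GB full estimate and the 35GB weights plus 0.2GB KV cache shown on this page. The function name `estimate_gpu_layers` is hypothetical.

```python
import math


def estimate_gpu_layers(weights_gb, n_layers, vram_gb, kv_cache_gb, overhead_gb):
    """Estimate how many equally sized layers fit in VRAM after fixed costs."""
    per_layer_gb = weights_gb / n_layers              # assumption: uniform layer size
    budget_gb = vram_gb - kv_cache_gb - overhead_gb   # VRAM left for weights
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Figures taken from the page: Llama 3.3 70B, Q4_K_M, 4K context, RTX 4090 24GB.
weights_gb = 35.0
kv_cache_gb = 0.2
overhead_gb = 0.5   # assumption: inferred as 35.7 - 35 - 0.2
vram_gb = 24.0
n_layers = 80

full_gb = weights_gb + kv_cache_gb + overhead_gb
shortfall_gb = full_gb - vram_gb
gpu_layers = estimate_gpu_layers(weights_gb, n_layers, vram_gb, kv_cache_gb, overhead_gb)

print(f"Full estimate: {full_gb:.1f}GB, shortfall on this GPU: {shortfall_gb:.1f}GB")
print(f"--n-gpu-layers {gpu_layers}")
```

Under these assumptions the sketch lands on the same 53-layer split as the calculator. Real GGUF files do not have perfectly uniform layers, so treat the result as a starting value and lower it if llama.cpp runs out of memory at load time.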
§ 002 / PRICING
Unlimited local LLM decisions with Pro.
The toolkit is free for up to 5 runs per tool per day. Upgrade to Pro to remove the limit and keep your rig history in one place.
Free
$0
- 5 free runs per tool per day
- Standard GPU presets
Pro
$7/mo
- Unlimited calculator runs
- Save my rig and get new-fit alerts
- Import custom models from Hugging Face URLs
- Benchmark history across model and quant choices
- Early access to new toolkit surfaces
- No ads
Or $49/year
Get the local AI lab notes
Benchmark rows, VRAM fit checks, quant choices, and what actually runs on consumer GPUs. M-F, only when there is something worth sending.