How much VRAM do I need for Llama 3 70B?

Llama 3.3 70B at Q4_K_M weighs about 35GB before runtime overhead and KV cache. At the calculator's default 4K-token context, the full estimate is 35.7GB. A 24GB card can run it with partial GPU offload, but full GPU residency usually needs a 48GB-class card, a unified-memory machine, or multiple GPUs.
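The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual code: the flat 4.0 bits per weight and the 0.5GB runtime overhead are assumptions chosen because they reproduce the 35GB and 35.7GB figures above, and the 0.2GB KV cache is the calculator's own estimate at 4K tokens. Real Q4_K_M files mix several quantization types and tend to average somewhat more than 4 bits per weight, so treat the result as a lower-bound estimate.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / (1e9 bytes per GB)
    return params_billion * bits_per_weight / 8


def total_vram_gb(weights: float, kv_cache: float, overhead: float) -> float:
    """Weights plus KV cache plus runtime overhead, all in GB."""
    return weights + kv_cache + overhead


# Assumed inputs: 70B parameters at a flat 4 bits per weight,
# 0.2GB KV cache (calculator figure), 0.5GB overhead (assumed).
weights = weights_gb(70, 4.0)
total = total_vram_gb(weights, kv_cache=0.2, overhead=0.5)
print(f"weights: {weights:.1f}GB, total: {total:.1f}GB")
```

Comparing `total` with the VRAM on your card shows at a glance whether full GPU residency is realistic or partial offload is needed.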

What does --n-gpu-layers do in llama.cpp?

`--n-gpu-layers` controls how many transformer layers llama.cpp keeps on the GPU. Higher values are faster as long as the offloaded layers fit in VRAM. Lower values leave more layers, and more of the work, on the CPU and in system RAM.
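As a minimal sketch of how the flag is passed, the snippet below builds a llama.cpp command line in Python without running it. The binary name `llama-cli` assumes a recent llama.cpp build, the model path is hypothetical, and the helper `llama_cli_args` is an illustrative name rather than part of any library.

```python
import shlex


def llama_cli_args(model_path: str, gpu_layers: int, ctx_size: int = 4096) -> list[str]:
    """Build an argument list for llama.cpp's CLI (assumed binary: llama-cli)."""
    return [
        "llama-cli",
        "-m", model_path,                    # path to the GGUF file
        "--n-gpu-layers", str(gpu_layers),   # layers kept on the GPU
        "--ctx-size", str(ctx_size),         # context window in tokens
    ]


# Hypothetical model path; 53 is the calculator's suggested value for a 24GB card.
args = llama_cli_args("models/llama-3.3-70b-q4_k_m.gguf", gpu_layers=53)
print(shlex.join(args))
```

Passing `args` to `subprocess.run` would start the CLI if the binary is on your PATH. If loading fails with an out-of-memory error, lowering `gpu_layers` is the first thing to try.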

Can I run a 70B model on 24GB VRAM?

Yes, but usually not fully in VRAM. On a 24GB RTX 4090, the calculator currently estimates that 53 of 80 layers fit on the GPU and emits `--n-gpu-layers 53` at the default 4K-token context. Use Q4_K_M or a smaller quant, offload as many layers as fit, and expect the layers left on the CPU to reduce tokens per second.
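One way to arrive at a layer count like this is sketched below. It is an illustration under stated assumptions, not the calculator's actual method: the weights are assumed to split evenly across the 80 layers, and the 0.2GB KV cache plus an assumed 0.5GB of runtime overhead are reserved on the GPU before any layers are placed.

```python
import math


def gpu_layers_that_fit(vram_gb: float, weights_gb: float, n_layers: int,
                        kv_cache_gb: float, overhead_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = weights_gb / n_layers
    # Reserve room for KV cache and runtime overhead first.
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# 24GB card, 35GB of weights, 80 layers, 0.2GB KV cache, 0.5GB overhead (assumed).
layers = gpu_layers_that_fit(24, 35, 80, 0.2, 0.5)
shortfall_gb = (35 + 0.2 + 0.5) - 24

print(f"--n-gpu-layers {layers}")
print(f"full GPU load needs about {shortfall_gb:.1f}GB more VRAM")
```

With these inputs the sketch lands on 53 layers and an 11.7GB shortfall. Real layers are not perfectly uniform and drivers reserve some memory, so it is safer to start a layer or two below the estimate.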

Local LLM Toolkit

VRAM calculator for local LLMs.

Pick a model, quant, context window, and GPU. With the defaults, the GGUF check estimates 53/80 GPU layers for Llama 3.3 70B at Q4_K_M on a 24GB RTX 4090 and gives you `--n-gpu-layers 53` up front, before any trial and error.

Calculator

Model / Quantization / GPU / VRAM (GB) / Context

CPU offload

53/80 layers on GPU

35.7GB required / 24GB available

Full GPU load needs about 11.7GB more VRAM.

GPU VRAM used: 23.8GB
Model weights: 35GB
KV cache: 0.2GB
Speed estimate: 9.2-17.6 tok/s
llama.cpp flag: `--n-gpu-layers 53`

The model can run with CPU offload. Expect lower throughput and more system RAM pressure.

Default target: llama.cpp
Scope: 7 model presets / 11 GPUs / 9 quant levels

§ 002 / PRICING

Unlimited local LLM decisions with Pro.

The toolkit is free for up to 5 runs per tool per day. Upgrade to Pro to remove the limit and keep your rig history in one place.

Free

5 free runs per tool per day
Standard GPU presets

Pro

$7/mo

Unlimited calculator runs
Save my rig and get new-fit alerts
Import custom models from Hugging Face URLs
Benchmark history across model and quant choices
Early access to new toolkit surfaces
No ads

Or $49/year
