llama.cpp n_gpu_layers Explained: What -1, 0, and 99 Actually Do
What does n_gpu_layers mean in llama.cpp? How -1 offloads everything to GPU, why 0 keeps you on CPU, and how to pick the right number for your VRAM. Includes benchmarks and VRAM math.
If you have ever tried running a local LLM with llama.cpp, you have seen this flag:
--n-gpu-layers 99
Or maybe you set it to -1, or left it at 0, and got wildly different performance. This post explains what n_gpu_layers actually controls, how to calculate the right value for your hardware, and what happens when you get it wrong.
What n_gpu_layers Controls
Large language models are built from stacked transformer blocks. An 8B-parameter model like Llama 3.1 8B has 32 of these blocks (called "layers"). Each layer contains attention weights and feed-forward weights that consume GPU memory (VRAM) when loaded.
The --n-gpu-layers flag (or -ngl for short) tells llama.cpp how many of those layers to move from system RAM onto your GPU. The more layers on the GPU, the faster inference runs, because GPUs process matrix math orders of magnitude faster than CPUs.
Here is what the common values mean:
| Value | What It Does |
|---|---|
| 0 | All layers stay on CPU. No GPU used at all. |
| 1 to N | That many layers load onto the GPU; the rest stay on CPU. |
| 99 or 999 | Attempts to load all layers onto the GPU. Excess is ignored. |
| -1 | Special value: offload every layer to the GPU. Same effect as 999. |
In practice, -ngl 99 and -ngl -1 do the same thing: both tell llama.cpp to put as many layers as possible on the GPU. The difference is cosmetic. Many people use 99 or 999 simply because an explicit large number is easier to read.
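As a quick sketch, the two invocations below are interchangeable on a 32-layer model (the model path and prompt are placeholders):

```shell
# Both requests are capped at the model's real layer count (32 here),
# so they load the same 32 layers onto the GPU.
./llama-cli -m model.gguf -ngl 999 -p "hello"
./llama-cli -m model.gguf -ngl -1 -p "hello"
```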
How Many Layers Does Your Model Have?
The number of layers depends on the model architecture:
| Model | Parameters | Layers |
|---|---|---|
| Llama 3.1 8B | 8B | 32 |
| Llama 3.1 70B | 70B | 80 |
| Mistral 7B | 7B | 32 |
| Phi-3 Mini | 3.8B | 32 |
| SmolLM2 1.7B | 1.7B | 24 |
| Qwen2 72B | 72B | 80 |
You can check your model's layer count in the GGUF metadata. When you load a model, llama.cpp prints something like llama.block_count = 32 in the startup logs. That is your total layer count.
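One way to pull that line out of the logs without scrolling, sketched as a shell pipeline (the model path is a placeholder, and the exact log wording can vary between llama.cpp versions):

```shell
# Load the model without generating anything, then grep the layer count.
# The key is architecture-prefixed, e.g. llama.block_count or qwen2.block_count.
./llama-cli -m models/llama-3.1-8b-q4_k_m.gguf -n 0 2>&1 | grep block_count
```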
VRAM Math: How to Pick the Right Number
The formula is straightforward. Take your model file size (the GGUF file on disk) and divide by the number of layers. That gives you a rough per-layer VRAM cost.
Example: Llama 3.1 8B Q4_K_M (4.9 GB file, 32 layers)
- Per layer: ~153 MB
- 16 layers on GPU: ~2.4 GB VRAM
- All 32 layers on GPU: ~4.9 GB VRAM
- Plus overhead (KV cache, context buffer): ~1-2 GB extra
So a 6 GB GPU (like an RTX 2060) can fit all 32 layers of this model with room for a small context window. A 16 GB GPU (like an RTX 5070 Ti) can fit it with a 32K context window and still have headroom.
Example: Llama 3.1 70B Q4_K_M (~40 GB file, 80 layers)
- Per layer: ~500 MB
- 16 GB GPU: fits about 28-30 layers (after context overhead)
- 24 GB GPU (RTX 4090): fits about 42-44 layers
- The rest spills to system RAM
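The arithmetic behind both examples can be sketched as a small shell function. Sizes are in MB, and the overhead values are the same rough 1.5-2 GB estimates used above:

```shell
# How many layers fit: (VRAM - overhead) / (file size / layer count).
# All sizes in MB; bash integer math, so results round down (conservative).
layers_that_fit() {
  local model_mb=$1 layers=$2 vram_mb=$3 overhead_mb=$4
  local per_layer=$(( model_mb / layers ))
  echo $(( (vram_mb - overhead_mb) / per_layer ))
}

layers_that_fit 4900 32 6144 1500     # Llama 3.1 8B on a 6 GB card  -> 30
layers_that_fit 40000 80 16384 2000   # Llama 3.1 70B on a 16 GB card -> 28
```

If the result meets or exceeds the model's total layer count, just use -ngl -1.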
When layers spill to CPU, you get "partial offloading." The GPU handles some layers, the CPU handles others, and data shuttles back and forth over PCIe. This is slower than full GPU offloading but still faster than CPU-only.
Performance: CPU vs Partial vs Full GPU
Real benchmark data from a Ryzen 5900X + RX 7900 XT running SmolLM2 1.7B (source: SteelPh0enix's llama.cpp guide):
| Configuration | Prompt Processing | Text Generation |
|---|---|---|
| CPU only (0 layers) | 165 tok/s | 22 tok/s |
| Full GPU offload | 880 tok/s | 90 tok/s |
That is a 5.3x speedup for prompt processing and a 4x speedup for text generation. The difference between 22 tokens per second and 90 tokens per second is the difference between "painfully slow" and "usable in production."
For larger models on more powerful hardware, NVIDIA reports the RTX 4090 hitting roughly 150 tokens per second on Llama 3 8B int4 with full GPU offloading (100-token input, 100-token output sequences).
Partial offloading falls somewhere in between. If you can only fit 20 out of 32 layers on GPU, expect roughly 60-70% of full GPU performance. The exact number depends on which layers are offloaded and how fast your PCIe bus moves data.
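Since throughput scales roughly linearly with the fraction of layers on the GPU, you can sketch a back-of-the-envelope estimate by interpolating between the CPU-only and full-GPU numbers. This is a rough model under that linearity assumption, not a benchmark:

```shell
# Linear interpolation between CPU-only and full-GPU generation speed.
# Assumption: throughput scales with the fraction of layers offloaded.
estimate_tps() {
  local cpu_tps=$1 gpu_tps=$2 ngl=$3 total_layers=$4
  echo $(( cpu_tps + (gpu_tps - cpu_tps) * ngl / total_layers ))
}

estimate_tps 22 90 20 32   # 20 of 32 layers offloaded -> ~64 tok/s
```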
Common Problems and Fixes
"I set n_gpu_layers to 99 but it is still slow"
Check that your llama.cpp build actually has GPU support compiled in. If you built from source without CUDA, Vulkan, or Metal flags, the flag gets silently ignored. Run llama-cli --help and look for GPU-related options. If they are missing, rebuild with the right backend.
On Linux with NVIDIA:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
On Mac with Metal (M-series):
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
"Out of memory" errors
Lower the value. If -ngl 32 crashes, try -ngl 24 or -ngl 16. You can also reduce the context size with --ctx-size 2048 to free up VRAM for more layers. Or use a more aggressively quantized model (Q3_K_M instead of Q4_K_M).
"n_gpu_layers -1 does not work on AMD"
Some AMD GPU users on older ROCm versions have reported that -ngl -1 does not trigger offloading. The workaround: use a large explicit number like --n-gpu-layers 999 instead.
Shared VRAM on Windows (integrated graphics)
On Windows systems with shared VRAM, llama.cpp might report using your GPU but actually consume ~20 GB of system RAM with minimal speed benefit. This is because shared memory goes through the CPU memory bus anyway. If you have a dedicated GPU, make sure llama.cpp is targeting it, not the integrated graphics.
Practical Recommendations
If your model fits entirely in VRAM:
Use -ngl -1 or -ngl 999. Full offloading gives the best performance.
If your model is too large for your GPU: Count backward. Take your VRAM, subtract 1-2 GB for context overhead, and divide what remains by the per-layer size (GGUF file size divided by layer count). That is your layer count.
For example, with 8 GB VRAM and a 4.9 GB model:
- Available after overhead: ~6.5 GB
- 4.9 GB / 32 layers = 153 MB per layer
- 6.5 GB / 153 MB = ~42 layers (more than the model has, so use -1)
With 8 GB VRAM and a 40 GB model:
- Available after overhead: ~6.5 GB
- 40 GB / 80 layers = 500 MB per layer
- 6.5 GB / 500 MB = ~13 layers
- Use -ngl 13
If you have no dedicated GPU: Leave it at 0 and accept CPU-only speeds. On a modern CPU with AVX2, expect 1-3 tokens per second for 7B models. It works, but it is not fast enough for real-time applications.
Running It
Here is a concrete command to serve Llama 3.1 8B with full GPU offloading as an OpenAI-compatible API:
./llama-server \
  --model models/llama-3.1-8b-q4_k_m.gguf \
  --n-gpu-layers -1 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080
This loads all layers onto your GPU, sets an 8K context window, and exposes an API endpoint you can hit with any OpenAI-compatible client. If you run out of VRAM, drop -ngl to something like 24 and reduce --ctx-size to 4096.
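Once the server is up, any OpenAI-compatible client can talk to it. A curl sketch, assuming the host and port from the command above:

```shell
# Chat completion request against the local llama-server endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```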
For a deeper look at running this in production (including hardware selection, benchmarks on an RTX 5070 Ti, and when local inference actually makes financial sense), see the full guide: Local LLM Inference on Consumer GPUs.
FAQ
What does n_gpu_layers = -1 mean in llama.cpp? It tells llama.cpp to offload all transformer layers to the GPU. It is equivalent to setting the value to a very large number like 999. If your model fits in VRAM, this gives maximum performance.
What happens if I set n_gpu_layers higher than the model's layer count? Nothing bad. llama.cpp caps the value at the actual layer count. Setting 999 on a 32-layer model just loads 32 layers. The extra is ignored.
How much VRAM do I need for full offloading? Roughly the size of the GGUF file on disk, plus 1-2 GB for context and overhead. A 4.9 GB Q4_K_M model needs about 6-7 GB total. A 40 GB Q4_K_M model needs about 42-44 GB total.
Is partial GPU offloading worth it? Yes. Even offloading 50% of layers to GPU gives a significant speedup over CPU-only. The relationship is roughly linear: half the layers on GPU gives roughly half the GPU-only speedup.
Does n_gpu_layers affect model quality or accuracy? No. It only changes where computation happens (CPU vs GPU). The math is identical either way. Your outputs will be the same regardless of the setting.