llama.cpp Multi-GPU: Splitting a Model Across Cards with --tensor-split
Split a 70B model across multiple GPUs with llama.cpp. How --tensor-split, --main-gpu, and --split-mode work on a real consumer rig.
If you are still tuning a single card, start here first: llama.cpp n-gpu-layers explained. That post covers how --n-gpu-layers moves layers onto one GPU and the VRAM math behind it.
This one is the next step. Once a model no longer fits on one card, you split it across several. A 70B model at Q4 is about 40 GB of weights, and the KV cache adds a few GB on top. No single consumer card has that much VRAM, but three cards together can. This is where --tensor-split comes in.
Most people who go multi-GPU are not running a one-off prompt. They are standing up a local model to feed an agent: a loop that calls the model over and over to read, plan, and act. That changes what matters. A bigger model on more cards is the capacity side of the problem. The other side is keeping that agent loop from running the rig hot all night on a task that should have stopped. Both sides need a plan before you load the weights. We will set up the hardware here and come back to the loop control AgentGuard gives you once the model is running.
The core idea
With one GPU, you only decide how many layers go on the card. With multiple GPUs, you decide how the layers get divided between them. llama.cpp loads the model once and spreads the weights across every GPU you point it at. Each card holds a slice. During inference, activations pass from one card to the next over the PCIe bus.
Three flags control this:
- --tensor-split sets the ratio of the model that lands on each GPU.
- --main-gpu picks which card holds the KV cache and coordinates the run.
- --split-mode chooses how the work is divided: by layer or by row.
--tensor-split: the ratio
--tensor-split takes a comma-separated list, one number per GPU. The numbers are weights, not gigabytes. llama.cpp normalizes them into proportions.
Say you have a 24 GB card and a 16 GB card. You want the bigger card to hold more of the model. A split of 24,16 puts 60 percent on GPU 0 and 40 percent on GPU 1. The exact integers do not matter, only the ratio. 24,16 and 3,2 do the same thing.
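The normalization is simple enough to check by hand. Here is a minimal Python sketch of the arithmetic; `normalize_split` is a made-up helper name for illustration, not part of llama.cpp.

```python
def normalize_split(weights):
    """Turn --tensor-split style weights into fractions that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# Both spellings of the same ratio give the same proportions.
print(normalize_split([24, 16]))  # [0.6, 0.4]
print(normalize_split([3, 2]))    # [0.6, 0.4]
```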
The goal is to match the split to each card's free VRAM. If you overload the smaller card, you get an out-of-memory crash mid-load. Leave headroom on whichever card holds the KV cache, because that buffer grows with context length.
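One way to turn that into numbers is to subtract the headroom you want from each card's free VRAM and use what is left as the split weights. This is a sketch only: the helper name and the headroom values are assumptions for illustration, and the right margins depend on your context size and whatever else is using the cards.

```python
def split_from_free_vram(free_gb, headroom_gb):
    """Per-GPU split weights: free VRAM minus the headroom to keep unused."""
    if len(free_gb) != len(headroom_gb):
        raise ValueError("need one headroom value per GPU")
    weights = [max(free - keep, 0) for free, keep in zip(free_gb, headroom_gb)]
    if sum(weights) == 0:
        raise ValueError("no usable VRAM left after headroom")
    return weights


# Hypothetical: 24 GB and 16 GB free, keeping 4 GB and 2 GB in reserve.
weights = split_from_free_vram([24, 16], [4, 2])
print("--tensor-split " + ",".join(str(w) for w in weights))
```

The result is only a starting ratio. Check it against the load log before trusting it.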
--main-gpu and the KV cache
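One card coordinates the run. That is --main-gpu, and it defaults to GPU 0. The llama.cpp help text describes it as the GPU used for the whole model with --split-mode none, and for intermediate results and the KV cache with --split-mode row. In the default layer mode, the help text says layers and KV are split across GPUs, so each card carries the cache for its own layers. Either way, point --main-gpu at your largest card so the extra buffers land where there is the most room.
A long context can add several gigabytes on top of the model slices. llama.cpp sizes the KV cache from --ctx-size, so a card with no headroom can hit OOM at a high context setting even though the same model loaded fine at a lower one. Keep the tightest margins off your smallest card.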
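To put a rough number on "several gigabytes", here is a back-of-envelope estimate in Python. It assumes an unquantized f16 cache and the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128). Those figures are assumptions brought in for illustration, not values read from llama.cpp, and the real allocation also includes compute buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2):
    """Keys plus values for every layer, one cache entry per context token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_size


GIB = 1024 ** 3

# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head dim 128, f16 cache.
for ctx in (8192, 32768):
    print(ctx, kv_cache_bytes(80, 8, 128, ctx) / GIB, "GiB")
# 8192 2.5 GiB
# 32768 10.0 GiB
```

Under these assumptions the cache costs 320 KiB per token, so quadrupling the context quadruples the cache.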
--split-mode: layer vs row
--split-mode layer is the default and the right choice on consumer hardware. Each GPU owns a contiguous block of layers. Cross-card traffic happens only at the layer boundaries, so PCIe bandwidth stays low.
--split-mode row splits individual tensors across cards. It can be faster, but only when the GPUs talk over a fast link like NVLink. Most consumer cards do not have NVLink: the RTX 3090 and 3090 Ti were the last GeForce cards with the connector. On a plain PCIe rig, row mode floods the bus and usually runs slower than layer mode. Stick with layer.
A worked example on a real rig
Here is a mixed consumer setup: an RTX 5090 (32 GB), an RTX 5070 Ti (16 GB), and an RTX 3070 (8 GB). Total is 56 GB of VRAM. That is enough to run a 70B model at Q4 fully on GPU, with the ~40 GB of weights spread across all three cards and room left for the cache.
| GPU | Index | VRAM | Split weight | Role |
|---|---|---|---|---|
| RTX 5090 | 0 | 32 GB | 32 | main-gpu, holds KV cache |
| RTX 5070 Ti | 1 | 16 GB | 16 | model slice |
| RTX 3070 | 2 | 8 GB | 6 | model slice (leave headroom) |
Note the 3070 gets a weight of 6, not 8. Leave a margin on the smallest card so a context spike does not push it over.
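A quick sanity check on those weights, as a Python sketch. It spreads the ~40 GB of weights by the split ratio and shows what each card has left. It assumes an exact proportional split, which is a simplification: in layer mode llama.cpp assigns whole layers, so the real per-card figures will differ somewhat.

```python
WEIGHTS_GB = 40  # approximate Q4 size of the 70B model, from the text above

rig = [  # (name, VRAM in GB, split weight), copied from the table
    ("RTX 5090", 32, 32),
    ("RTX 5070 Ti", 16, 16),
    ("RTX 3070", 8, 6),
]


def weight_share_gb(rig, weights_gb):
    """GB of model weights each card would hold under a proportional split."""
    total = sum(split for _, _, split in rig)
    return {name: weights_gb * split / total for name, _, split in rig}


shares = weight_share_gb(rig, WEIGHTS_GB)
for name, vram, _ in rig:
    left = vram - shares[name]
    print(f"{name}: {shares[name]:.1f} GB of weights, {left:.1f} GB left")
```

The leftover figure has to cover the KV cache, compute buffers, and anything the driver or desktop is already using, so treat it as an upper bound.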
The command:
    ./llama-cli \
      --model models/llama-3.1-70b-instruct-Q4_K_M.gguf \
      --n-gpu-layers 999 \
      --tensor-split 32,16,6 \
      --main-gpu 0 \
      --split-mode layer \
      --ctx-size 8192 \
      --prompt "Explain tensor parallelism in one paragraph."
--n-gpu-layers 999 still means "all layers on GPU." The difference now is that --tensor-split decides which GPU each layer lands on. Watch the load logs. llama.cpp prints how much VRAM it assigns to each device. If one card is near its limit, adjust the ratio down and rerun.
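If you would rather check the cards directly after the load finishes, nvidia-smi can report per-GPU memory in CSV form. The Python sketch below parses that output and flags any card above a threshold. The 90 percent threshold and the function names are assumptions for illustration; pick a margin that suits your context size.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total,
                     "used_frac": used / total})
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(out.stdout)


def near_limit(rows, threshold=0.9):
    """Indexes of GPUs whose used memory is at or above the threshold."""
    return [row["index"] for row in rows if row["used_frac"] >= threshold]
```

Calling `near_limit(read_gpu_memory())` once the model is loaded returns the cards whose split weight should come down.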
When multi-GPU helps and when it hurts
Multi-GPU helps when a model does not fit on your largest single card. Splitting a 70B across three cards turns "impossible" into "runs at usable speed." That is the whole point.
It hurts when the model already fits on one card. Splitting it then adds PCIe round-trips between cards for no benefit, and throughput drops. If your model fits on the 5090 alone, run it there and leave the other cards free.
Expect throughput to scale below linear. Three cards do not give you 3x the speed of one. Activations still hop across PCIe between layer blocks, and that adds latency. The win is capacity, not raw speed. You are trading some tokens per second for the ability to run a much larger model at all. A 70B at Q4 across this rig lands in a usable interactive range, slower than an 8B on a single card but far smarter.
That slower throughput is exactly where an agent loop gets dangerous. A 70B across three cards might do 10 to 15 tokens a second. An agent that retries, re-reads context, and re-plans can run for hours and you would not notice until the room is warm. The model is free to run, but the time and power are not, and a stuck loop produces nothing. This is the operational cost of local inference: not a per-token bill, but wasted hours on a runaway task. AgentGuard sits around the calls and enforces a hard ceiling on tokens and wall-clock spend, so the loop stops itself instead of grinding until morning. Set the cap before you point an agent at a multi-GPU model, not after the first overnight surprise.
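To make the idea of a hard ceiling concrete, here is a generic Python sketch of a token and wall-clock budget. It is not AgentGuard's API. The class, method names, and limits are all hypothetical and only illustrate the kind of check that stops a loop.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when the agent run passes its token or time ceiling."""


class LoopBudget:
    """Hard ceiling on total tokens and wall-clock seconds for one agent run."""

    def __init__(self, max_tokens, max_seconds, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.tokens_used = 0

    def charge(self, tokens):
        """Record tokens from one model call; raise once either cap is passed."""
        self.tokens_used += tokens
        elapsed = self._clock() - self._start
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token cap hit: {self.tokens_used} > {self.max_tokens}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(
                f"time cap hit: {elapsed:.0f}s > {self.max_seconds}s")


# Hypothetical limits: 200k tokens or one hour, whichever comes first.
budget = LoopBudget(max_tokens=200_000, max_seconds=3600)
# Inside the agent loop, after each model call:
#     budget.charge(prompt_tokens + completion_tokens)
```

The check runs after every call, so a stuck loop ends with an exception at the cap instead of running until someone notices.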
The short version
Single-card --n-gpu-layers is step one: how many layers fit on one GPU. --tensor-split is how you scale past one card: the ratio of the model each GPU holds. Set the ratio to match free VRAM, put --main-gpu on your largest card so the KV cache has room, and keep --split-mode layer unless you have NVLink. For the broader picture on running local models in production, see local LLM inference on consumer GPUs.
That gets the model running. The capacity side is solved. Before you wire the model into an agent, solve the loop side too: put AgentGuard around the calls so a runaway agent on your slow-but-smart 70B stops at a budget instead of burning the whole night.
Patrick Hughes
Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-Two agents.