June 8, 2026 · Updated July 22, 2026 · 7 min read

llama.cpp --tensor-split: Run a 70B Across 2 GPUs (2026)

Your 70B model won't fit on one GPU. Here's how llama.cpp --tensor-split, --main-gpu, and --split-mode spread it across two cards on a real consumer rig.

#llama.cpp #local-llm #gpu #multi-gpu #inference


TL;DR

--tensor-split takes ratio weights, not gigabytes. For a 24 GB card plus a 16 GB card, 24,16 and 3,2 mean the same 60/40 split.
Point --main-gpu at your largest card. It holds the KV cache, which grows with context and can OOM a small card even after a clean load.
--split-mode layer, the default, is the right choice on consumer hardware. Each GPU owns a contiguous block of layers.


If you are still tuning a single card, start here first: llama.cpp n-gpu-layers explained. That post covers how --n-gpu-layers moves layers onto one GPU and the VRAM math behind it.

This one is the next step. Once a model no longer fits on one card, you split it across several. A 70B model at Q4 is about 40 GB of weights, and it needs a few more GB of VRAM for the KV cache. No single consumer card has that much VRAM, but three cards together do. This is where --tensor-split comes in.
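As a rough check on the 40 GB figure, weight size is approximately parameter count times bits per weight. The sketch below assumes about 4.5 bits per weight for a 4-bit quant. That number is an illustrative assumption, not a measurement, and real GGUF files vary with the quant type.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~4.5 bits per weight for a 4-bit quant; real files vary.
print(round(model_size_gb(70e9, 4.5), 1))  # 39.4
```

This covers the weights only. The KV cache and compute buffers come on top.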

Most people who go multi-GPU are not running a one-off prompt. They are standing up a local model to feed an agent: a loop that calls the model over and over to read, plan, and act. That changes what matters. A bigger model on more cards is the capacity side of the problem. The other side is keeping that agent loop from running the rig hot all night on a task that should have stopped. Both sides need a plan before you load the weights. We will set up the hardware here and come back to the loop control AgentGuard gives you once the model is running.

The core idea

With one GPU, you only decide how many layers go on the card. With multiple GPUs, you decide how the layers get divided between them. llama.cpp loads the model once and spreads the weights across every GPU you point it at. Each card holds a slice. During inference, activations pass from one card to the next over the PCIe bus.

Three flags control this:

--tensor-split sets the ratio of the model that lands on each GPU.
--main-gpu picks which card holds the KV cache and coordinates the run.
--split-mode chooses how the work is divided: by layer or by row.

--tensor-split: the ratio

--tensor-split takes a comma-separated list, one number per GPU. The numbers are weights, not gigabytes. llama.cpp normalizes them into proportions.

Say you have a 24 GB card and a 16 GB card. You want the bigger card to hold more of the model. A split of 24,16 puts 60 percent on GPU 0 and 40 percent on GPU 1. The exact integers do not matter, only the ratio. 24,16 and 3,2 do the same thing.
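The normalization is easy to check by hand. Here is a minimal sketch of the arithmetic; normalize_split is a hypothetical helper name for illustration, not a llama.cpp function.

```python
def normalize_split(weights):
    """Turn --tensor-split style weights into per-GPU proportions."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


print(normalize_split([24, 16]))  # [0.6, 0.4]
print(normalize_split([3, 2]))    # [0.6, 0.4]
```

Both inputs produce the same proportions, which is why only the ratio matters.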

The goal is to match the split to each card's free VRAM. If you overload the smaller card, you get an out-of-memory crash mid-load. Leave headroom on whichever card holds the KV cache, because that buffer grows with context length.

--main-gpu and the KV cache

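One card has to run the orchestration. That is --main-gpu. It defaults to GPU 0. Where the KV cache lives depends on the split mode: llama.cpp's help text says layer mode splits layers and KV across GPUs, while row mode keeps intermediate results and KV on the main GPU. Either way, point --main-gpu at your largest card and give that card the largest share of the model, so the cache has room to grow as the context fills up.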

A long context can add several gigabytes to the main GPU on top of its model slice. If your main GPU is also your smallest card, you will hit OOM at high context even though the model loaded fine. Put the cache on the big card.
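To put a number on "several gigabytes," the total cache size can be estimated from the model shape. This is a sketch under stated assumptions: the shape values (80 layers, 8 KV heads, head dimension 128) are what I assume for Llama 3.1 70B, and the cache is assumed to be stored as f16 at 2 bytes per element. Check your model's metadata before relying on it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate total KV cache size for a given context length."""
    # The leading 2 covers one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed shape for Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, f16 cache.
for ctx in (8192, 32768):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 2**30
    print(f"ctx={ctx}: {gib:.1f} GiB")
# ctx=8192: 2.5 GiB
# ctx=32768: 10.0 GiB
```

The cache scales linearly with context, so quadrupling --ctx-size quadruples the cache.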

--split-mode: layer vs row

--split-mode layer is the default and the right choice on consumer hardware. Each GPU owns a contiguous block of layers. Cross-card traffic happens only at the layer boundaries, so PCIe traffic stays low.

--split-mode row splits individual tensors across cards. It can be faster, but only when the GPUs talk over a fast link like NVLink. Current consumer cards do not have NVLink; the RTX 3090 generation was the last GeForce line to support it. On a plain PCIe rig, row mode floods the bus and usually runs slower than layer mode. Stick with layer.
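The contiguous-block idea behind layer mode can be sketched in a few lines. This is an illustration, not llama.cpp's code: assign_layers is a hypothetical helper, the 80-layer count is an assumption for a 70B model, and the real loader's rounding and its handling of non-repeating tensors may differ.

```python
from bisect import bisect_right
from itertools import accumulate


def assign_layers(n_layers, weights):
    """Map each layer index to a GPU index using cumulative split points."""
    total = sum(weights)
    cumulative = [c / total for c in accumulate(weights)]
    last = len(weights) - 1
    # Layer i goes to the first GPU whose cumulative share exceeds i / n_layers.
    return [min(bisect_right(cumulative, i / n_layers), last)
            for i in range(n_layers)]


owners = assign_layers(80, [32, 16, 6])
print([owners.count(gpu) for gpu in range(3)])  # [48, 24, 8]
```

Each GPU ends up with one unbroken run of layers, so activations cross PCIe only at the two block boundaries.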

A worked example on a real rig

Here is a mixed consumer setup: an RTX 5090 (32 GB), an RTX 5070 Ti (16 GB), and an RTX 3070 (8 GB). Total is 56 GB of VRAM. That is enough to run a 70B model at Q4 fully on GPU, with the ~40 GB of weights spread across all three cards and room left for the cache.

GPU	Index	VRAM	Split weight	Role
RTX 5090	0	32 GB	32	main-gpu, holds KV cache
RTX 5070 Ti	1	16 GB	16	model slice
RTX 3070	2	8 GB	6	model slice (leave headroom)

Note the 3070 gets a weight of 6, not 8. Leave a margin on the smallest card so a context spike does not push it over.
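To see what that split means in gigabytes, multiply each card's proportion by the model size. A minimal sketch, assuming the ~40 GB weight figure from above; plan_split is a hypothetical helper, and the headroom it reports ignores the KV cache, compute buffers, and anything the desktop is already using.

```python
def plan_split(model_gb, weights, vram_gb):
    """Return (weights share in GB, leftover VRAM in GB) for each card."""
    total = sum(weights)
    plan = []
    for weight, vram in zip(weights, vram_gb):
        share = model_gb * weight / total
        plan.append((round(share, 1), round(vram - share, 1)))
    return plan


cards = ["RTX 5090", "RTX 5070 Ti", "RTX 3070"]
for name, (share, free) in zip(cards, plan_split(40, [32, 16, 6], [32, 16, 8])):
    print(f"{name}: {share} GB of weights, {free} GB left")
# RTX 5090: 23.7 GB of weights, 8.3 GB left
# RTX 5070 Ti: 11.9 GB of weights, 4.1 GB left
# RTX 3070: 4.4 GB of weights, 3.6 GB left
```

With a weight of 8 instead of 6, the 3070 would take about 5.7 GB of weights and have much less slack.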

The command:

./llama-cli \
  --model models/llama-3.1-70b-instruct-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --tensor-split 32,16,6 \
  --main-gpu 0 \
  --split-mode layer \
  --ctx-size 8192 \
  --prompt "Explain tensor parallelism in one paragraph."

--n-gpu-layers 999 still means "all layers on GPU." The difference now is that --tensor-split decides which GPU each layer lands on. Watch the load logs. llama.cpp prints how much VRAM it assigns to each device. If one card is near its limit, lower that card's weight in --tensor-split and rerun.
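Besides the load logs, you can check per-card usage with nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits, which reports values in MiB. The sketch below parses that CSV and flags cards above a threshold. near_limit and the 90 percent cutoff are my own illustrative choices, and the sample numbers are made up, not measured output.

```python
def near_limit(csv_text, threshold=0.9):
    """Return the GPU indexes whose used/total memory meets the threshold."""
    flagged = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        if used / total >= threshold:
            flagged.append(index)
    return flagged


# Illustrative numbers only (MiB), in the CSV shape nvidia-smi emits.
sample = """0, 25000, 32607
1, 12500, 16303
2, 7600, 8192"""
print(near_limit(sample))  # [2]
```

A flagged card is the one whose --tensor-split weight to reduce first.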

When multi-GPU helps and when it hurts

Multi-GPU helps when a model does not fit on your largest single card. Splitting a 70B across three cards turns "impossible" into "runs at usable speed." That is the whole point.

It hurts when the model already fits on one card. Splitting it then adds PCIe round-trips between cards for no benefit, and throughput drops. If your model fits on the 5090 alone, run it there and leave the other cards free.

Do not expect throughput to scale with the number of cards. Three cards do not give you 3x the speed of one. In layer mode the cards work in sequence on each token, and activations still hop across PCIe between layer blocks, which adds latency. The win is capacity, not raw speed. You are trading some tokens per second for the ability to run a much larger model at all. A 70B at Q4 across this rig lands in a usable interactive range, slower than an 8B on a single card but far smarter.

That slower throughput is exactly where an agent loop gets dangerous. A 70B across three cards might do 10 to 15 tokens a second. An agent that retries, re-reads context, and re-plans can run for hours and you would not notice until the room is warm. The model is free to run, but the time and power are not, and a stuck loop produces nothing. This is the operational cost of local inference: not a per-token bill, but wasted hours on a runaway task. AgentGuard sits around the calls and enforces a hard ceiling on tokens and wall-clock spend, so the loop stops itself instead of grinding until morning. Set the cap before you point an agent at a multi-GPU model, not after the first overnight surprise.
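The idea of a hard ceiling is simple enough to sketch. To be clear, this is not AgentGuard's API, which this post does not show. LoopBudget and BudgetExceeded are hypothetical names, and the 300 tokens per step is a stand-in for whatever one model call actually costs.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when the loop goes past its token or time ceiling."""


class LoopBudget:
    def __init__(self, max_tokens, max_seconds, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.tokens_used = 0

    def charge(self, tokens):
        """Record tokens for one call and stop the loop if a cap is exceeded."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self._clock() - self._start > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")


budget = LoopBudget(max_tokens=1000, max_seconds=3600)
steps = 0
try:
    while True:
        budget.charge(300)  # stand-in for the tokens one model call used
        steps += 1
except BudgetExceeded as exc:
    print(f"stopped after {steps} steps: {exc}")
# stopped after 3 steps: token budget exhausted
```

The point of the design is that the loop cannot decide to keep going: the cap is checked on every call, outside the agent's own logic.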

The short version

Single-card --n-gpu-layers is step one: how many layers fit on one GPU. --tensor-split is how you scale past one card: the ratio of the model each GPU holds. Set the ratio to match free VRAM, put --main-gpu on your largest card so the KV cache has room, and keep --split-mode layer unless you have NVLink. For the broader picture on running local models in production, see local LLM inference on consumer GPUs.

That gets the model running. The capacity side is solved. Before you wire the model into an agent, solve the loop side too: put AgentGuard around the calls so a runaway agent on your slow-but-smart 70B stops at a budget instead of burning the whole night.

FAQ

What does tensor-split do in llama.cpp?

--tensor-split tells llama.cpp what proportion of the model to place on each GPU. The split should reflect each card's free VRAM and leave room for runtime overhead such as the KV cache.

When should I use multiple GPUs for llama.cpp?

Use multiple GPUs when one card cannot fit the model and context you need. Verify performance because split models can be slower if PCIe transfer becomes the bottleneck.


Patrick Hughes

Building BMD HODL, a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.
