Localmaxxing isn't theory. Here's what my 3-GPU rig actually does.
Tom Tunguz called it localmaxxing. I run a 3070 + 5070 Ti + 5090 in one box and serve Llama 3.1 8B locally every day. Here are the real tokens-per-second, the real watts, and the real cost per million tokens.
Tom Tunguz wrote a post this week called Localmaxxing. His thesis: open-weight models on prosumer hardware now match cloud-tier quality for a sliver of the cost. The gap closed. The math flipped.
I've been running this setup for months. RTX 3070, RTX 5070 Ti, RTX 5090, all in one tower, serving Llama 3.1 8B through llama.cpp. So let me skip the thesis and put real numbers on the table.
The rig
One Threadripper box. Three GPUs. 56GB of total VRAM if you stack them, though I don't pool them for a single 8B model. I run Llama 3.1 8B in Q5_K_M quant. That fits comfortably on the 5090 alone with room to spare for a 32k context window.
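For the skeptical, here's the back-of-the-envelope behind "fits comfortably." A rough sketch: the weight size is the typical Q5_K_M GGUF for Llama 3.1 8B, and the KV-cache math assumes an fp16 cache and the model's published layout (32 layers, 8 KV heads, head dimension 128).

```python
# Rough VRAM estimate for Llama 3.1 8B Q5_K_M at 32k context.
# Assumes an fp16 KV cache and the model's GQA layout; numbers are approximate.

weights_gb = 5.7                     # typical Q5_K_M GGUF file size for Llama 3.1 8B

n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * n_kv_heads * head_dim * 2 * n_layers   # K + V, fp16, all layers
kv_cache_gb = 32_768 * bytes_per_token / 1024**3             # ~4.0 GB at 32k context

total_gb = weights_gb + kv_cache_gb                          # ~9.7 GB before compute buffers
print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB = ~{total_gb:.1f} GB")
```

Call it roughly 10GB with llama.cpp's compute buffers on top, against the 5090's 32GB.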
The 3070 and 5070 Ti run smaller models in parallel for different agent jobs. Embeddings on one, a 3B classifier on another. The 5090 is the workhorse.
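If you're wondering how that split is wired up, it's nothing clever: one llama-server process per card, pinned with CUDA_VISIBLE_DEVICES. A minimal sketch; the model files, ports, and context sizes below are placeholders, not my exact configs.

```python
# One llama-server process per GPU, each pinned to its card via CUDA_VISIBLE_DEVICES.
# GGUF files, ports, and context lengths are illustrative placeholders.
import os
import subprocess

deployments = [
    # (GPU index, GGUF file, port, context length)
    (0, "llama-3.1-8b-instruct-Q5_K_M.gguf", 8080, 32768),  # 5090: the workhorse
    (1, "llama-3.2-3b-instruct-Q5_K_M.gguf", 8081, 8192),   # 5070 Ti: small classifier
    (2, "nomic-embed-text-v1.5-Q8_0.gguf",   8082, 2048),   # 3070: embeddings (also needs the
                                                            # server's embedding mode enabled)
]

for gpu, model, port, ctx in deployments:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin this process to one card
    subprocess.Popen(
        ["llama-server", "-m", model, "--port", str(port),
         "-c", str(ctx), "-ngl", "99"],                      # 99 = offload every layer to VRAM
        env=env,
    )
```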
Tokens per second on Llama 3.1 8B
On the 5090, Q5_K_M, single batch, no flash-attention tweaks beyond defaults:
- Prompt processing: ~3,200 tok/s
- Generation: ~140 tok/s sustained
For comparison, Claude Opus and GPT-4-class APIs land around 30-80 tok/s on generation depending on load. My local 8B is faster than the frontier cloud APIs for raw throughput. It's a smaller model, so output quality is lower for hard reasoning. For 80% of agent work (classify, extract, summarize, route, format), it's plenty.
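If you want to sanity-check throughput on your own card, the simplest way is to time a completion against llama-server's OpenAI-compatible endpoint. A rough sketch, assuming the 8B server is on port 8080; the URL, prompt, and model name are placeholders, and the elapsed time includes prompt processing, so it slightly understates pure generation speed.

```python
# Quick-and-dirty generation throughput check against a local llama-server.
import time
import requests

url = "http://localhost:8080/v1/chat/completions"   # the 8B server from the rig section
payload = {
    "model": "llama-3.1-8b-instruct",                # informational when one model is loaded
    "messages": [{"role": "user", "content": "Write 500 words about consumer GPUs."}],
    "max_tokens": 512,
}

start = time.time()
resp = requests.post(url, json=payload, timeout=300).json()
elapsed = time.time() - start

out_tokens = resp["usage"]["completion_tokens"]
# Includes prompt-processing time, so true sustained generation speed is a bit higher.
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.0f} tok/s")
```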
Cost per million tokens
Cloud reference points:
- GPT-4o: ~$5 input / $15 output per million
- Claude Sonnet 4.5: ~$3 input / $15 output per million
- Llama 3.1 8B on Together / Fireworks: ~$0.20 per million blended
My local cost, including amortized hardware and Texas electricity at $0.11/kWh:
The 5090 pulls about 400W under sustained inference. At 140 tok/s, one hour of generation produces 504,000 tokens for 0.4 kWh, or about 4.4 cents. That's $0.087 per million output tokens. Round it to 9 cents.
Hardware amortization is the bigger line. Call it $2,200 for the 5090, amortized over 3 years of mixed use. If the card logs 1,000 hours of inference per year, that's $0.73 per hour, or about $1.45 per million tokens.
Total all-in: roughly $1.55 per million output tokens on local, versus $15 on Claude Sonnet for the same job class. Ten times cheaper.
Caveat: I'm comparing an 8B model to frontier models. Apples to small oranges. But for the agent jobs where 8B is good enough, the math is settled.
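Here's that arithmetic in one place, so you can swap in your own power price, card cost, and utilization. The inputs are the assumptions above, not universal constants.

```python
# Worked cost-per-million-output-tokens math from the numbers above.
# Change these assumptions to match your own card, power price, and utilization.

gen_tok_per_s   = 140          # sustained generation speed
power_kw        = 0.400        # 5090 draw under sustained inference
kwh_price       = 0.11         # $/kWh
card_cost       = 2200.0       # $ for the 5090
years, hours_py = 3, 1000      # amortization window and inference hours per year

tokens_per_hour = gen_tok_per_s * 3600                                # 504,000
electric_per_m  = (power_kw * kwh_price) / (tokens_per_hour / 1e6)    # ~$0.09 per million
amort_per_hour  = card_cost / (years * hours_py)                      # ~$0.73 per hour
amort_per_m     = amort_per_hour / (tokens_per_hour / 1e6)            # ~$1.45 per million

print(f"electricity: ${electric_per_m:.2f}/M  hardware: ${amort_per_m:.2f}/M  "
      f"total: ${electric_per_m + amort_per_m:.2f}/M")
```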
Where local wins, where it doesn't
Wins:
- High-volume classification and extraction
- Anything privacy-sensitive (client data, medical, financial)
- Latency-sensitive interactive flows (no network round trip)
- Burst workloads that would smash cloud rate limits
Loses:
- Hard reasoning, multi-step planning, code generation at frontier quality
- Anything where you actually need the model's knowledge depth
- Workloads with idle gaps where the GPU sits dark and you eat the depreciation anyway
The right move for most builders right now is hybrid. Cloud frontier for the hard 20%. Local 8B or 14B for the routine 80%. Route between them based on task class.
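The router itself doesn't need to be clever. A minimal sketch of task-class routing; the class names, endpoints, and model IDs are placeholders, and in practice the thing deciding the task class can itself be a cheap local call.

```python
# Minimal task-class router: routine work goes local, hard work goes to a frontier API.
# Task classes, endpoints, and model names are illustrative placeholders.

LOCAL_8B   = {"base_url": "http://localhost:8080/v1", "model": "llama-3.1-8b-instruct"}
CLOUD_SOTA = {"base_url": "https://api.anthropic.com", "model": "claude-sonnet-4-5"}

ROUTES = {
    "classify":  LOCAL_8B,
    "extract":   LOCAL_8B,
    "summarize": LOCAL_8B,
    "route":     LOCAL_8B,
    "format":    LOCAL_8B,
    "plan":      CLOUD_SOTA,   # multi-step planning
    "code":      CLOUD_SOTA,   # frontier-quality code generation
    "reason":    CLOUD_SOTA,   # hard reasoning
}

def pick_backend(task_class: str) -> dict:
    """Default to the frontier model when the task class is unknown."""
    return ROUTES.get(task_class, CLOUD_SOTA)
```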
What Tunguz is actually saying
His VC framing matters. Now that Tunguz has posted about local LLMs, every CTO who reads his Sunday digest just got cover to take this seriously. The conversation moved from "Patrick's weird hobby rig" to "tier-1 VC thesis" in one blog post.
If you've been waiting for permission to test a local-first or hybrid architecture, this is it.
What this means for cost-controlled agents
I built AgentGuard because cost is the thing that kills agent projects in production. Local LLMs don't make cost discipline optional. They make it more important, because now you have three cost dimensions (cloud spend, local electricity, local hardware amortization) instead of one.
The same AgentGuard policies that cap your cloud budget should cap your local inference budget too. A runaway loop on a local model still burns wattage, still keeps your GPU at 90C, still pegs your CPU. Free at the margin doesn't mean free in practice.
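Mechanically, that means metering local output tokens against a dollar budget the same way you meter cloud calls. A hypothetical sketch (illustrative only, not AgentGuard's actual API), using the all-in local rate from the cost section:

```python
# Hypothetical budget guard for local inference (not AgentGuard's real API).
# Charges each local call its estimated electricity + amortization cost and
# stops the loop once the daily budget is spent.
from dataclasses import dataclass

LOCAL_COST_PER_M_TOKENS = 1.55   # $/M output tokens, from the math above

@dataclass
class LocalBudget:
    daily_limit_usd: float
    spent_usd: float = 0.0

    def charge(self, output_tokens: int) -> None:
        self.spent_usd += output_tokens / 1e6 * LOCAL_COST_PER_M_TOKENS
        if self.spent_usd > self.daily_limit_usd:
            raise RuntimeError(
                f"local inference budget exceeded: ${self.spent_usd:.2f} "
                f"of ${self.daily_limit_usd:.2f}"
            )

budget = LocalBudget(daily_limit_usd=5.00)
budget.charge(output_tokens=120_000)   # call after every local completion
```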
If you want to dig deeper into the consumer-GPU production setup, I wrote about running local LLM inference on consumer GPUs earlier this year. That post covers the stack choices, the quant tradeoffs, and the model-routing logic I use.
Bottom line
Localmaxxing is real. The numbers are real. The hardware is in stock. The tools are stable.
If you're building agents and your cloud bill is climbing, the answer might not be a better prompt or a cheaper model tier. It might be a $2,000 GPU and a weekend with llama.cpp.
Then put a budget on it. AgentGuard handles that part.