Localmaxxing isn't theory. Here's what my 3-GPU rig actually does.
Tom Tunguz called it localmaxxing. I run a 3070 + 5070 Ti + 5090 in one box and serve Llama 3.1 8B locally every day. Here are the real tokens-per-second, the real watts, and the real cost per million tokens.
Tom Tunguz wrote a post this week called Localmaxxing. His thesis: open-weight models on prosumer hardware now match cloud-tier quality for a sliver of the cost. The gap closed. The math flipped.
I've been running this setup for months. RTX 3070, RTX 5070 Ti, RTX 5090, all in one tower, serving Llama 3.1 8B through llama.cpp. So let me skip the thesis and put real numbers on the table.
The rig
One Threadripper box. Three GPUs. 56GB of total VRAM if you stack them, though I don't pool them for a single 8B model. I run Llama 3.1 8B in Q5_K_M quant. That fits comfortably on the 5090 alone with room to spare for a 32k context window.
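For the skeptical, here's the back-of-the-envelope behind "fits comfortably." A rough sketch: the weight size is the typical Q5_K_M GGUF for Llama 3.1 8B, and the KV-cache math assumes an fp16 cache and the model's published layout (32 layers, 8 KV heads, head dimension 128).

```python
# Rough VRAM estimate for Llama 3.1 8B Q5_K_M at 32k context.
# Assumes an fp16 KV cache and the model's GQA layout; numbers are approximate.

weights_gb = 5.7                     # typical Q5_K_M GGUF file size for Llama 3.1 8B

n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * n_kv_heads * head_dim * 2 * n_layers   # K + V, fp16, all layers
kv_cache_gb = 32_768 * bytes_per_token / 1024**3             # ~4.0 GB at 32k context

total_gb = weights_gb + kv_cache_gb                          # ~9.7 GB before compute buffers
print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB = ~{total_gb:.1f} GB")
```

Call it roughly 10GB with llama.cpp's compute buffers on top, against the 5090's 32GB.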
The 3070 and 5070 Ti run smaller models in parallel for different agent jobs. Embeddings on one, a 3B classifier on another. The 5090 is the workhorse.
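If you're wondering how that split is wired up, it's nothing clever: one llama-server process per card, pinned with CUDA_VISIBLE_DEVICES. A minimal sketch; the model files, ports, and context sizes below are placeholders, not my exact configs.

```python
# One llama-server process per GPU, each pinned to its card via CUDA_VISIBLE_DEVICES.
# GGUF files, ports, and context lengths are illustrative placeholders.
import os
import subprocess

deployments = [
    # (GPU index, GGUF file, port, context length)
    (0, "llama-3.1-8b-instruct-Q5_K_M.gguf", 8080, 32768),  # 5090: the workhorse
    (1, "llama-3.2-3b-instruct-Q5_K_M.gguf", 8081, 8192),   # 5070 Ti: small classifier
    (2, "nomic-embed-text-v1.5-Q8_0.gguf",   8082, 2048),   # 3070: embeddings (also needs the
                                                            # server's embedding mode enabled)
]

for gpu, model, port, ctx in deployments:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin this process to one card
    subprocess.Popen(
        ["llama-server", "-m", model, "--port", str(port),
         "-c", str(ctx), "-ngl", "99"],                      # 99 = offload every layer to VRAM
        env=env,
    )
```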
Tokens per second on Llama 3.1 8B
On the 5090, Q5_K_M, single batch, no flash-attention tweaks beyond defaults:
- Prompt processing: ~3,200 tok/s
- Generation: ~140 tok/s sustained
For comparison, Claude Opus and GPT-4-class APIs land around 30-80 tok/s on generation depending on load. My local 8B is faster than the frontier cloud APIs for raw throughput. It's a smaller model, so output quality is lower for hard reasoning. For 80% of agent work (classify, extract, summarize, route, format), it's plenty.
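If you want to sanity-check throughput on your own card, the simplest way is to time a completion against llama-server's OpenAI-compatible endpoint. A rough sketch, assuming the 8B server is on port 8080; the URL, prompt, and model name are placeholders, and the elapsed time includes prompt processing, so it slightly understates pure generation speed.

```python
# Quick-and-dirty generation throughput check against a local llama-server.
import time
import requests

url = "http://localhost:8080/v1/chat/completions"   # the 8B server from the rig section
payload = {
    "model": "llama-3.1-8b-instruct",                # informational when one model is loaded
    "messages": [{"role": "user", "content": "Write 500 words about consumer GPUs."}],
    "max_tokens": 512,
}

start = time.time()
resp = requests.post(url, json=payload, timeout=300).json()
elapsed = time.time() - start

out_tokens = resp["usage"]["completion_tokens"]
# Includes prompt-processing time, so true sustained generation speed is a bit higher.
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.0f} tok/s")
```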
Cost per million tokens
Cloud reference points:
- GPT-4o: ~$5 input / $15 output per million
- Claude Sonnet 4.5: ~$3 input / $15 output per million
- Llama 3.1 8B on Together / Fireworks: ~$0.20 per million blended
My local cost, including amortized hardware and Texas electricity at $0.11/kWh:
The 5090 pulls about 400W under sustained inference. At 140 tok/s, one hour of generation produces 504,000 tokens for 0.4 kWh, or about 4.4 cents. That's $0.087 per million output tokens. Round it to 9 cents.
Hardware amortization is the bigger line. Call it $2,200 for the 5090, amortized over 3 years of mixed use. If the card logs 1,000 hours of inference per year, that's $0.73 per hour, or about $1.45 per million tokens.
Total all-in: roughly $1.55 per million output tokens on local, versus $15 on Claude Sonnet for the same job class. Ten times cheaper.
Caveat: I'm comparing an 8B model to frontier models. Apples to small oranges. But for the agent jobs where 8B is good enough, the math is settled.
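Here's that arithmetic in one place, so you can swap in your own power price, card cost, and utilization. The inputs are the assumptions above, not universal constants.

```python
# Worked cost-per-million-output-tokens math from the numbers above.
# Change these assumptions to match your own card, power price, and utilization.

gen_tok_per_s   = 140          # sustained generation speed
power_kw        = 0.400        # 5090 draw under sustained inference
kwh_price       = 0.11         # $/kWh
card_cost       = 2200.0       # $ for the 5090
years, hours_py = 3, 1000      # amortization window and inference hours per year

tokens_per_hour = gen_tok_per_s * 3600                                # 504,000
electric_per_m  = (power_kw * kwh_price) / (tokens_per_hour / 1e6)    # ~$0.09 per million
amort_per_hour  = card_cost / (years * hours_py)                      # ~$0.73 per hour
amort_per_m     = amort_per_hour / (tokens_per_hour / 1e6)            # ~$1.45 per million

print(f"electricity: ${electric_per_m:.2f}/M  hardware: ${amort_per_m:.2f}/M  "
      f"total: ${electric_per_m + amort_per_m:.2f}/M")
```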
Where local wins, where it doesn't
Wins:
- High-volume classification and extraction
- Anything privacy-sensitive (client data, medical, financial)
- Latency-sensitive interactive flows (no network round trip)
- Burst workloads that would smash cloud rate limits
Loses:
- Hard reasoning, multi-step planning, code generation at frontier quality
- Anything where you actually need the model's knowledge depth
- Workloads with idle gaps where the GPU sits dark and you eat the depreciation anyway
The right move for most builders right now is hybrid. Cloud frontier for the hard 20%. Local 8B or 14B for the routine 80%. Route between them based on task class.
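The router itself doesn't need to be clever. A minimal sketch of task-class routing; the class names, endpoints, and model IDs are placeholders, and in practice the thing deciding the task class can itself be a cheap local call.

```python
# Minimal task-class router: routine work goes local, hard work goes to a frontier API.
# Task classes, endpoints, and model names are illustrative placeholders.

LOCAL_8B   = {"base_url": "http://localhost:8080/v1", "model": "llama-3.1-8b-instruct"}
CLOUD_SOTA = {"base_url": "https://api.anthropic.com", "model": "claude-sonnet-4-5"}

ROUTES = {
    "classify":  LOCAL_8B,
    "extract":   LOCAL_8B,
    "summarize": LOCAL_8B,
    "route":     LOCAL_8B,
    "format":    LOCAL_8B,
    "plan":      CLOUD_SOTA,   # multi-step planning
    "code":      CLOUD_SOTA,   # frontier-quality code generation
    "reason":    CLOUD_SOTA,   # hard reasoning
}

def pick_backend(task_class: str) -> dict:
    """Default to the frontier model when the task class is unknown."""
    return ROUTES.get(task_class, CLOUD_SOTA)
```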
What Tunguz is actually saying
His VC framing matters. Now that Tunguz has posted about local LLMs, every CTO who reads his Sunday digest just got cover to take this seriously. The conversation moved from "Patrick's weird hobby rig" to "tier-1 VC thesis" in one blog post.
If you've been waiting for permission to test a local-first or hybrid architecture, this is it.
What this means for cost-controlled agents
I built AgentGuard because cost is the thing that kills agent projects in production. Local LLMs don't make cost discipline optional. They make it more important, because now you have three cost dimensions (cloud spend, local electricity, local hardware amortization) instead of one.
The same AgentGuard policies that cap your cloud budget should cap your local inference budget too. A runaway loop on a local model still burns wattage, still keeps your GPU at 90C, still pegs your CPU. Free at the margin doesn't mean free in practice.
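Mechanically, that means metering local output tokens against a dollar budget the same way you meter cloud calls. A hypothetical sketch (illustrative only, not AgentGuard's actual API), using the all-in local rate from the cost section:

```python
# Hypothetical budget guard for local inference (not AgentGuard's real API).
# Charges each local call its estimated electricity + amortization cost and
# stops the loop once the daily budget is spent.
from dataclasses import dataclass

LOCAL_COST_PER_M_TOKENS = 1.55   # $/M output tokens, from the math above

@dataclass
class LocalBudget:
    daily_limit_usd: float
    spent_usd: float = 0.0

    def charge(self, output_tokens: int) -> None:
        self.spent_usd += output_tokens / 1e6 * LOCAL_COST_PER_M_TOKENS
        if self.spent_usd > self.daily_limit_usd:
            raise RuntimeError(
                f"local inference budget exceeded: ${self.spent_usd:.2f} "
                f"of ${self.daily_limit_usd:.2f}"
            )

budget = LocalBudget(daily_limit_usd=5.00)
budget.charge(output_tokens=120_000)   # call after every local completion
```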
If you want to dig deeper into the consumer-GPU production setup, I wrote about running local LLM inference on consumer GPUs earlier this year. That post covers the stack choices, the quant tradeoffs, and the model-routing logic I use.
Bottom line
Localmaxxing is real. The numbers are real. The hardware is in stock. The tools are stable.
If you're building agents and your cloud bill is climbing, the answer might not be a better prompt or a cheaper model tier. It might be a $2,000 GPU and a weekend with llama.cpp.
Then put a budget on it. AgentGuard handles that part.