
GPU Prices Up 48% in Two Months. I Run LLMs in My Garage.

Blackwell rental hit $4.08/hr. CoreWeave raised prices 20%. Anthropic restricted their newest model to 40 orgs. Meanwhile, consumer GPUs are sitting idle.


Tom Tunguz published a piece this week on GPU compute scarcity. The numbers are wild.

The cloud GPU crisis

  • Nvidia Blackwell rental: $4.08/hr. Two months ago it was $2.75. That's a 48% increase.
  • CoreWeave raised prices 20% and extended minimum contracts from 1 year to 3 years.
  • Anthropic restricted their newest model to roughly 40 organizations.
  • OpenAI's CFO Sarah Friar: "We're making some very tough trades at the moment on things we're not pursuing because we don't have enough compute."

Tunguz identifies five dynamics hurting smaller players: relationship-based access (you need to know someone), price barriers, speed uncertainty, rising commodity costs, and forced migration to alternatives.

The last one is the interesting part.

Forced migration to alternatives means local inference

When cloud GPU prices go up 48% in two months, the math changes. When minimum contracts go from 1 year to 3 years, the commitment changes. When your model provider restricts access to 40 orgs, the reliability changes.

The market is pushing people toward two alternatives: smaller models and on-premise infrastructure.

I'm already there.

My setup

I run local LLM inference on consumer hardware in my garage in Nashville:

  • RTX 5090 (primary, 32GB VRAM)
  • RTX 5070 Ti (secondary, 16GB VRAM)
  • RTX 3070 (legacy, 8GB VRAM)

Llama 3.1 8B runs on the 5070 Ti via llama.cpp. Inference costs: electricity. No API calls. No rate limits. No 3-year contracts. No relationship-based access.

The 5090 handles bigger models when needed. 32GB VRAM fits quantized versions of models that would cost $4.08/hr to rent on Blackwell hardware.
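
For the curious, here's roughly what that setup looks like through the llama-cpp-python bindings. A minimal sketch: the GGUF filename is a placeholder for whatever quantization you've downloaded, and n_gpu_layers=-1 offloads every layer to the GPU.

from llama_cpp import Llama

# Load a quantized Llama 3.1 8B build fully onto the GPU.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers (fits easily in 16GB on the 5070 Ti)
    n_ctx=8192,       # context window; shrink if VRAM gets tight
)

# One local inference call: no API key, no rate limit, no per-token bill.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is local inference cheap?"}]
)
print(out["choices"][0]["message"]["content"])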

The math

Let's say you need 4 hours of GPU time per day for inference.

Cloud (Blackwell): $4.08/hr × 4 hrs × 30 days ≈ $490/month. And that's if you can get access.

Local (RTX 5090): The card costs roughly $2,000. Electricity for 4 hours of inference per day is maybe $15/month. After about four months, you've broken even. After that, it's $15/month forever.

$490/month vs $15/month. And you own the hardware.

This ignores the capability gap between Blackwell and consumer cards. Blackwell is faster. Blackwell handles bigger models. But for the 8B-70B parameter range that most production inference needs, consumer GPUs are more than enough.
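
If you want to rerun the arithmetic against your own usage, the break-even point is just hardware cost divided by monthly savings. A quick sketch using the numbers above (swap in your own rates):

# Break-even: months until the GPU pays for itself vs. cloud rental.
cloud_rate = 4.08     # $/hr, Blackwell rental
hours_per_day = 4
gpu_cost = 2000       # RTX 5090, rough street price
electricity = 15      # $/month, rough estimate for 4 hrs/day

cloud_monthly = cloud_rate * hours_per_day * 30   # ~$490/month
savings = cloud_monthly - electricity             # ~$475/month
print(f"Break-even in {gpu_cost / savings:.1f} months")  # ~4.2 months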

When local doesn't work

Local inference is not a universal answer.

You need cloud GPUs when:

  • You're training models (not just inference)
  • You need 100+ concurrent users with low latency
  • You're running 400B+ parameter models that don't fit in consumer VRAM
  • You need enterprise SLAs and uptime guarantees

But most builders aren't doing those things. Most builders need inference for their agents, their internal tools, their prototypes. Consumer hardware handles that.

The real moat

Compute scarcity is a moat for whoever controls the GPUs. It's also a moat for whoever figures out how to not need them.

If your AI agent runs on a $2,000 GPU in your office instead of a $4.08/hr cloud rental, you have a cost advantage that gets wider every month. When your competitor's cloud bill goes up, yours stays flat.

That's not theoretical. Tunguz's data shows the trend is accelerating. Cloud GPU costs are going up, while consumer GPUs keep delivering more capability per dollar.

What to do

  1. Audit your inference costs. How much are you spending on API calls that could run locally?
  2. Start with one model locally. A quantized Llama 3.1 8B on an RTX 3070 is a good starting point. Nothing but electricity after the hardware cost.
  3. Keep cloud for what needs cloud. Training, high-concurrency production, frontier models.
  4. Guard your cloud budget. Whether you're on cloud or local, set runtime limits so agents can't burn money unattended.
pip install agentguard47

AgentGuard works with any provider, cloud or local. Budget caps, loop detection, and timeout guards protect your agent runs regardless of where inference happens.
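
The real API is in the docs, but to make the pattern concrete, here's a stripped-down, hypothetical version of the idea. The class below is illustrative only, not AgentGuard's actual interface:

import time

# Hypothetical sketch of the budget-cap pattern: halt an agent loop
# before it blows past a dollar budget or a wall-clock timeout.
class BudgetGuard:
    def __init__(self, max_dollars: float, max_seconds: float):
        self.max_dollars = max_dollars
        self.max_seconds = max_seconds
        self.spent = 0.0
        self.started = time.monotonic()

    def charge(self, dollars: float) -> None:
        """Record one step's cost; raise if any limit is exceeded."""
        self.spent += dollars
        if self.spent > self.max_dollars:
            raise RuntimeError(f"Budget exceeded: ${self.spent:.2f}")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("Timeout: agent ran too long unattended")

guard = BudgetGuard(max_dollars=5.00, max_seconds=3600)
# After each model call in your agent loop:
# guard.charge(cost_of_call)  # local inference can pass ~0.0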

Get started with AgentGuard


Patrick Hughes

Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Fifteen agents.
