Claude Code caching is eating your budget. Here's what's happening.
Claude Code's prompt caching has two TTLs (5 minutes and 1 hour) and two price tiers (a 25% or 100% premium on cache writes), and most people have no idea which one they're paying for. If you run Claude Code in production, you need to read this.
The Register ran a piece this week on Claude Code cache confusion. Short version: users are paying more than they expected, performance is inconsistent, and nobody has a clear mental model for when cache writes get charged at 25% versus 100% over base input tokens.
I've been watching my own Claude Code bill closely. Here's what's happening and what to do about it.
The two TTLs
Claude's prompt caching has two expiration windows:
- 5-minute (ephemeral): the default. Cache writes cost 1.25x base input tokens. Reads cost 0.1x.
- 1-hour: opt-in for longer sessions. Cache writes cost 2x base input tokens. Reads are still 0.1x.
The minimum cacheable block is around 1024 tokens. You get up to 4 cache breakpoints per request.
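At the API level, a cache breakpoint is just a `cache_control` field on a content block. Here's a sketch of what a request with two breakpoints might look like, using the Anthropic Messages API block shape. The model id, block contents, and token counts are illustrative, not what the Claude Code CLI actually sends.

```python
# Sketch of a request payload with two cache breakpoints. Anything before a
# breakpoint is cached as a prefix; 5-minute TTL is the default.

SYSTEM_PROMPT = "..."  # tools + CLAUDE.md + scaffolding, ~40-50k tokens
OPEN_FILES = "..."     # currently-open files, another 5-20k tokens

payload = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 4096,
    "system": [
        # Breakpoint 1: the stable prefix that rarely changes.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 2: open files change more often, so they get their own
        # breakpoint -- editing a file only invalidates the cache from here on.
        {"type": "text", "text": OPEN_FILES,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [{"role": "user", "content": "Refactor the auth module."}],
}

# Note: only blocks at or above the minimum cacheable length are actually
# cached; anything smaller is billed as plain input every turn.
```

Splitting stable and volatile context across breakpoints is what keeps a file edit from invalidating the whole cached prefix.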
Tight back-and-forth inside a 5-minute window? The 5-min cache is fine. You write once, read many, amortize the cost.
Coding for an hour, stepping away for coffee, coming back? The 5-min cache expired while you were gone. Next request pays the full write again.
Nobody tells you which mode your session is in at any given moment.
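The amortization argument is quick to check with the multipliers above: uncached, the prefix costs 1.0x base input every turn; cached, you pay the write multiplier once and the read multiplier after that.

```python
def breakeven_turns(write_mult: float, read_mult: float = 0.10) -> int:
    """Smallest turn count at which caching a prefix beats not caching it.

    Uncached: the prefix costs 1.0x base input on every turn.
    Cached:   write_mult once, then read_mult on each later turn.
    """
    n = 1
    while write_mult + read_mult * (n - 1) >= n * 1.0:
        n += 1
    return n

print(breakeven_turns(1.25))  # 5-min cache: pays off from turn 2
print(breakeven_turns(2.00))  # 1-hour cache: pays off from turn 3
```

So both tiers pay for themselves within two or three turns. The catch is the premise: "later turns" only count if they land before the TTL expires.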
What a day of coding actually costs
Rough math. Your Claude Code session has:
- 40-50k tokens of tools, CLAUDE.md, and scaffolding that gets cached
- Some currently-open files (another 5-20k)
- 1-3k tokens per turn of conversation
With 5-min caching over a typical 2-hour session where you idle 4 times (coffee, Slack, lunch, thinking):
- First turn: 1.25x on ~50k cached tokens. That's 12.5k tokens of write premium on top of the base cost.
- Reads inside the 5-min window: about $0.02 per turn at Sonnet rates.
- Each idle longer than 5 minutes triggers another rewrite. Another 12.5k in premium.
Four rewrites in a day = around 50k tokens of pure cache-write premium. At current Sonnet input rates (~$3 per million), that's about $0.15/day per active session in overhead you weren't tracking.
Sounds small. Scale it. Five sessions a day, 20 workdays a month. That's $15/month in cache-write overhead alone. More if your CLAUDE.md is large or you have a big tool set.
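The back-of-envelope math above fits in a few lines. The rate and token counts are this post's assumptions, not published constants, so swap in your own.

```python
BASE_INPUT_PER_MTOK = 3.00   # assumed Sonnet input rate, $/M tokens
WRITE_PREMIUM_5MIN = 0.25    # 5-min cache writes cost 1.25x base input

def daily_write_overhead(cached_tokens: int, rewrites: int) -> float:
    """Dollar cost of the cache-write premium alone, per session per day."""
    premium_tokens = cached_tokens * WRITE_PREMIUM_5MIN * rewrites
    return premium_tokens / 1_000_000 * BASE_INPUT_PER_MTOK

per_session = daily_write_overhead(cached_tokens=50_000, rewrites=4)
monthly = per_session * 5 * 20   # 5 sessions/day, 20 workdays
print(f"${per_session:.2f}/day per session, ${monthly:.2f}/month")
# → $0.15/day per session, $15.00/month
```

Double the cached prefix to 100k tokens, or double the idle-triggered rewrites, and the monthly figure doubles with it.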
The shift to per-token pricing this month makes this worse. You no longer have a flat subscription to absorb the variance.
Why this is worse than it sounds
Three things compound:
Cache writes on every fresh session. Every claude cold-start rewrites the full system prompt. 1.25x on ~40k tokens. At least $0.15 just to start each session.
The 1024-token floor. Blocks under 1024 tokens don't cache at all. You pay full input price every time. Small files and short tool definitions silently fall off the cache path.
The 1-hour option is not automatic. You need to pass cache_control: { type: "ephemeral", ttl: "1h" } in the API call. Claude Code the CLI handles this in most paths, but if you're driving the SDK directly, you're on 5-min caching until you opt in.
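If you're driving the API yourself, the 1-hour opt-in is one extra field on the block you want cached. A sketch of the request body (the model id and prompt are placeholders, and depending on your API version the 1-hour TTL may still sit behind a beta header, so check the current prompt-caching docs):

```python
# Request-body sketch for opting a system-prompt block into the 1-hour cache.
# Without the explicit "ttl" field, you get the default 5-minute cache.

LONG_SYSTEM_PROMPT = "..."  # placeholder for your tools + CLAUDE.md scaffolding

request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 4096,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # The opt-in: 2x write cost instead of 1.25x, but the cache
            # survives a coffee break instead of expiring at 5 minutes.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Continue where we left off."}],
}
# Pass these fields as kwargs to client.messages.create(...) in the
# official anthropic SDK.
```

For a 40k-token prefix, the 1-hour tier costs one extra 5-min-write's worth up front and buys you eleven fewer expiry windows per hour of idling.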
The Register flagged the bigger issue: the billing surface is opaque. You see the total on your invoice. You don't see cache-write vs cache-read vs base-input split out the way you'd split out the code that caused them.
Per-token pricing makes it a real problem
Anthropic shifted to per-token billing on several tiers this month. Max-plan-style flat absorption is gone on those paths. Every cache miss hits your card.
Per-token plus opaque cache costs = your bill is variable AND unpredictable.
Variable is fine if it's predictable. Unpredictable is fine if it's bounded. Both at once is the exact pattern that sinks AI-dev-tool budgets.
How to control it
Three things, in order:
1. Set a daily token budget, not a monthly one. You want the alert the afternoon you overspend, not at the end of the month when the damage is done.
2. Close idle sessions. A 2-hour tmux-backgrounded Claude Code window is burning cache rewrites on every follow-up. Open a fresh one instead of reviving a stale one.
3. Put a hard cap on every agent process you spawn. If you script Claude Code via claude --headless or the SDK, don't trust the process to self-regulate. Cap the runtime at the guardrail layer, below the model.
Number three is the one most people skip and most often regret.
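A hard cap doesn't need to be sophisticated to work. Here's a minimal sketch of the idea, a guard object your agent loop charges after every model response. This is illustrative only, not AgentGuard's actual API; the rates and cap values are made up.

```python
import time

class BudgetCap:
    """Hypothetical guardrail: hard-stop an agent loop on spend, turns, or time."""

    def __init__(self, max_dollars: float, max_turns: int, max_seconds: float):
        self.max_dollars = max_dollars
        self.max_turns = max_turns
        self.deadline = time.monotonic() + max_seconds
        self.spent = 0.0
        self.turns = 0

    def charge(self, input_tokens: int, output_tokens: int,
               in_rate: float = 3.00, out_rate: float = 15.00) -> None:
        """Record one turn; raise the moment any cap is crossed."""
        self.turns += 1
        self.spent += (input_tokens * in_rate + output_tokens * out_rate) / 1e6
        if self.spent > self.max_dollars:
            raise RuntimeError(f"budget cap hit: ${self.spent:.2f}")
        if self.turns > self.max_turns:
            raise RuntimeError(f"turn cap hit: {self.turns} turns")
        if time.monotonic() > self.deadline:
            raise RuntimeError("timeout cap hit")

cap = BudgetCap(max_dollars=2.00, max_turns=50, max_seconds=600)
# Inside your agent loop, after every model response:
#     cap.charge(resp.usage.input_tokens, resp.usage.output_tokens)
```

The point is where it sits: the exception fires in your process, below the model, so a runaway agent stops at the cap instead of on the invoice.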
What I use
I built AgentGuard for exactly this. Runtime cost and safety guardrails for any agent you spawn, including Claude Code via SDK. Set a dollar budget, a max number of turns, a timeout, and a tool-call rate limit. The agent stops when it crosses a line. You find out at the cap, not on the next invoice.
pip install agentguard. MIT license, OSS. Pro dashboard is $39/mo if you want alerting and historical spend charts.
The Register article is a warning shot. Prompt caching is a sharp tool. Use it wrong and it compounds. Use it right with a hard runtime cap and it's fine.
Don't wait for your next invoice to find out which one you're doing.
Patrick Hughes
Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Fifteen agents.