
The flat-fee era is over. How to control your AI agent costs in 2026.

Anthropic shifted enterprise billing to per-token pricing. Every provider is expected to follow within six months. Here's how agent costs change and how to cap them at runtime.



Anthropic just shifted enterprise billing from flat-fee to per-token. Every major provider is expected to follow within six months. If your AI agents are in production, your cost structure just changed. You need runtime budget guards now, not in Q3.

For the last two years, flat-fee enterprise plans absorbed the weird stuff. Agent looping? Flat fee. Tool-call storm? Flat fee. Prompt bloat from a bad merge? Flat fee. The bill was predictable even when usage wasn't.

That era is over.

What changed

Anthropic's Enterprise tier switched from flat-fee to per-token pricing this week. Every token in, every token out, billed at metered rates. No ceiling. No flat absorption.

Implicator AI reported the change, and the move is not isolated. Every major provider is on the same trajectory: Google and OpenAI are expected to announce similar shifts within six months. Flat-fee pricing was a subsidy for early adoption. Now that production AI is table stakes, the subsidy is gone.

What it means for teams running agents

Under flat-fee, the question was "do we have the plan?" Under per-token, the question is "how much are we burning per session?"

Here's the math. A coding agent typically uses:

  • 40k tokens of system prompt, tools, and CLAUDE.md (cached)
  • 5-15k tokens of working context per turn
  • 1-3k tokens of output per turn

At Claude Sonnet 4 rates (~$3 per million input, $15 per million output), one turn is around $0.10. A 40-turn session is $4. Ten sessions a day across a team is $40/day. Twenty workdays is $800/month.
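The arithmetic above can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using the upper end of the ranges above; it assumes cache reads bill at roughly 10% of the input rate, which is the published ratio at the time of writing — verify against current pricing before relying on it.

```python
# Back-of-the-envelope cost model. Rates are USD per token at assumed
# Claude Sonnet 4 list prices; cache reads assumed at 10% of the input rate.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ_RATE = 0.30 / 1_000_000

def turn_cost(cached_tokens: int, fresh_in_tokens: int, out_tokens: int) -> float:
    return (cached_tokens * CACHE_READ_RATE
            + fresh_in_tokens * INPUT_RATE
            + out_tokens * OUTPUT_RATE)

# Upper end of the ranges above: 40k cached base, 15k working context, 3k output.
per_turn = turn_cost(40_000, 15_000, 3_000)
print(f"per turn: ${per_turn:.2f}")                          # ~$0.10
print(f"40-turn session: ${per_turn * 40:.2f}")              # ~$4
print(f"10 sessions/day x 20 days: ${per_turn * 40 * 10 * 20:.0f}")  # ~$800
```

Run it with your own token counts and current rates; the point is that per-turn cost is a one-line function, which means it can also be a one-line guard.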

That's the normal case. Failure cases are worse.

Three failure modes that only hurt under per-token

Stuck loops. Your agent decides to re-run the same tool 200 times to check something. Under flat-fee: invisible. Under per-token: $30 on your card before anyone notices.

Bad tools. A shell tool that returns 10MB of output on every call. The agent reads it, reasons over it, calls again. Token usage compounds.

Prompt bloat. Someone adds a new section to CLAUDE.md and bumps the base context by 20k tokens. Every session pays for that on every turn. The bill climbs 20% next month and nobody knows why.

Under flat-fee, these cost nothing. Under per-token, they cost you immediately and silently.
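The prompt-bloat case is the easiest to catch before it ships. A hypothetical CI check, sketched below, fails the build when the base context grows past a budget — the 4-characters-per-token ratio is a rough heuristic, not a real tokenizer, and the budget and file list are yours to choose.

```python
import sys
from pathlib import Path

# Hypothetical CI check against silent prompt bloat: fail the build when the
# agent's base context (system prompt, CLAUDE.md, tool schemas) grows past a
# token budget. len(text) // 4 is a rough heuristic, not a real tokenizer.
BASE_CONTEXT_BUDGET = 45_000  # tokens; pick your own ceiling

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def check_base_context(paths: list[str], budget: int = BASE_CONTEXT_BUDGET) -> int:
    total = sum(estimate_tokens(Path(p).read_text()) for p in paths)
    if total > budget:
        sys.exit(f"base context ~{total} tokens exceeds budget of {budget}")
    return total

if __name__ == "__main__":
    check_base_context(sys.argv[1:])
```

Wire it into pre-commit or CI against CLAUDE.md and your tool definitions, and a 20k-token bump becomes a failed build instead of a 20% line item next month.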

Why cost caps were optional before and are mandatory now

Most teams don't have runtime cost controls because flat-fee made them unnecessary. Maybe you rate-limited the API key. Maybe you didn't. Either way, the monthly cap absorbed misbehavior.

Now the monthly cap is gone. You need enforcement at the runtime layer, between your application code and the model API. Not after-the-fact analytics. Not dashboards that show the damage the next day. A kill switch that stops the agent mid-session when it crosses a threshold.

The observability-first approach is wrong for this class of problem. Tracing is for debugging, not for budget enforcement. By the time a Grafana dashboard tells you the bill is up, the money is already gone.

How to defend

Three controls you want in every agent process:

1. Budget guard. Hard dollar cap per session. When cost crosses $X, the agent terminates. Not throws a warning. Terminates.

2. Loop guard. Detect repeated tool calls within a sliding window. If an agent calls read_file(same_path) five times in a row, something is wrong and you don't want to pay for the sixth.

3. Timeout guard. Wall-clock kill switch. No session runs longer than N minutes without explicit re-authorization.

The common theme: these run in-process with your agent, not at the dashboard layer. They fail the call before it hits Anthropic's API, not after.
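All three controls fit in one small in-process class. This is an illustrative sketch, not AgentGuard's actual API — the class name, thresholds, and `check()` signature are assumptions for the sake of the example.

```python
import time
from collections import deque

class BudgetExceeded(RuntimeError):
    pass

# Minimal in-process guards: budget, loop, and timeout. Names and defaults
# are illustrative, not AgentGuard's actual API.
class SessionGuards:
    def __init__(self, max_cost: float = 5.00, max_minutes: float = 15,
                 loop_window: int = 5):
        self.max_cost = max_cost
        self.deadline = time.monotonic() + max_minutes * 60
        self.spent = 0.0
        self.recent = deque(maxlen=loop_window)  # last N tool calls

    def check(self, turn_cost: float, tool_call=None) -> None:
        """Run before every model call; raises to kill the session mid-flight."""
        self.spent += turn_cost
        if self.spent > self.max_cost:
            raise BudgetExceeded(f"spent ${self.spent:.2f}, cap ${self.max_cost:.2f}")
        if time.monotonic() > self.deadline:
            raise TimeoutError("wall-clock limit hit; re-authorize to continue")
        if tool_call is not None:
            self.recent.append(tool_call)
            # identical calls filled the whole window: the agent is looping
            if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
                raise RuntimeError(f"loop: {tool_call!r} repeated {self.recent.maxlen}x")
```

In the agent loop you'd call `guards.check(estimated_cost, (tool_name, args))` before each API request, so the offending call never leaves the process.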

Monitoring is useful. Monitoring plus enforcement is the full answer. Monitoring alone is not.

What I use

I built AgentGuard because the tools that existed lived at the observability layer. They told you what happened. They didn't stop it from happening.

AgentGuard is a Python SDK that provides runtime guards for any AI agent: budget, loop, timeout, rate limit, tool-call ceiling, custom rules. One pip install, zero dependencies, works with any LLM provider (OpenAI, Anthropic, local, whatever).

pip install agentguard47. MIT license, OSS. Pro dashboard is $39/mo if you want alerting and historical spend tracking.

The shift to per-token pricing is not going to reverse. Every team running agents in production needs a cost floor under their worst day. Build it now, or learn about the need at the end of the month when the invoice arrives.

Set runtime budget guards on your AI agents with AgentGuard →


Patrick Hughes

Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Fifteen agents.
