AI software runs on 17% margins. SaaS runs on 70%. The token bill is the problem.
AI-native software is shipping at roughly 17% gross margins while traditional SaaS sits near 70%. The token bill ate the unit economics. Here's what's actually broken and how to claw margin back.
A new analysis from Gptomics put a number on something every AI founder has been feeling. AI-native software businesses are running at about 17% gross margins. Traditional SaaS sits near 70%. The gap is the token bill.
If you ship an AI product and your COGS line keeps creeping, this is why. You did not misprice on purpose. Your cost structure repriced itself underneath you.
Where the margin actually went
A SaaS request costs you a few CPU cycles, some bandwidth, and a database read. Pennies on the thousand.
An AI request costs you tokens. And it is rarely one request.
- One user message becomes 3 to 12 model calls once you add retrieval, tool use, and a planner.
- Retries on rate-limit or 5xx errors double the bill on a bad day.
- Evals and guardrails run their own model calls on every turn.
- Memory and context grow, so input tokens grow, so every subsequent call gets more expensive.
- Long-running agents loop. A single stuck agent can burn $40 in an afternoon before anyone notices.
You priced the product like a SaaS app. You are operating it like a call center where every minute on the phone is metered.
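To make the fan-out concrete, here is back-of-envelope arithmetic. Every price and token count below is an assumption for illustration, not a measurement; the shape is what matters: one user message, several calls, context that grows with each one.

```python
PRICE_IN = 0.003 / 1000    # USD per input token, illustrative rate
PRICE_OUT = 0.015 / 1000   # USD per output token, illustrative rate

calls = [
    # (input_tokens, output_tokens) for one "simple" user message
    (1_500, 300),   # planner pass: user message + system prompt
    (4_000, 500),   # retrieval call: chunks stuffed into context
    (5_000, 400),   # tool-use call: prior turns + tool schemas + results
    (6_000, 800),   # final answer: everything above rides along as input
    (2_000, 100),   # eval/guardrail pass on the answer
]

cost = sum(i * PRICE_IN + o * PRICE_OUT for i, o in calls)
print(f"${cost:.3f} per user message")   # ~$0.09 here, versus pennies per thousand SaaS requests
```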
The three founder mistakes that lock you at 17%
I have looked at a lot of AI agent deployments in the last year. The same three holes show up.
1. No hard cost cap per user, per tenant, or per session.
If a single power user can spend $200 in a week on a $29 subscription, you are not running a SaaS business. You are running an unhedged short on token prices. The fix is a budget at the entity level, enforced before the model call, not in a dashboard you check on Monday.
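Here is a minimal sketch of the shape of that check, assuming an in-memory store and illustrative token prices. Every name and number in it (Budget, BudgetExceeded, the $5 cap) is a placeholder; in production the counter lives in Redis or Postgres, not process memory.

```python
from dataclasses import dataclass

PRICE_PER_1K_INPUT = 0.003    # assumed example rate, USD per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.015   # assumed example rate, USD per 1K output tokens

@dataclass
class Budget:
    limit_usd: float           # e.g. a weekly cap per user or per tenant
    spent_usd: float = 0.0

class BudgetExceeded(Exception):
    pass

budgets: dict[str, Budget] = {}   # placeholder store; swap for Redis/Postgres

def check_budget(entity_id: str, estimated_cost_usd: float) -> None:
    """Hard stop BEFORE the provider call, not a dashboard you read on Monday."""
    budget = budgets.setdefault(entity_id, Budget(limit_usd=5.00))
    if budget.spent_usd + estimated_cost_usd > budget.limit_usd:
        raise BudgetExceeded(f"{entity_id} would exceed its ${budget.limit_usd:.2f} cap")

def record_spend(entity_id: str, input_tokens: int, output_tokens: int) -> None:
    cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    budgets[entity_id].spent_usd += cost
```

The point is where the check lives: before the provider call, so the failure mode is a caught BudgetExceeded you can turn into a degraded response, not a surprise invoice.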
2. No model fallback ladder.
Every call goes to your best model. Most of those calls did not need it. A two-step ladder of cheap-first, escalate-on-failure cuts 40 to 70% of token spend on the routes I have actually measured. The work is not glamorous. The savings are.
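The ladder itself is a few lines. This is a sketch, not a prescription: the model ids are placeholders, and passes_eval stands in for whatever eval or guardrail you already run.

```python
from typing import Callable

CHEAP_MODEL = "cheap-model-id"        # placeholder ids; use your provider's model names
FRONTIER_MODEL = "frontier-model-id"

def passes_eval(text: str) -> bool:
    # Placeholder check. Swap in the eval/guardrail you already run.
    return bool(text.strip()) and len(text) > 20

def answer(prompt: str, call_model: Callable[[str, str], str]) -> str:
    """call_model(model_id, prompt) -> completion text, however you make that call."""
    draft = call_model(CHEAP_MODEL, prompt)
    if passes_eval(draft):
        return draft
    # Escalate only when the cheap model demonstrably failed the eval.
    return call_model(FRONTIER_MODEL, prompt)
```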
3. No per-tenant telemetry on token spend.
You know revenue per customer. You do not know cost per customer. So when a whale starts costing you more than they pay, you find out at quarter close. By then it has been three months.
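The telemetry is not complicated either. A rough sketch, assuming you emit one structured log line per model call to whatever pipeline you already run; the field names and prices are illustrative.

```python
import json
import time

PRICES = {  # USD per 1K tokens; illustrative numbers only
    "cheap-model-id": {"in": 0.0005, "out": 0.0015},
    "frontier-model-id": {"in": 0.003, "out": 0.015},
}

def log_call(user_id: str, tenant_id: str, route: str, model: str,
             input_tokens: int, output_tokens: int) -> None:
    price = PRICES[model]
    cost = (input_tokens / 1000) * price["in"] + (output_tokens / 1000) * price["out"]
    # One structured line per call; point this at your existing log/metrics pipeline.
    print(json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "tenant_id": tenant_id,
        "route": route,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }))
```

Once every call carries those tags, cost per customer is a GROUP BY, not a forensic exercise.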
These three holes are how a 70% margin product becomes a 17% margin product without anyone shipping a bad decision. Each one is a small omission that compounds.
What 30%+ margins look like
You are not getting back to SaaS 70%. The token bill is real. But 30 to 45% is doable, and that is the difference between a company and a science project.
The pattern that works:
- Budget caps at every layer. Per user, per workspace, per route. Hard stops, not warnings. When the cap hits, the request gets a graceful degraded response, not a $14 invoice.
- A fallback ladder. Cheap model first. Escalate only when the cheap model fails an eval or the user retries. Default to the floor, not the ceiling.
- Token telemetry per tenant. Every call tagged with user_id, tenant_id, route, model. Cost-per-customer becomes a number on a dashboard, not a quarterly surprise.
- Loop detection. Any agent that calls the model more than N times for one task gets killed. Stuck agents are the single biggest blow-up risk on a token bill. A minimal sketch follows this list.
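Loop detection can be as simple as a counter with a hard ceiling. The names and the ceiling below are illustrative; the useful part is raising before the call, so a stuck agent dies at call 16 instead of call 400.

```python
from collections import defaultdict

MAX_CALLS_PER_TASK = 15   # illustrative ceiling; tune per route

class AgentLoopKilled(Exception):
    pass

_call_counts: dict[str, int] = defaultdict(int)

def before_model_call(task_id: str) -> None:
    """Call this before every model call an agent makes for a given task."""
    _call_counts[task_id] += 1
    if _call_counts[task_id] > MAX_CALLS_PER_TASK:
        raise AgentLoopKilled(
            f"task {task_id} made {_call_counts[task_id]} model calls; aborting run"
        )
```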
You can build this yourself. Most teams do, badly, after the first surprise invoice. Or you can drop in something that already does it.
What I built
I wrote AgentGuard for exactly this. It is a Python SDK that wraps your model calls and enforces budgets, fallback, and telemetry at the call site. No new infra. No proxy server. Pip install and add a decorator.
pip install agentguard47
It is the boring infrastructure layer the AI stack still does not have a default for. If you are sitting at 17% margins and trying to figure out where the leak is, start here.
Want more like this?
AI agent builds, real costs, what works. One email per week. No fluff.
Patrick Hughes
Building BMD HODL, a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.