The CrewAI demo worked. Then the tool call retried 913 times.
The demo worked. Then the same CrewAI tool call retried until the run became an operator problem.
Here is the agent failure mode nobody shows in the demo.
The CrewAI flow works on the first run. The agents are named well. The tasks are clear. The tool call returns the right data.
Then production input changes.
The vendor API returns a 429. The search tool returns the same empty page. The file tool cannot find the path.
The agent does what agents do.
It tries again.
Why this gets expensive
One retry is fine. Ten retries might be fine.
Nine hundred retries is not a bug report. It is a bill.
The problem is not CrewAI. The problem is shipping an autonomous loop without runtime limits.
If the agent can retry, the runtime needs to know:
- How many times did this action repeat?
- Did the tool input actually change?
- Is cost rising faster than expected?
- Has a human approved this path?
- Should the run stop now?
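Those five checks fit in a small wrapper around every tool call. A minimal sketch, assuming a hypothetical `ToolGuard` class (the names, caps, and cost accounting here are illustrative, not a CrewAI API):

```python
import hashlib

class ToolGuard:
    """Tracks repeated tool calls and spend; decides whether a run may continue."""

    def __init__(self, max_repeats=5, max_cost_usd=2.00):
        self.max_repeats = max_repeats
        self.max_cost_usd = max_cost_usd
        self.seen = {}        # hash of (tool, input) -> repeat count
        self.cost_usd = 0.0

    def check(self, tool_name, tool_input, call_cost_usd):
        """Return True if the call may proceed, False if the run should stop."""
        key = hashlib.sha256(f"{tool_name}:{tool_input}".encode()).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        self.cost_usd += call_cost_usd
        if self.seen[key] > self.max_repeats:
            return False  # same tool, same input, too many times: a loop, not progress
        if self.cost_usd > self.max_cost_usd:
            return False  # budget cap crossed: stop before the bill grows
        return True
```

Hashing the tool name plus the exact input is what distinguishes a loop from legitimate retries: if the input never changes, repeat number six is not new work.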
What I want to see live
For a CrewAI workflow, I want a simple map:
Crew -> agent -> task -> tool call -> retry -> budget -> alert -> kill state.
Not a wall of spans. Not a giant trace log.
A control map.
When a tool repeats, I want it obvious. When spend crosses a cap, I want it obvious. When a kill switch is armed, I want it obvious.
That is the difference between watching an agent and operating one.
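That control map can be as small as a state enum plus one transition function. A sketch with hypothetical state names (this is illustrative, not a CrewAI API):

```python
from enum import Enum

class RunState(Enum):
    RUNNING = "running"
    RETRYING = "retrying"
    BUDGET_EXCEEDED = "budget_exceeded"
    KILLED = "killed"

def next_state(state, repeats, spend_usd, cap_usd, kill_armed):
    """Collapse the run into one operator-visible state, worst condition first."""
    if kill_armed:
        return RunState.KILLED            # human pulled the switch: everything stops
    if spend_usd > cap_usd:
        return RunState.BUDGET_EXCEEDED   # obvious, not buried in a trace
    if repeats > 1:
        return RunState.RETRYING          # the tool is repeating: surface it
    return state
```

One state per run is the point: an operator glances at it, not at a wall of spans.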
The synthetic failure
Example:
Agent: vendor research agent. Task: enrich a vendor before contract review. Tool: company search API.
The API starts returning 429. The agent keeps asking the same question. The retries produce no new data. Cost rises. No one gets an alert.
That is exactly the kind of run that should stop itself.
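The self-stopping version is a few lines. A sketch with a stubbed search tool that always rate-limits, the way the vendor API did (tool and function names are made up for the example):

```python
def flaky_search(query):
    """Stub: the vendor API is rate-limiting, so every call fails the same way."""
    return {"status": 429, "results": []}

def run_agent(query, max_attempts=5):
    """Bounded retry loop: identical input plus identical failure ends the run."""
    for attempt in range(1, max_attempts + 1):
        resp = flaky_search(query)
        if resp["status"] == 200:
            return resp["results"]
        # Same question, same 429: another attempt cannot produce new data.
    raise RuntimeError(f"stopped after {max_attempts} identical failures on {query!r}")
```

Five attempts, one exception, one incident to review. Not 913 calls and a bill.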
What good looks like
Good does not mean the agent never fails.
Good means the failure is bounded.
- Retry count is capped.
- Budget burn is capped.
- Alert delivery is visible.
- A human can kill the run.
- The incident is retained.
That is what a client should see before trusting an agent with real work.
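Those five controls compose into one small object. A sketch of a hypothetical `RunController` (caps, field names, and the incident format are assumptions, not a real library's API):

```python
import time

class RunController:
    """Bounded-run controller: retry cap, budget cap, kill switch, retained incidents."""

    def __init__(self, max_retries=5, max_spend_usd=2.0):
        self.max_retries = max_retries
        self.max_spend_usd = max_spend_usd
        self.killed = False
        self.incidents = []  # retained in memory here; persist these in real use

    def kill(self, reason):
        """Human kill switch: stops the run and records who-knows-why."""
        self.killed = True
        self._record("human_kill", reason)

    def allow(self, retries, spend_usd):
        """Gate every action: False means the run must stop now."""
        if self.killed:
            return False
        if retries > self.max_retries:
            self._record("retry_cap", retries)
            return False
        if spend_usd > self.max_spend_usd:
            self._record("budget_cap", spend_usd)
            return False
        return True

    def _record(self, kind, detail):
        self.incidents.append({"ts": time.time(), "kind": kind, "detail": detail})
```

The incident list is the part clients actually ask about: not whether the agent failed, but whether the failure was bounded and whether anyone can show the record.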
Patrick Hughes
Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.