[bmdpat]
All writing
4 min read

Your AI Agent Says "Done." Make It Prove It.

AI agents report work as done that they never did. Make every completion a falsifiable claim a script can verify before you trust it.

Share LinkedIn

Last week one of my agents reported that it had logged a new vendor in my Vendor Map file. Clean run. No error. The summary said done. I opened the file. The vendor was not there. Nothing crashed. The agent simply believed it finished work it never did.

AI agents routinely report tasks as complete that were never actually done. A clean exit code and a "done" message prove the agent finished running, not that the work landed. The fix is to treat every reported completion as a falsifiable claim, then run a deterministic checker that verifies each claim against the real file, frontmatter, or command output before you trust it.

Why do AI agents report work they didn't do?

Because the model is predicting a plausible ending, not checking reality. When an agent writes "done," it is generating the most likely next sentence given a transcript that looks like finished work. That is not the same as opening the file and confirming the change is there.

This gets worse the longer the session runs. The agent edits, gets interrupted, edits again, loses track of which write actually committed, and signs off confident. I have watched Claude, Codex, and my own scheduled agents all do it. The failure is not laziness. It is that "I did X" is cheap to say, and the model has no built-in cost for saying it falsely.

What is the difference between a log line and proof?

A log line is a statement. Proof is a check against the world.

"Logged the vendor in Vendor Map" is a statement. It costs nothing and it can be wrong. Proof is: open Vendor Map, search for the vendor name, confirm the line exists. One of those you can automate. The other you have to trust. The whole problem is that we keep trusting the statement and calling it proof.

What does a falsifiable claim look like?

A falsifiable claim is a completion report written so a script can prove it false. Not prose. A small, fixed grammar.

When one of my agents reports it changed something, it has to record the claim in a structured form:

  • file_contains: this file now contains this string
  • path_moved: this file moved from here to there
  • frontmatter: this file's frontmatter field equals this value
  • glob_count: this many files match this pattern
  • command: this command, run under a locked directory, exits clean

If the agent cannot phrase its "done" as one of those, it does not get to call the work done. The constraint forces honesty. Vague claims are exactly the ones that hide false completions.

How do you check a claim automatically?

You write a checker that reads the claims and tests each one. Mine appends every claim to a daily file, and a small script walks the list. file_contains opens the file and greps. path_moved confirms the old path is gone and the new one exists. frontmatter parses the field and compares. command runs and checks the exit code.

If every claim holds, the report is green. If one fails, the report is red and names the claim that lied. The session does not get to mark itself complete until the checker passes. That single rule, completion is legitimate only after the checker is green, is what catches the vendor-map class of bug.

What happens when a claim turns out false?

The report goes red and tells me which claim failed. That is the point. A false "done" used to slip through silently and surface days later when I went looking for work that was never there. Now it surfaces the same day, with the exact claim that did not match reality.

I keep the claims narrow on purpose. A free-form opinion ("this approach is better") carries no check and is recorded but not verified. Only falsifiable statements get tested. You cannot verify a judgment, but you can always verify "this file contains this string."

Where does this save you the most?

Anywhere an agent runs unattended. A coding agent that says it added a test. A nightly job that says it updated a report. A fleet of scheduled tasks reporting completion while you sleep. The longer the gap between the claim and the moment you check it, the more a false "done" costs you.

The principle is the same one behind every external guard you put on an agent: do not trust the agent's self-report, enforce reality from outside. Claims-checking enforces that the work happened. The other half is enforcing that the work did not cost more than it should.

That is the gap AgentGuard fills on the spend side. It wraps an agent with a hard budget, token, and rate ceiling, so a confident agent that decides to retry all night stops at a limit instead of draining your account. Same idea as claims-checking, pointed at cost: don't trust, enforce. Install it free with pip install agentguard47, or read how it works at https://bmdpat.com/tools/agentguard

Want more like this?

AI agent builds, real costs, what works. M-F only when there is something worth sending. No fluff.

PH

Patrick Hughes

Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-Two agents.

More writing