How I Make Local Model Runs Fail Safely On A 5090
A local model run should prove its safety path before it proves a score. Here is the small guardrail loop I use on my RTX 5090 for QLoRA starter work.
The fastest way to lose a night on local AI is to treat the GPU like a black box. I want a local agent run to fail like a normal build step, not like a mystery job.
Summary: My current Phase 3 SFT starter on the RTX 5090 uses a verified 150-row dataset, a scale-5 first run, a 0.72 GPU memory cap, memory and temperature checks before model load, a 15-second cleanup pause, and 7 tests before I trust the output. Those numbers come from the local Phase 3 report and run log from 2026-07-01. For agent runs, I put AgentGuard beside that hardware loop so token, rate, and budget limits trip before the job spends too much.
Canonical URL: https://bmdpat.com/blog/local-model-runs-fail-safely-5090

Why should a local model run fail before it trains?
A local model run is not just a model call. It is a hardware job, a data job, a logging job, and a recovery job at the same time. If any one of those parts is fake, the whole result is theater.
I care about the failure path first because it is the path I will see at 1:00 AM. CUDA is touchy. A package changes. The model loads halfway, then falls over. If the run only works on the perfect path, it is not ready to be left alone.
Where does AgentGuard fit in a local run?
The GPU guard protects the machine. AgentGuard protects the loop around the model. That matters when a local model sits inside an agent that can retry, call tools, write files, or fall back to an API model.
I want both layers to behave the same way. Set a limit. Check it before expensive work. Fail loud when the limit is crossed. Leave a trace. Hardware checks catch heat and memory pressure. AgentGuard catches token, rate, and budget pressure before the agent turns a bad loop into spend.
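The four steps above (set a limit, check first, fail loud, leave a trace) can be sketched in a few lines. This is not AgentGuard's API. `LoopGuard`, `LimitExceeded`, and the default limits are hypothetical names and values I made up to show the shape.

```python
from dataclasses import dataclass, field


class LimitExceeded(RuntimeError):
    """Raised when a planned step would cross a configured limit."""


@dataclass
class LoopGuard:
    # Hypothetical limits; names and defaults are illustrative only.
    max_tokens: int = 50_000
    max_calls: int = 20
    tokens_used: int = 0
    calls_made: int = 0
    trace: list = field(default_factory=list)

    def check(self, planned_tokens: int) -> None:
        """Check limits BEFORE the expensive call and record the decision."""
        if self.calls_made + 1 > self.max_calls:
            self.trace.append(("blocked", "calls", self.calls_made))
            raise LimitExceeded(f"call limit {self.max_calls} reached")
        if self.tokens_used + planned_tokens > self.max_tokens:
            self.trace.append(("blocked", "tokens", self.tokens_used))
            raise LimitExceeded(f"token limit {self.max_tokens} would be crossed")
        self.trace.append(("allowed", "tokens", planned_tokens))

    def record(self, tokens: int) -> None:
        """Account for work that actually ran."""
        self.tokens_used += tokens
        self.calls_made += 1
```

The point of the split between `check` and `record` is that a blocked step never touches the counters, but it always lands in the trace.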
What did I test on the 5090?
The current starter is Phase 3 of my local 5090 rig. It is pointed at the Phase 2 dataset, which the local report says has 150 rows with gateway provenance. The direct run used TRUNCATE=1 and SCALE=5, then wrote five metric rows.
That run does not prove model quality. It proves the wiring. It proves the loader can find real data, the train path can start, the fallback can run, the metrics file can be written, and the tests can read the output.
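One way to keep the TRUNCATE and SCALE settings honest is to parse them once and fail early if either is missing or malformed. This is a sketch, not the Phase 3 script. `read_run_config` is a hypothetical helper, and it only assumes both values are integers; what TRUNCATE controls is not described here.

```python
import os
from typing import Mapping, Optional


def read_run_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read SCALE and TRUNCATE as integers, failing loudly on bad input."""
    env = os.environ if env is None else env
    config = {}
    for name in ("SCALE", "TRUNCATE"):
        raw = env.get(name)
        if raw is None:
            raise ValueError(f"{name} must be set explicitly for this run")
        try:
            config[name.lower()] = int(raw)
        except ValueError:
            raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if config["scale"] < 1:
        raise ValueError("SCALE must be at least 1")
    return config
```

Passing the environment in as a mapping makes the parser easy to test without touching the real process environment.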
Which guardrails belong around the GPU?
The guardrail I care about most is the boring one. Check the GPU before the expensive action. In the Phase 3 run, the script checked memory and temperature before each model load. The local report records a 0.72 memory cap and a pause rule around high memory or high temperature.
The run log recorded memory around 6.2% and temperature from 40C to 42C during the scale-5 pass. I treat that as a sanity check, not a capacity claim. The lesson is the loop shape, not the exact number. Check first, run small, clean up, then repeat.
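The check-first step can be written as a pure decision function that takes readings and returns a verdict. This is a sketch under assumptions: `GpuReading` and `preflight` are hypothetical names, the 0.72 cap is treated as a usage threshold (it could equally be a per-process allocation fraction), and the 80C temperature limit is a placeholder I picked, not a number from the report.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GpuReading:
    used_bytes: int
    total_bytes: int
    temp_c: float


def preflight(
    reading: GpuReading,
    mem_cap: float = 0.72,      # from the Phase 3 report
    temp_cap_c: float = 80.0,   # assumed placeholder, not from the report
) -> Tuple[bool, str]:
    """Return (ok, reason) so the caller can log why a load was paused."""
    if reading.total_bytes <= 0:
        return False, "no memory reading"
    mem_fraction = reading.used_bytes / reading.total_bytes
    if mem_fraction > mem_cap:
        return False, f"memory {mem_fraction:.1%} above cap {mem_cap:.0%}"
    if reading.temp_c > temp_cap_c:
        return False, f"temperature {reading.temp_c:.0f}C above cap {temp_cap_c:.0f}C"
    return True, f"memory {mem_fraction:.1%}, temperature {reading.temp_c:.0f}C"
```

In a real run the readings might come from `torch.cuda.mem_get_info()` and an NVML temperature query. Keeping the decision separate from the sensors means the pause rule can be tested on a machine with no GPU.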
Why keep a fallback path?
The Unsloth path did what real local AI work often does. It started the 4-bit QLoRA load path, then failed with an AttributeError in the model load stack. That is annoying. It is also useful.
A fragile script would crash and leave me with a half-answer. This script logged the safe failure, cleared resources, slept for 15 seconds, and moved to the next iteration. At the end it still wrote five metric rows and printed the final line: Phase 3 SFT wrote 5 rows (scale=5, real) ... via streaming.
That is the bar. The fallback is not fake success. It is proof that the job can fail without losing the audit trail.
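The fail-without-losing-the-trail behavior can be sketched as a loop where every iteration yields one row, whichever path produced it. This is an illustration, not the Phase 3 script. `run_iterations` and its callables are hypothetical, and the order of fallback and cleanup inside one iteration is my assumption.

```python
import time
from typing import Callable, List


def run_iterations(
    scale: int,
    primary: Callable[[int], dict],
    fallback: Callable[[int], dict],
    cleanup: Callable[[], None],
    pause_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Run `scale` iterations; each one contributes exactly one metric row."""
    rows = []
    for i in range(scale):
        try:
            row = dict(primary(i), path="primary")
        except Exception as exc:
            # Log the failure, then take the fallback and keep the error name.
            print(f"iter {i}: primary failed safely: {type(exc).__name__}: {exc}")
            row = dict(fallback(i), path="fallback", primary_error=type(exc).__name__)
        finally:
            cleanup()       # always release resources
            sleep(pause_s)  # cleanup pause before the next load
        rows.append(row)
    return rows
```

Each fallback row carries `primary_error`, so the metrics show which path ran instead of passing the fallback off as a clean success.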
How do tests keep the run honest?
The Phase 3 starter has 7 tests. They check the real dataset load, exact row counts, CLI environment variables, row shape, sync behavior, and scratch log proof. That is not a full eval suite. It is a smoke gate for the machinery.
AgentGuard belongs in the same test mindset. If an agent has a token cap, budget cap, or rate cap, test the blocked path too. The blocked path is not a corner case. It is the whole point of the guard.
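A blocked-path test can be as small as the allowed-path test next to it. The guard below is a stand-in, not AgentGuard. `charge` and `BudgetExceeded` are hypothetical names used only to show the two tests side by side.

```python
class BudgetExceeded(RuntimeError):
    """Raised before a call that would push spend past the cap."""


def charge(spent_usd: float, cost_usd: float, cap_usd: float) -> float:
    """Hypothetical guard: return the new total, or raise before spending."""
    if spent_usd + cost_usd > cap_usd:
        raise BudgetExceeded(f"cap {cap_usd:.2f} would be crossed")
    return spent_usd + cost_usd


def test_allowed_path_accumulates():
    assert charge(0.25, 0.25, cap_usd=1.00) == 0.50


def test_blocked_path_raises_and_spends_nothing():
    spent = 0.75
    try:
        charge(spent, 0.50, cap_usd=1.00)
    except BudgetExceeded:
        pass
    else:
        raise AssertionError("guard let an over-budget call through")
    assert spent == 0.75  # the running total is unchanged
```

Both functions follow the `test_` naming convention, so a runner such as pytest would collect them as written.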
What should a builder copy from this?
Copy the run contract, not my exact stack.
Start with real input data. Run the smallest job that proves the full path. Put a resource check before model load. Put cleanup in a finally path. Write append-only metrics. Keep the final log line boring and easy to search. Add AgentGuard around any agent loop that can spend tokens or money. Run tests that prove row counts, real file paths, and blocked limits.
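The append-only metrics and the boring final line from that list can be sketched with the standard library. The JSON Lines layout, the helper names, and the `RUN_DONE` format are my assumptions for illustration, not the format the Phase 3 script writes.

```python
import json
from pathlib import Path


def append_metric(path: Path, row: dict) -> None:
    """Append one JSON line; existing rows are never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")


def count_rows(path: Path) -> int:
    """Count non-empty lines so a test can assert the exact row count."""
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def final_line(path: Path, scale: int) -> str:
    """One stable, greppable summary line for the end of the log."""
    return f"RUN_DONE rows={count_rows(path)} scale={scale}"
```

Opening the file in append mode is what keeps earlier rows intact when a later iteration crashes, and a fixed prefix on the last line makes a finished run easy to find with a plain text search.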
Accompanying prompt
What the prompt does: It turns a local model experiment into a fail-safe run contract before you spend a night on it.
If you are building agents that can burn tokens or money when a loop goes sideways, put a limiter in the path. AgentGuard is my open source budget and runtime guard: https://bmdpat.com/tools/agentguard
Patrick Hughes
Building BMD HODL, a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.