How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU
I let an autonomous agent run 100 ML experiments while I slept. 7 succeeded. Net result: 25% model improvement. Here's the setup.
Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.
The Setup
The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:
- Propose — The agent analyzes current model performance and proposes a specific code change
- Implement — It writes the actual Python code to modify the neural network
- Train — The modified model trains on PubMed medical text data
- Evaluate — Loss metrics are compared against the baseline
- Decide — If improvement > threshold, keep the change. Otherwise, revert.
- Repeat — Go back to step 1 with updated context
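In code, the skeleton of that loop is small. Here's a self-contained toy version with the Claude call and the PyTorch training stubbed out — the stub bodies are illustrative, not the code from this post; only the keep/revert decision logic is the real shape (in the actual loop, "keep" is a git commit and "revert" is a git checkout):

```python
import random

def propose_change(history):
    # Stand-in for the Claude call: in the real loop this returns a code diff
    # plus a one-line summary, conditioned on the experiment history.
    return {"id": len(history), "summary": f"experiment {len(history)}"}

def train_and_evaluate(proposal):
    # Stand-in for a PyTorch train + eval cycle; returns a validation loss.
    random.seed(proposal["id"])  # deterministic toy "result" per experiment
    return 2.0 + random.uniform(-0.1, 0.1)

def run_agent_loop(n_experiments, threshold=0.02):
    baseline = 2.0  # current best validation loss
    kept, history = [], []
    for _ in range(n_experiments):
        proposal = propose_change(history)       # 1. propose
        loss = train_and_evaluate(proposal)      # 2-4. implement, train, eval
        if baseline - loss > threshold:          # 5. decide
            kept.append(proposal)                # real loop: git commit
            baseline = loss
        # else: discard the change               # real loop: git checkout
        history.append((proposal, loss))         # 6. repeat with context
    return baseline, kept

baseline, kept = run_agent_loop(100)
```

The decision rule is the whole game: a single scalar loss, a fixed threshold, no human judgment in the loop.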
The Results
Out of 100 experiments:
- 93 failed — proposed changes made the model worse or had no effect
- 7 succeeded — measurable improvements that the agent kept
- Net result — 25% improvement in model performance
The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
What the Agent Discovered
The 7 successful experiments included:
- Learning rate scheduling changes I wouldn't have tried
- A specific attention head configuration that improved convergence
- Batch size adjustments that were counterintuitive but worked
- Layer normalization placement that contradicted my assumptions
The Hardware
This runs on consumer hardware:
- GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
- CPU: Standard desktop AMD Ryzen
- RAM: 32GB
- Storage: 1TB NVMe SSD
Total cost for the overnight run: about $0.50 in electricity, plus Claude API calls. That near-zero per-experiment cost comes from doing the GPU work locally rather than renting compute from a cloud provider. Here's what serving a live LLM from a consumer GPU actually looks like in production.
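For the curious, the electricity number checks out on the back of an envelope. The figures below are assumptions, not measurements — roughly 350 W total system draw under GPU load, a 10-hour run, and a US-average rate near $0.15/kWh:

```python
# Back-of-envelope electricity cost for one overnight run.
# All three inputs are assumptions, not measured values.
system_draw_kw = 0.35   # RTX 3070 (~220 W) plus CPU, RAM, drives
hours = 10              # one overnight run
rate_per_kwh = 0.15     # $/kWh, varies a lot by region

cost = system_draw_kw * hours * rate_per_kwh  # roughly fifty cents
print(f"${cost:.2f}")
```

Your local rate and actual draw will shift this, but it stays in the tens-of-cents range either way.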
Why This Matters
The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.
Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.
The Code
The agent is ~300 lines of Python orchestrating:
- Claude Sonnet for reasoning and code generation
- PyTorch for training
- A simple SQLite database tracking all experiments
- Git for version control of each experiment
It's not magic. It's a loop with good prompts and clear evaluation criteria.
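The SQLite side really is that simple: one table, one insert per experiment. Here's a sketch with a hypothetical schema — the post doesn't show the actual columns, so these names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real run
conn.execute("""
    CREATE TABLE IF NOT EXISTS experiments (
        id       INTEGER PRIMARY KEY,
        proposal TEXT,     -- what the agent tried, in its own words
        diff     TEXT,     -- the code change it applied
        loss     REAL,     -- validation loss after training
        kept     INTEGER   -- 1 if it beat the baseline, else 0
    )
""")

def log_experiment(proposal, diff, loss, kept):
    conn.execute(
        "INSERT INTO experiments (proposal, diff, loss, kept) "
        "VALUES (?, ?, ?, ?)",
        (proposal, diff, loss, int(kept)),
    )
    conn.commit()

log_experiment("lower LR after epoch 2", "--- a/train.py ...", 1.84, True)
hit_rate = conn.execute("SELECT AVG(kept) FROM experiments").fetchone()[0]
```

The payoff of logging everything, including the 93 failures, is that the agent's next proposal can be conditioned on what already didn't work.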
What I Learned
- Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
- Failure is the feature — 93% failure rate is fine when experiments are cheap
- Consumer hardware is enough — you don't need cloud GPUs for meaningful research
- Overnight is the killer use case — run experiments while you sleep, review results over coffee
Try It Yourself
You need:
- A GPU (even a 3060 works)
- An API key for Claude or GPT
- A clear metric to optimize
- Patience to debug the loop
The hardest part isn't the code — it's defining what "better" means for your specific model.
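For a language model like the PubMed one here, one unambiguous definition of "better" is perplexity on a fixed held-out set, with a relative threshold driving the keep/revert decision. A minimal sketch — the numbers below are toy values, not results from these runs:

```python
import math

def perplexity(token_logprobs):
    """Perplexity over a held-out set: exp of mean negative log-likelihood.
    Lower is better, and it's a single comparable scalar per run."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def is_improvement(new_ppl, baseline_ppl, threshold=0.01):
    # Relative threshold: keep only changes that beat baseline by >1%.
    return (baseline_ppl - new_ppl) / baseline_ppl > threshold

# Toy per-token log-probs for two runs:
baseline = perplexity([-2.1, -1.9, -2.3, -2.0])
candidate = perplexity([-1.8, -1.7, -2.0, -1.9])
print(is_improvement(candidate, baseline))
```

The point isn't this particular metric — it's that whatever you pick has to be fixed before the loop starts, or the agent will happily optimize something you don't care about.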
Curious about commissioning something like this rather than building it yourself? Here's what custom autonomous agents actually cost in 2026.
Want me to build an autonomous agent for your workflow? Start a project →
Patrick Hughes
Building BMD HODL — a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.