How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU
I built an autonomous research agent that proposes neural network changes, trains models, evaluates results, and iterates — all while I sleep. Here's what happened.
Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.
The Setup
The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:
- Propose — The agent analyzes current model performance and proposes a specific code change
- Implement — It writes the actual Python code to modify the neural network
- Train — The modified model trains on PubMed medical text data
- Evaluate — Loss metrics are compared against the baseline
- Decide — If improvement > threshold, keep the change. Otherwise, revert.
- Repeat — Go back to step 1 with updated context
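The loop above can be sketched in a few lines of Python. This is a minimal, simulated version: `propose_change` and `train_and_eval` are hypothetical stand-ins for the real Claude call and the real PyTorch training run, and the git commit/revert step is shown only as comments.

```python
import random

def propose_change(history):
    # Hypothetical stand-in for the Claude API call: in the real agent this
    # returns an actual code edit, informed by the history of past results.
    return f"experiment-{len(history)}"

def train_and_eval(proposal):
    # Hypothetical stand-in for a PyTorch training run; here we just
    # simulate a validation loss so the loop's logic is visible.
    return random.uniform(2.0, 4.0)

def run_experiments(n, baseline_loss, threshold=0.05):
    best_loss, history = baseline_loss, []
    for _ in range(n):
        proposal = propose_change(history)
        loss = train_and_eval(proposal)
        kept = loss < best_loss - threshold  # keep only clear improvements
        if kept:
            best_loss = loss   # real agent: git commit the change here
        # otherwise: git checkout -- .  (revert the change)
        history.append({"proposal": proposal, "loss": loss, "kept": kept})
    return best_loss, history
```

The key design choice is that `history` is fed back into every proposal, so failed experiments still inform the next one.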
The Results
Out of 100 experiments:
- 93 failed — proposed changes made the model worse or had no effect
- 7 succeeded — measurable improvements that the agent kept
- Net result — a 25% improvement over the baseline loss metric
The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
What the Agent Discovered
The 7 successful experiments included:
- Learning rate scheduling changes I wouldn't have tried
- A specific attention head configuration that improved convergence
- Batch size adjustments that were counterintuitive but worked
- Layer normalization placement that contradicted my assumptions
The Hardware
This runs on consumer hardware:
- GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
- CPU: Standard desktop AMD Ryzen
- RAM: 32GB
- Storage: 1TB NVMe SSD
Total cost for the overnight run: about $0.50 in electricity, plus Claude API usage.
Why This Matters
The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.
Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.
The Code
The agent is ~300 lines of Python orchestrating:
- Claude Sonnet for reasoning and code generation
- PyTorch for training
- A simple SQLite database tracking all experiments
- Git for version control of each experiment
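The experiment-tracking piece is the simplest part. Here's a sketch of what a SQLite log could look like; the schema and column names are my assumptions, not the author's actual code, and an in-memory database stands in for the real file on disk.

```python
import sqlite3

# In-memory DB for illustration; the real agent would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiments (
        id            INTEGER PRIMARY KEY,
        proposal      TEXT NOT NULL,   -- what the agent changed
        baseline_loss REAL,            -- loss before the change
        result_loss   REAL,            -- loss after training
        kept          INTEGER NOT NULL -- 1 if kept, 0 if reverted
    )
""")
conn.execute(
    "INSERT INTO experiments (proposal, baseline_loss, result_loss, kept)"
    " VALUES (?, ?, ?, ?)",
    ("move LayerNorm to pre-norm position", 3.42, 3.18, 1),
)
kept_count = conn.execute(
    "SELECT COUNT(*) FROM experiments WHERE kept = 1"
).fetchone()[0]
```

Logging every attempt, not just the wins, is what lets the agent reason about what has already been tried.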
It's not magic. It's a loop with good prompts and clear evaluation criteria.
What I Learned
- Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
- Failure is the feature — 93% failure rate is fine when experiments are cheap
- Consumer hardware is enough — you don't need cloud GPUs for meaningful research
- Overnight is the killer use case — run experiments while you sleep, review results over coffee
Try It Yourself
You need:
- A GPU (even a 3060 works)
- An API key for Claude or GPT
- A clear metric to optimize
- Patience to debug the loop
The hardest part isn't the code — it's defining what "better" means for your specific model.
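In practice, "better" has to be reduced to one unambiguous test the loop can run. A minimal sketch, assuming validation loss is the metric and using an illustrative noise threshold:

```python
def is_better(new_loss, best_loss, threshold=0.01):
    """Keep a change only if validation loss drops by more than `threshold`.

    Noise-level wins count as failures, so the agent doesn't accumulate
    changes that merely got lucky on one training run.
    """
    return new_loss < best_loss - threshold
```

A genuine drop passes, a within-noise drop does not: `is_better(3.0, 3.2)` is true, while `is_better(3.195, 3.2)` is false.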
Want me to build an autonomous agent for your workflow? Start a project →