
How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU

I built an autonomous research agent that proposes neural network changes, trains models, evaluates results, and iterates — all while I sleep. Here's what happened.

Patrick Hughes
8 min read


Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.

The Setup

The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:

  1. Propose — The agent analyzes current model performance and proposes a specific code change
  2. Implement — It writes the actual Python code to modify the neural network
  3. Train — The modified model trains on PubMed medical text data
  4. Evaluate — Loss metrics are compared against the baseline
  5. Decide — If the improvement exceeds a set threshold, keep the change. Otherwise, revert.
  6. Repeat — Go back to step 1 with updated context
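The six steps above can be sketched as a plain Python loop. This is a minimal sketch, not the actual agent: `propose_change`, `apply_change`, `train_and_eval`, and `revert` are hypothetical stand-ins for the Claude call, the code patcher, the PyTorch training run, and the git rollback, and the threshold value is an assumption.

```python
IMPROVEMENT_THRESHOLD = 0.005  # assumed: minimum loss reduction worth keeping

def run_experiments(n_experiments, baseline_loss, propose_change,
                    apply_change, train_and_eval, revert):
    """Propose -> implement -> train -> evaluate -> decide -> repeat."""
    history = []
    best_loss = baseline_loss
    for _ in range(n_experiments):
        proposal = propose_change(history)   # step 1: agent proposes a change
        apply_change(proposal)               # step 2: patch the model code
        loss = train_and_eval()              # steps 3-4: train, measure loss
        if best_loss - loss > IMPROVEMENT_THRESHOLD:
            best_loss = loss                 # step 5: keep the change
            kept = True
        else:
            revert()                         # ...or roll it back
            kept = False
        # step 6: record the outcome so the next proposal sees it
        history.append({"proposal": proposal, "loss": loss, "kept": kept})
    return best_loss, history
```

Passing `history` back into each proposal call is what makes the loop iterate rather than guess blindly: the model sees every change that was kept or reverted.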

The Results

Out of 100 experiments:

  • 93 failed — proposed changes made the model worse or had no effect
  • 7 succeeded — measurable improvements that the agent kept
  • Net result — 25% improvement in model performance

The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.

What the Agent Discovered

The 7 successful experiments included:

  • Learning rate scheduling changes I wouldn't have tried
  • A specific attention head configuration that improved convergence
  • Batch size adjustments that were counterintuitive but worked
  • Layer normalization placement that contradicted my assumptions

The Hardware

This runs on consumer hardware:

  • GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
  • CPU: Standard desktop AMD Ryzen
  • RAM: 32GB
  • Storage: 1TB NVMe SSD

Total cost for the overnight run: about $0.50 in electricity, plus Claude API calls.
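A quick back-of-envelope check of that electricity figure. The wattage, run length, and rate here are my assumptions, not numbers from the post:

```python
# Assumed: ~250 W system draw under load, an 8-hour overnight run,
# and a $0.15/kWh electricity rate.
watts = 250
hours = 8
price_per_kwh = 0.15

# kWh consumed = watts / 1000 * hours; cost = kWh * rate
cost = watts / 1000 * hours * price_per_kwh
print(f"${cost:.2f}")  # prints $0.30 under these assumptions
```

At these assumed figures the run lands around $0.30, consistent with the "about $0.50" estimate even at higher rates or draw.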

Why This Matters

The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.

Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.

The Code

The agent is ~300 lines of Python orchestrating:

  • Claude Sonnet for reasoning and code generation
  • PyTorch for training
  • A simple SQLite database tracking all experiments
  • Git for version control of each experiment
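The SQLite layer of that stack is the simplest piece to show. This is a hypothetical sketch of an experiment log, one row per run, so the agent can feed past results back into its next prompt; the schema and column names are my invention:

```python
import sqlite3

# In-memory DB for illustration; the real agent would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS experiments (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        proposal   TEXT NOT NULL,   -- what the agent tried
        git_commit TEXT,            -- commit hash for the experiment branch
        val_loss   REAL,            -- evaluation result
        kept       INTEGER DEFAULT 0  -- 1 if the change beat the threshold
    )
""")

# Log one experiment outcome.
conn.execute(
    "INSERT INTO experiments (proposal, git_commit, val_loss, kept) "
    "VALUES (?, ?, ?, ?)",
    ("try cosine LR schedule", "abc1234", 1.87, 1),
)
conn.commit()

rows = conn.execute(
    "SELECT proposal, val_loss, kept FROM experiments"
).fetchall()
print(rows)
```

Pairing each row with a git commit hash is what makes "revert" cheap: a failed experiment is just a checkout away from gone.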

It's not magic. It's a loop with good prompts and clear evaluation criteria.

What I Learned

  1. Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
  2. Failure is the feature — 93% failure rate is fine when experiments are cheap
  3. Consumer hardware is enough — you don't need cloud GPUs for meaningful research
  4. Overnight is the killer use case — run experiments while you sleep, review results over coffee

Try It Yourself

You need:

  • A GPU (even a 3060 works)
  • An API key for Claude or GPT
  • A clear metric to optimize
  • Patience to debug the loop

The hardest part isn't the code — it's defining what "better" means for your specific model.
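One hedged way to pin "better" down, assuming validation loss is the metric: require a relative improvement large enough that run-to-run noise can't trigger a keep. The function and the 1% floor here are illustrative, not from the original agent:

```python
def is_better(new_loss, baseline_loss, min_relative_gain=0.01):
    """True if new_loss beats baseline_loss by at least min_relative_gain.

    A relative threshold (default 1%) keeps noise-level "wins" from
    being mistaken for real improvements.
    """
    return (baseline_loss - new_loss) / baseline_loss >= min_relative_gain

# 2.00 -> 1.90 is a 5% gain: keep it.
# 2.00 -> 1.995 is a 0.25% gain: likely noise, revert.
```

Whatever the exact rule, it has to be decidable by code alone; any metric that needs a human to interpret it breaks the overnight loop.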


Want me to build an autonomous agent for your workflow? Start a project →
