[bmdpat]

How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU

I let an autonomous agent run 100 ML experiments while I slept. Seven succeeded. Net result: a 25% model improvement. Here's the setup.


Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.

The Setup

The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:

  1. Propose — The agent analyzes current model performance and proposes a specific code change
  2. Implement — It writes the actual Python code to modify the neural network
  3. Train — The modified model trains on PubMed medical text data
  4. Evaluate — Loss metrics are compared against the baseline
  5. Decide — If improvement > threshold, keep the change. Otherwise, revert.
  6. Repeat — Go back to step 1 with updated context
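The six steps above can be sketched as a small control loop. This is a minimal illustration, not the actual agent: `run_loop`, `Experiment`, and the `min_rel_improvement` threshold are hypothetical names, and the `propose` and `train_and_eval` callables stand in for the Claude call and the PyTorch training run so the sketch runs without a GPU or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Experiment:
    idea: str     # short description of the proposed change
    loss: float   # evaluation loss after training with the change
    kept: bool    # True if the change beat the current best


def run_loop(
    propose: Callable[[List[Experiment]], str],
    train_and_eval: Callable[[str], float],
    baseline_loss: float,
    n_experiments: int,
    min_rel_improvement: float = 0.01,  # assumed threshold, not from the post
) -> Tuple[float, List[Experiment]]:
    """Run propose -> train -> evaluate -> decide for n_experiments rounds."""
    best_loss = baseline_loss
    history: List[Experiment] = []
    for _ in range(n_experiments):
        idea = propose(history)       # steps 1-2: propose and implement
        loss = train_and_eval(idea)   # steps 3-4: train, then measure loss
        rel_gain = (best_loss - loss) / best_loss
        kept = rel_gain > min_rel_improvement  # step 5: decide
        if kept:
            best_loss = loss          # the kept change becomes the new baseline
        # When not kept, a real agent would revert the code change here.
        history.append(Experiment(idea, loss, kept))  # step 6: updated context
    return best_loss, history


if __name__ == "__main__":
    # Stand-ins: canned losses instead of real training runs.
    fake_losses = iter([2.10, 1.90, 1.95, 1.70])
    best, history = run_loop(
        propose=lambda h: f"idea-{len(h)}",
        train_and_eval=lambda idea: next(fake_losses),
        baseline_loss=2.00,
        n_experiments=4,
    )
    print(best, [e.kept for e in history])
```

Comparing each result against the best loss so far, rather than the original baseline, means a later experiment only counts as a win if it improves on every change already kept.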

The Results

Out of 100 experiments:

  • 93 failed — proposed changes made the model worse or had no effect
  • 7 succeeded — measurable improvements that the agent kept
  • Net result — 25% improvement in model performance

The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
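To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It rests on two assumptions the post does not state: that "25% improvement" means a 25% relative reduction in loss, and that the kept changes compound multiplicatively with equal size. Under those assumptions each kept change would be worth about 4%. The function names are illustrative.

```python
def hit_rate(kept: int, total: int) -> float:
    """Fraction of experiments whose change was kept."""
    return kept / total


def per_win_gain(total_gain: float, wins: int) -> float:
    """Equal relative gain per kept change, assuming gains compound.

    Solves (1 - g) ** wins == 1 - total_gain for g.
    """
    return 1 - (1 - total_gain) ** (1 / wins)


if __name__ == "__main__":
    print(f"hit rate: {hit_rate(7, 100):.0%}")
    print(f"gain per kept change: {per_win_gain(0.25, 7):.1%}")
```

Seen this way, the overnight result is a stack of small wins rather than one breakthrough, which is why a low hit rate is tolerable.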

What the Agent Discovered

The 7 successful experiments included:

  • Learning rate scheduling changes I wouldn't have tried
  • A specific attention head configuration that improved convergence
  • Batch size adjustments that were counterintuitive but worked
  • Layer normalization placement that contradicted my assumptions

The Hardware

This runs on consumer hardware:

  • GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
  • CPU: Standard desktop AMD Ryzen
  • RAM: 32GB
  • Storage: 1TB NVMe SSD

Total cost for the overnight run: about $0.50 in electricity, plus the Claude API calls. The per-experiment compute cost stays near zero because training runs on the local GPU rather than on rented cloud hardware. Here's what serving a live LLM from a consumer GPU actually looks like in production.
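The electricity figure above is easy to sanity-check. The inputs below are assumptions for illustration, not measurements from this run: an average whole-system draw of 350 W, an 8-hour night, and a rate of $0.15 per kWh. Claude API charges are separate and not modeled here.

```python
def electricity_cost(avg_watts: float, hours: float, usd_per_kwh: float) -> float:
    """Energy cost in USD for a machine drawing avg_watts for the given hours."""
    kwh = avg_watts / 1000 * hours
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Assumed values: 350 W system draw, 8 hours, $0.15 per kWh.
    cost = electricity_cost(avg_watts=350, hours=8, usd_per_kwh=0.15)
    print(f"${cost:.2f}")
```

With those inputs the estimate lands in the same range as the roughly $0.50 quoted above. Plug in your own wattage and utility rate to get a figure for your setup.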

Why This Matters

The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.

Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.

The Code

The agent is ~300 lines of Python orchestrating:

  • Claude Sonnet for reasoning and code generation
  • PyTorch for training
  • A simple SQLite database tracking all experiments
  • Git for version control of each experiment

It's not magic. It's a loop with good prompts and clear evaluation criteria.
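As a sketch of the bookkeeping side, here is one way the SQLite tracking and the Git keep-or-revert step could look. The schema, column names, and helper functions are hypothetical, since the post does not show the real ones. The Git helper assumes the script runs inside an existing repository and that each experiment only modifies tracked files.

```python
import sqlite3
import subprocess
from typing import Optional, Tuple

# Hypothetical schema: one row per experiment.
SCHEMA = """
CREATE TABLE IF NOT EXISTS experiments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    idea TEXT NOT NULL,
    val_loss REAL NOT NULL,
    kept INTEGER NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the experiment log. Defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_experiment(conn: sqlite3.Connection, idea: str, val_loss: float, kept: bool) -> None:
    """Record one experiment, whether it was kept or reverted."""
    conn.execute(
        "INSERT INTO experiments (idea, val_loss, kept) VALUES (?, ?, ?)",
        (idea, val_loss, int(kept)),
    )
    conn.commit()


def summary(conn: sqlite3.Connection) -> Tuple[int, int, Optional[float]]:
    """Return (total experiments, kept experiments, best loss among kept)."""
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(kept), 0), "
        "MIN(CASE WHEN kept = 1 THEN val_loss END) FROM experiments"
    ).fetchone()
    return row[0], row[1], row[2]


def keep_or_revert(kept: bool, message: str) -> None:
    """Commit a kept change, or discard edits to tracked files otherwise."""
    if kept:
        subprocess.run(["git", "commit", "-am", message], check=True)
    else:
        subprocess.run(["git", "checkout", "--", "."], check=True)
```

Logging the failures alongside the wins matters: the history of what did not work is part of the context the agent sees when it proposes the next change.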

What I Learned

  1. Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
  2. Failure is the feature — 93% failure rate is fine when experiments are cheap
  3. Consumer hardware is enough — you don't need cloud GPUs for meaningful research
  4. Overnight is the killer use case — run experiments while you sleep, review results over coffee

Try It Yourself

You need:

  • A GPU (even a 3060 works)
  • An API key for Claude or GPT
  • A clear metric to optimize
  • Patience to debug the loop

The hardest part isn't the code — it's defining what "better" means for your specific model.
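One way to make "better" concrete is to require a candidate to beat the baseline by more than both a fixed margin and the run-to-run noise. The sketch below is an assumption about how such a rule might look, not the rule used in this project: `is_better`, `min_rel_gain`, and `noise_factor` are illustrative names, and it presumes you can afford a few repeated evaluations per model.

```python
from statistics import mean, stdev
from typing import Sequence


def is_better(
    baseline_losses: Sequence[float],
    candidate_losses: Sequence[float],
    min_rel_gain: float = 0.01,  # assumed minimum relative improvement
    noise_factor: float = 1.0,   # how many baseline std devs the gain must clear
) -> bool:
    """Return True if the candidate's mean loss clearly beats the baseline's."""
    base = mean(baseline_losses)
    cand = mean(candidate_losses)
    rel_gain = (base - cand) / base
    # Relative spread of the baseline across repeated runs (0 if only one run).
    noise = stdev(baseline_losses) / base if len(baseline_losses) > 1 else 0.0
    return rel_gain > max(min_rel_gain, noise_factor * noise)


if __name__ == "__main__":
    baseline = [2.00, 2.02, 1.98]
    print(is_better(baseline, [1.90, 1.91, 1.89]))
    print(is_better(baseline, [1.99, 2.00, 1.98]))
```

Without a noise check, an agent running 100 experiments will eventually "discover" improvements that are only lucky seeds.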

Curious about commissioning something like this rather than building it yourself? Here's what custom autonomous agents actually cost in 2026.




Patrick Hughes

Building BMD HODL, a one-person AI-operated holding company. Nashville, Tennessee. Twenty-two agents.
