fullautoresearch

Comparative analysis of Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 as fully autonomous ML researchers.

Based on karpathy/autoresearch

Overview

What is fullautoresearch?

An open-source framework that gives AI models full autonomy to design, implement, and evaluate ML experiments. Each model operates as a complete research agent -- reading data, writing code, debugging failures, and iterating on results -- with no human in the loop.

Autonomous Experimentation

Models design ML experiments, write training scripts, handle errors, and evaluate results independently.

Four-Model Comparison

Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 on identical benchmarks.

Reproducible Pipeline

Every experiment is logged, versioned, and reproducible. Setup scripts and structured output included.

Cost Tracking

Per-experiment and per-model cost tracking. Know exactly what you spent.

Phase 1 Results

Exploratory Sequential Optimization

362 experiments across three Anthropic models. Each model built on the previous model's best result -- different baselines, so improvement percentages are not directly comparable.

[Charts: Keep Rate vs. Crash Rate · Skip Rate (instruction-following reliability) · Loss Improvement from Baseline (%)]

| Model | Exps | Kept | Disc. | Crashed | Skipped | Keep % | Crash % | Skip % | Baseline | Best | Improv. | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sonnet 4 | 147 | 5 | 45 | 50 | 47 | 5.2% | 34% | 32% | 0.948083 | 0.936221 | 1.25% | ~$15 |
| Sonnet 4.6 | 104 | 21 | 70 | 9 | 4 | 22.1% | 8.7% | 4% | 1.245301 | 0.955865 | 23.24% | ~$11 |
| Opus 4.6 (32k) | 111 | 8 | 64 | 21 | 18 | 8.9% | 18.9% | 16% | 0.955747 | 0.901860 | 5.64% | ~$55 |

Keep rate computed over non-crashed experiments. Skipped = LLM returned unparseable response. Different baselines per model (sequential design).
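
The rates in the table can be re-derived from the raw counts. A minimal sketch in shell (counts hard-coded from the table above; nothing about the repo's actual log format is assumed):

```shell
# Keep rate is over non-crashed experiments; crash and skip rates are over all.
rate() {
  exps=$1; kept=$2; crashed=$3; skipped=$4
  awk -v k="$kept" -v c="$crashed" -v s="$skipped" -v e="$exps" \
    'BEGIN{printf "keep=%.1f%% crash=%.1f%% skip=%.1f%%\n", 100*k/(e-c), 100*c/e, 100*s/e}'
}
rate 104 21 9 4   # Sonnet 4.6 -> keep=22.1% crash=8.7% skip=3.8% (table rounds skip to 4%)
```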

Key Finding

Extended Thinking Enables Multi-Step Reasoning

Opus 4.6 with 32K extended thinking discovered an optimization chain that required understanding second-order effects -- something the models running without extended thinking failed to find.

The grad_accum=1 Discovery Chain

1. Removed gradient accumulation

grad_accum 2 -> 1, doubling optimizer steps within the 5-minute training window.

2. Recognized doubled weight decay

More optimizer steps = more weight decay applications per run. The model predicted this side effect.

3. Halved weight decay to compensate

Adjusted weight decay down to offset the increased application frequency.

4. Increased matrix learning rate

With more frequent but smaller effective steps, raised the Muon learning rate for matrix parameters.

5. Added exponential x0_lambda gradient

Introduced exponential scheduling for initial parameter regularization -- depth-aware skip connections.

Each step depends on the previous one. Without extended thinking, the model would need multiple experiment rounds to discover each compensation independently. Opus proposed steps 1-4 as a single coherent change that succeeded on the first attempt.
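
The arithmetic behind steps 1-3 can be sketched as a back-of-the-envelope check (illustrative numbers and variable names only; these are not the framework's actual config keys):

```shell
grad_accum_old=2
grad_accum_new=1
steps_old=1000   # optimizer steps in the 5-minute window (illustrative)
# Halving accumulation doubles the number of optimizer steps.
steps_new=$((steps_old * grad_accum_old / grad_accum_new))
# Weight decay is applied once per optimizer step, so twice as many steps
# means the per-step decay must be halved to keep total decay comparable.
wd_old="0.10"
wd_new=$(awk -v w="$wd_old" -v n="$grad_accum_new" -v o="$grad_accum_old" \
  'BEGIN{printf "%.2f", w * n / o}')
echo "steps: $steps_old -> $steps_new, weight_decay: $wd_old -> $wd_new"
```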

In Progress

Phase 2: Controlled Head-to-Head Comparison

Phase 1 had different baselines per model, making direct comparison unreliable. Phase 2 fixes this -- all models start from byte-identical code with the same unoptimized baseline.

Design

  • Identical baseline: all 4 models start from val_bpb 1.097 (upstream Karpathy defaults)
  • Replication: n=3 runs per model, 12 total runs x 100 experiments each
  • Models: Sonnet 4, Sonnet 4.6, Opus 4.6, GPT-4.1
  • Hardware: RTX 5070 Ti (Blackwell, SM 12.0)
  • Frozen commit: 3fb6704 (byte-identical starting code)

What Phase 2 Enables

  • Statistical comparison with confidence intervals
  • Reproducibility assessment via variance across runs
  • Learning curve analysis -- how fast each model finds key optimizations
  • Strategy diversity -- do different runs discover the same paths?
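
As one example of the statistical comparison, the mean and spread of a final metric across the n=3 replicated runs can be computed with a one-liner (the values below are placeholders, not Phase 2 results):

```shell
# Mean and sample standard deviation of a metric (e.g. final val_bpb)
# across replicated runs for one model.
stats() {
  printf '%s\n' "$@" | awk '
    { s += $1; ss += $1 * $1; n++ }
    END { m = s / n; sd = sqrt((ss - n * m * m) / (n - 1));
          printf "mean=%.4f sd=%.4f\n", m, sd }'
}
stats 1.010 1.030 1.020
```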

Results will be published in v2 of the paper.

| Runs | Model | API | Status | Est. Cost |
|---|---|---|---|---|
| 1-3 | GPT-4.1 | Azure OpenAI | In progress | $0 |
| 4-6 | Sonnet 4.6 | Anthropic | Planned | $18 |
| 7-9 | Sonnet 4 | Anthropic | Planned | $18 |
| 10-12 | Opus 4.6 | Anthropic | Planned | $150 |
| Total (12 runs x 100 experiments) | | | | $186 |

Getting Started

Quick Start

Get running in under five minutes. Copy each block directly.

1. Clone and Setup
# Clone and setup
git clone https://github.com/bmdhodl/fullautoresearch.git
cd fullautoresearch
bash scripts/setup.sh
2. Run Pre-flight Tests
# Run pre-flight tests
bash scripts/test.sh
3. Run with Sonnet 4.6 (Recommended)
# Run with Sonnet 4.6 (recommended)
AUTORESEARCH_DEPTH=8 AUTORESEARCH_BATCH_SIZE=16 \
  bash scripts/run_forever.sh --dataset pubmed --tag my-run
4. Run with Opus 4.6 (Higher Cost)
# Run with Opus 4.6
bash scripts/run_forever.sh --opus --dataset pubmed --tag my-opus-run
5. Run with GPT-4.1 via Azure
# Run with GPT-4.1 via Azure
bash scripts/run_forever.sh --azure gpt-4.1 --dataset pubmed --tag my-azure-run
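
The run scripts also need provider credentials. Assuming the standard environment variable names used by the Anthropic and Azure OpenAI SDKs (check scripts/setup.sh for the names this repo actually reads):

```shell
# Anthropic models (Sonnet 4 / 4.6, Opus 4.6)
export ANTHROPIC_API_KEY="sk-ant-..."
# Azure OpenAI (GPT-4.1) -- endpoint is your Azure resource URL
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
```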

Cost Estimation

Experiment Cost Calculator

API cost per experiment (March 2026 pricing). GPU electricity not included -- see Table 5 in the paper for full breakdown.

$6.00

Estimated API Cost

100 experiments x $0.06/exp (Sonnet 4.6)
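
The calculator's arithmetic is simply experiments times per-experiment rate. A quick sketch (the Opus rate below is a rough inference from Phase 1's ~$55 over 111 experiments, not an official figure):

```shell
# Estimated API cost = number of experiments x per-experiment cost.
cost() { awk -v n="$1" -v c="$2" 'BEGIN{printf "$%.2f\n", n * c}'; }
cost 100 0.06   # Sonnet 4.6 -> $6.00
cost 100 0.50   # Opus 4.6 (rough, inferred) -> $50.00
```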

Acknowledgments

Thank You

This project builds on Andrej Karpathy's original autoresearch framework.

Built with PyTorch, Triton, HuggingFace, and Rich. Model APIs from Anthropic and Microsoft Azure / OpenAI.

Thanks to Renato Umeton, Ph.D. for publication guidance and championing open-source ML research, Dave Graham from ML Commons for parallel research collaboration and accountability, and the LinkedIn ML/NLP community for feedback.