fullautoresearch

Comparative analysis of Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 as fully autonomous ML researchers.

Based on karpathy/autoresearch

Overview

What is fullautoresearch?

An open-source framework that gives AI models full autonomy to design, implement, and evaluate ML experiments. Each model operates as a complete research agent -- reading data, writing code, debugging failures, and iterating on results -- with no human in the loop.

Autonomous Experimentation

Models design ML experiments, write training scripts, handle errors, and evaluate results independently.

Four-Model Comparison

Claude Sonnet 4, Sonnet 4.6, Opus 4.6, and GPT-4.1 on identical benchmarks.

Reproducible Pipeline

Every experiment is logged, versioned, and reproducible. Setup scripts and structured output included.

Cost Tracking

Per-experiment and per-model cost tracking. Know exactly what you spent.

Phase 1 Results

Exploratory Sequential Optimization

362 experiments across three Anthropic models. Each model built on the previous model's best result -- different baselines, so improvement percentages are not directly comparable.

[Charts: Keep Rate vs. Crash Rate · Skip Rate (instruction-following reliability) · Loss Improvement from Baseline (%)]

| Model | Exps | Kept | Disc. | Crashed | Skipped | Keep % | Crash % | Skip % | Baseline | Best | Improv. | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sonnet 4 | 147 | 5 | 45 | 50 | 47 | 5.2% | 34% | 32% | 0.948083 | 0.936221 | 1.25% | ~$15 |
| Sonnet 4.6 | 104 | 21 | 70 | 9 | 4 | 22.1% | 8.7% | 4% | 1.245301 | 0.955865 | 23.24% | ~$11 |
| Opus 4.6 (32k) | 111 | 8 | 64 | 21 | 18 | 8.9% | 18.9% | 16% | 0.955747 | 0.901860 | 5.64% | ~$55 |

Keep rate computed over non-crashed experiments. Skipped = LLM returned unparseable response. Different baselines per model (sequential design).
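
The rates in the table can be re-derived from the raw counts. A minimal sketch in shell (counts hard-coded from the table above; nothing about the repo's actual log format is assumed):

```shell
# Keep rate is over non-crashed experiments; crash and skip rates are over all.
rate() {
  exps=$1; kept=$2; crashed=$3; skipped=$4
  awk -v k="$kept" -v c="$crashed" -v s="$skipped" -v e="$exps" \
    'BEGIN{printf "keep=%.1f%% crash=%.1f%% skip=%.1f%%\n", 100*k/(e-c), 100*c/e, 100*s/e}'
}
rate 104 21 9 4   # Sonnet 4.6 -> keep=22.1% crash=8.7% skip=3.8% (table rounds skip to 4%)
```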

Key Finding

Extended Thinking Enables Multi-Step Reasoning

Opus 4.6 with 32K extended thinking discovered an optimization chain that required understanding second-order effects -- something the models running without extended thinking failed to find.

The grad_accum=1 Discovery Chain

1. Removed gradient accumulation

grad_accum 2 -> 1, doubling optimizer steps within the 5-minute training window.

2. Recognized doubled weight decay

More optimizer steps = more weight decay applications per run. The model predicted this side effect.

3. Halved weight decay to compensate

Adjusted weight decay down to offset the increased application frequency.

4. Increased matrix learning rate

With more frequent but smaller effective steps, raised the Muon learning rate for matrix parameters.

5. Added exponential x0_lambda gradient

Introduced exponential scheduling for initial parameter regularization -- depth-aware skip connections.

Each step depends on the previous one. Without extended thinking, the model would need multiple experiment rounds to discover each compensation independently. Opus proposed steps 1-4 as a single coherent change that succeeded on the first attempt.
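
The arithmetic behind steps 1-3 can be sketched as a back-of-the-envelope check (illustrative numbers and variable names only; these are not the framework's actual config keys):

```shell
grad_accum_old=2
grad_accum_new=1
steps_old=1000   # optimizer steps in the 5-minute window (illustrative)
# Halving accumulation doubles the number of optimizer steps.
steps_new=$((steps_old * grad_accum_old / grad_accum_new))
# Weight decay is applied once per optimizer step, so twice as many steps
# means the per-step decay must be halved to keep total decay comparable.
wd_old="0.10"
wd_new=$(awk -v w="$wd_old" -v n="$grad_accum_new" -v o="$grad_accum_old" \
  'BEGIN{printf "%.2f", w * n / o}')
echo "steps: $steps_old -> $steps_new, weight_decay: $wd_old -> $wd_new"
```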

In Progress

Phase 2: Controlled Head-to-Head Comparison

Phase 1 had different baselines per model, making direct comparison unreliable. Phase 2 fixes this -- all models start from byte-identical code with the same unoptimized baseline.

Design

  • Identical baseline: all 4 models start from val_bpb 1.097 (upstream Karpathy defaults)
  • Replication: n=3 runs per model, 12 total runs x 100 experiments each
  • Models: Sonnet 4, Sonnet 4.6, Opus 4.6, GPT-4.1
  • Hardware: RTX 5070 Ti (Blackwell, SM 12.0)
  • Frozen commit: 3fb6704 (byte-identical starting code)

What Phase 2 Enables

  • Statistical comparison with confidence intervals
  • Reproducibility assessment via variance across runs
  • Learning curve analysis -- how fast each model finds key optimizations
  • Strategy diversity -- do different runs discover the same paths?
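
As one example of the statistical comparison, the mean and spread of a final metric across the n=3 replicated runs can be computed with a one-liner (the values below are placeholders, not Phase 2 results):

```shell
# Mean and sample standard deviation of a metric (e.g. final val_bpb)
# across replicated runs for one model.
stats() {
  printf '%s\n' "$@" | awk '
    { s += $1; ss += $1 * $1; n++ }
    END { m = s / n; sd = sqrt((ss - n * m * m) / (n - 1));
          printf "mean=%.4f sd=%.4f\n", m, sd }'
}
stats 1.010 1.030 1.020
```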

Results will be published in v2 of the paper.

| Runs | Model | API | Status | Est. Cost |
|---|---|---|---|---|
| 1-3 | GPT-4.1 | Azure OpenAI | In progress | $0 |
| 4-6 | Sonnet 4.6 | Anthropic | Planned | $18 |
| 7-9 | Sonnet 4 | Anthropic | Planned | $18 |
| 10-12 | Opus 4.6 | Anthropic | Planned | $150 |
| Total (12 runs x 100 experiments) | | | | $186 |

Getting Started

Quick Start

Get running in under five minutes. Copy each block directly.

1. Clone and Setup
# Clone and setup
git clone https://github.com/bmdhodl/fullautoresearch.git
cd fullautoresearch
bash scripts/setup.sh
2. Run Pre-flight Tests
# Run pre-flight tests
bash scripts/test.sh
3. Run with Sonnet 4.6 (Recommended)
# Run with Sonnet 4.6 (recommended)
AUTORESEARCH_DEPTH=8 AUTORESEARCH_BATCH_SIZE=16 \
  bash scripts/run_forever.sh --dataset pubmed --tag my-run
4. Run with Opus 4.6 (Higher Cost)
# Run with Opus 4.6
bash scripts/run_forever.sh --opus --dataset pubmed --tag my-opus-run
5. Run with GPT-4.1 via Azure
# Run with GPT-4.1 via Azure
bash scripts/run_forever.sh --azure gpt-4.1 --dataset pubmed --tag my-azure-run
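
The run scripts also need provider credentials. Assuming the standard environment variable names used by the Anthropic and Azure OpenAI SDKs (check scripts/setup.sh for the names this repo actually reads):

```shell
# Anthropic models (Sonnet 4 / 4.6, Opus 4.6)
export ANTHROPIC_API_KEY="sk-ant-..."
# Azure OpenAI (GPT-4.1) -- endpoint is your Azure resource URL
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
```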

Cost Estimation

Experiment Cost Calculator

API cost per experiment (March 2026 pricing). GPU electricity not included -- see Table 5 in the paper for full breakdown.

$6.00

Estimated API Cost

100 experiments x $0.06/exp (Sonnet 4.6)
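
The calculator's arithmetic is simply experiments times per-experiment rate. A quick sketch (the Opus rate below is a rough inference from Phase 1's ~$55 over 111 experiments, not an official figure):

```shell
# Estimated API cost = number of experiments x per-experiment cost.
cost() { awk -v n="$1" -v c="$2" 'BEGIN{printf "$%.2f\n", n * c}'; }
cost 100 0.06   # Sonnet 4.6 -> $6.00
cost 100 0.50   # Opus 4.6 (rough, inferred) -> $50.00
```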

Acknowledgments

Thank You

This project builds on Andrej Karpathy's original autoresearch framework.

Built with PyTorch, Triton, HuggingFace, and Rich. Model APIs from Anthropic and Microsoft Azure / OpenAI.

Thanks to Renato Umeton, Ph.D. for publication guidance and championing open-source ML research, Dave Graham from ML Commons for parallel research collaboration and accountability, and the LinkedIn ML/NLP community for feedback.