About autoresearch

AutoResearch gives an AI coding agent a single editable training file, a frozen evaluator, and a scalar metric (val_bpb, validation bits per byte, where lower is better), then runs an autonomous loop that modifies the code, trains for 5 minutes, checks whether the result improved, and keeps or reverts the change. It requires just one GPU, one file, and one metric, with no distributed training or complex configs, and can run approximately 100 experiments overnight while the researcher sleeps.

Key Features

Autonomous keep-or-revert experiment loop running ~12 experiments/hour with zero human intervention
Single-GPU, single-file, single-metric design with no external dependencies beyond PyTorch
Research directions specified in a simple markdown file that the agent follows
Fixed 5-minute training window per experiment, so every run on a given machine gets the same time budget and results stay directly comparable
Full experiment history with automatic rollback on regressions for safe overnight runs

Overview

AutoResearch is Andrej Karpathy’s framework for running autonomous ML research experiments. It gives an AI coding agent a single training file, a frozen evaluator, and one scalar metric, then runs a loop: modify the code, train for 5 minutes, check whether the metric improved, and keep or revert the change. The pattern is intentionally minimal (one GPU, one file, one metric), which makes it accessible to individual researchers and small teams.

Key Capabilities

The core loop is simple, and its value comes from volume. Each experiment takes roughly 5 minutes, or about 12 per hour, so one GPU running for about eight hours overnight can evaluate approximately 100 different modifications. The agent reads research directions from a markdown file (program.md) and works through them systematically, keeping a full history of what was tried and what worked. When an experiment makes the metric worse, the system automatically rolls back to the last good state.
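The keep-or-revert loop described above can be sketched in a few lines of Python. This is a minimal illustration, not AutoResearch's actual code: the names keep_or_revert, toy_propose, and toy_evaluate are hypothetical, the "agent" is replaced by a random learning-rate tweak, and the "5-minute training run" is replaced by a cheap toy function. It assumes a lower metric is better, as with a bits-per-byte score.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


def keep_or_revert(
    initial: Dict[str, float],
    propose: Callable[[Dict[str, float], random.Random], Dict[str, float]],
    evaluate: Callable[[Dict[str, float]], float],
    num_experiments: int,
    seed: int = 0,
) -> Tuple[Dict[str, float], float, List[dict]]:
    """Run a keep-or-revert loop where a lower metric is better."""
    rng = random.Random(seed)
    best = dict(initial)
    best_score = evaluate(best)
    history: List[dict] = []
    for i in range(num_experiments):
        # Hypothetical stand-in for the agent editing the training file.
        candidate = propose(dict(best), rng)
        # Hypothetical stand-in for one fixed-budget training run.
        score = evaluate(candidate)
        kept = score < best_score
        history.append({"experiment": i, "score": score, "kept": kept})
        if kept:
            best, best_score = candidate, score
        # Otherwise "revert": the next proposal starts from the unchanged best.
    return best, best_score, history


def toy_propose(config: Dict[str, float], rng: random.Random) -> Dict[str, float]:
    """Toy modification: scale the learning rate by a random factor."""
    config["lr"] = config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return config


def toy_evaluate(config: Dict[str, float]) -> float:
    """Toy metric: distance (in log space) from an arbitrary target lr, plus 1."""
    return abs(math.log10(config["lr"]) - math.log10(3e-3)) + 1.0


if __name__ == "__main__":
    best, best_score, history = keep_or_revert(
        {"lr": 1e-1}, toy_propose, toy_evaluate, num_experiments=100
    )
    kept = sum(1 for h in history if h["kept"])
    print(f"kept {kept} of {len(history)} experiments, best score {best_score:.4f}")
```

Because a candidate is accepted only when it strictly improves the score, the best-known state can never get worse, which is what makes an unattended run safe. In a real setup the revert step would more likely restore a file or version-control state than copy a dictionary.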

Use Cases

AutoResearch is designed for neural network training optimization — finding better architectures, hyperparameters, and training techniques through brute-force automated experimentation. The pattern has been adapted beyond its original ML training context to prompt optimization, GPU kernel tuning, and build-time reduction. Any problem with a single scalar metric, a fast evaluation loop, and a modifiable codebase can benefit from this approach.

Considerations

The framework requires a clear scalar metric that can be evaluated in a short fixed window. Problems without a clean evaluation function or with long training times may not map well to the autoresearch loop. The agent’s modifications are bounded by its understanding of the codebase and the research directions provided, so the quality of the program.md specification directly affects the quality of experiments explored.
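The "short fixed window" requirement can be illustrated with a time-budgeted loop. The sketch below is an assumption about how such a budget might be enforced, not the framework's implementation: run_with_time_budget and FakeClock are hypothetical names, and the fake clock exists only so the example runs instantly and deterministically instead of waiting 5 real minutes.

```python
import time
from typing import Callable


def run_with_time_budget(
    step: Callable[[], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step() repeatedly until the budget elapses; return steps completed."""
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:
        step()
        steps += 1
    return steps


class FakeClock:
    """Hypothetical simulated clock so the example does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


if __name__ == "__main__":
    # Simulate a 300-second (5-minute) budget on two machines.
    slow = FakeClock()
    slow_steps = run_with_time_budget(lambda: slow.advance(0.5), 300.0, clock=slow)

    fast = FakeClock()
    fast_steps = run_with_time_budget(lambda: fast.advance(0.25), 300.0, clock=fast)

    print(slow_steps, fast_steps)  # 600 1200
```

The number of steps that fit in the window depends on how long each step takes, so a slower model or slower hardware completes fewer steps within the same budget. That is one reason problems with long training times are a poor fit: too little progress happens inside the window for the metric to separate good changes from bad ones.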

Who It’s For

AutoResearch is ideal for ML researchers who want to run many automated experiments on a single GPU, teams exploring architecture and hyperparameter spaces, and developers who want to apply the keep-or-revert loop pattern to their own optimization problems. Karpathy designed it for the solo researcher who wants to come back in the morning to a set of validated improvements.
