# autoresearch

> Autonomous AI agent that iteratively experiments on single-GPU LLM training code overnight while you sleep.

autoresearch is an open-source framework that puts an AI agent in charge of running machine learning experiments. You point a coding agent (Claude, Codex, or similar) at a small but real LLM training setup. The agent then modifies the training code, runs a 5-minute training experiment, checks whether the validation metric improved, keeps or discards the change, and repeats. By morning you have a log of experiments and (hopefully) a better model.
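
The modify, train, evaluate, keep-or-discard cycle described above can be sketched as a greedy search. This is a minimal illustration, not the project's actual code: `keep_or_discard_loop`, `propose`, and `evaluate` are hypothetical names, with `propose` standing in for the agent's edit to the training code and `evaluate` standing in for a training run that returns the validation metric.

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # a "candidate": any representation of the training code


def keep_or_discard_loop(
    baseline: C,
    propose: Callable[[C], C],       # hypothetical: the agent's edit step
    evaluate: Callable[[C], float],  # hypothetical: train, then return val_bpb
    n_experiments: int,
) -> Tuple[C, float, List[dict]]:
    """Greedy loop: keep a change only if it lowers the validation metric."""
    best = baseline
    best_bpb = evaluate(baseline)
    log: List[dict] = []
    for i in range(n_experiments):
        candidate = propose(best)      # modify the current best version
        bpb = evaluate(candidate)      # run one experiment
        kept = bpb < best_bpb          # lower is better
        log.append({"experiment": i, "val_bpb": bpb, "kept": kept})
        if kept:
            best, best_bpb = candidate, bpb
    return best, best_bpb, log


if __name__ == "__main__":
    # Toy stand-in: the "code" is a single number and the "metric" is
    # its squared distance from 3.0.
    steps = iter([1.0, -0.5, 1.0, 2.0])
    best, best_bpb, log = keep_or_discard_loop(
        baseline=0.0,
        propose=lambda x: x + next(steps),
        evaluate=lambda x: (x - 3.0) ** 2,
        n_experiments=4,
    )
    print(best, best_bpb, [entry["kept"] for entry in log])
```

In the toy run, a change is kept only when it strictly improves on the best result so far, so a tie counts as a discard. Whether the real agent treats ties this way is an assumption of the sketch.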

The training base is a simplified single-GPU implementation of nanochat (a minimal GPT training codebase). The human's role shifts from writing Python to writing `program.md`, a Markdown file that acts as the agent's standing instructions and research strategy. The agent edits only `train.py`, which contains the full model architecture, optimizer, and training loop.

- **Fixed Time Budget** - Every experiment runs for exactly 5 wall-clock minutes, making results directly comparable regardless of architecture or hyperparameter changes
- **Single Metric** - Validation bits-per-byte (val_bpb) is the sole objective; lower is better, and the metric is independent of vocabulary size, so architectural changes are compared fairly
- **Single File Editing** - The agent only modifies `train.py`, keeping diffs small and reviewable
- **program.md Interface** - Human researchers guide the agent by editing a Markdown instruction file rather than Python code
- **Self-Contained** - No distributed training, no complex configs; one GPU, one file, one metric
- **Autonomous Loop** - About 12 experiments per hour, or roughly 100 in an overnight run
- **MIT Licensed** - Open source under the permissive MIT license
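
To illustrate why bits-per-byte is independent of vocabulary size, here is a minimal sketch of the standard conversion from summed cross-entropy loss (in nats) to bits per byte. The function names are hypothetical, and the way the project's own code accumulates the loss is not shown in this description. The key point is that the denominator counts bytes of text rather than tokens.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) over a text into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # 1 nat = 1 / ln(2) bits; normalize by bytes of text, not token count.
    return total_nats / (math.log(2) * total_bytes)


def bpb_from_mean_loss(mean_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Same conversion, starting from a mean per-token loss."""
    return bits_per_byte(mean_nats_per_token * n_tokens, n_bytes)


if __name__ == "__main__":
    # Two hypothetical tokenizers encode the same 1000 bytes of text.
    # A: 250 tokens at 4 bits each; B: 500 tokens at 2 bits each.
    a = bpb_from_mean_loss(4 * math.log(2), n_tokens=250, n_bytes=1000)
    b = bpb_from_mean_loss(2 * math.log(2), n_tokens=500, n_bytes=1000)
    print(a, b)  # per-token losses differ, bits per byte agree
```

The two hypothetical tokenizers report very different per-token losses, yet both spend the same number of bits on the same bytes, which is why a change that alters the tokenizer or vocabulary can still be compared on this metric.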

To get started, clone the repository, install dependencies via `uv`, run `prepare.py` once to download data and train a BPE tokenizer, then spin up your AI coding agent pointed at `program.md`.

## Features
- Autonomous AI agent loop: modify → train → evaluate → keep or discard
- Fixed 5-minute wall-clock training budget per experiment
- Validation bits-per-byte (val_bpb) as the single comparable metric
- Agent-editable train.py with full GPT model, Muon+AdamW optimizer, and training loop
- Human-editable program.md for setting agent research strategy
- ~100 experiments possible in a single overnight run
- Single NVIDIA GPU support (tested on H100)
- MIT license — fully open source
- Built on nanochat, a minimal GPT training codebase
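
A fixed wall-clock budget means training stops on elapsed time rather than on a step count. The sketch below shows one way such a loop could look. `train_for_budget` and `step_fn` are hypothetical names, and the real `train.py` may structure its loop differently.

```python
import time
from typing import Callable


def train_for_budget(
    step_fn: Callable[[int], None],  # hypothetical: runs one optimizer step
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run training steps until the wall-clock budget is used up.

    Returns the number of steps completed. A step that starts before the
    deadline is allowed to finish, so the total can overshoot slightly.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn(steps)
        steps += 1
    return steps


if __name__ == "__main__":
    # Tiny demo with a short budget; an experiment as described above
    # would use budget_seconds=300.
    done = train_for_budget(lambda i: time.sleep(0.01), budget_seconds=0.1)
    print(f"completed {done} steps")
```

Under this scheme a change that makes each step cheaper is rewarded with more steps in the same budget, and a change that makes each step slower gets fewer. With a 300-second budget, at most 3600 / 300 = 12 runs fit in an hour before startup and evaluation overhead, which is consistent with the throughput figure above.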

## Integrations
Claude (Anthropic), OpenAI Codex, uv package manager, PyTorch, nanochat

## Platforms
Linux, Developer SDK

## Pricing
Open Source

## Links
- Website: https://github.com/karpathy/autoresearch
- Repository: https://github.com/karpathy/autoresearch
- EveryDev.ai: https://www.everydev.ai/tools/autoresearch