# nanoRL

> Minimal, single-file implementations of SFT, DPO, GRPO, and PPO for fine-tuning language models after pretraining, each ~100-180 lines and runnable on a single GPU.

nanoRL is an open-source educational library by Ethan He that provides minimal, self-contained implementations of four widely used post-training algorithms for language models: SFT, DPO, GRPO, and PPO. Each file is roughly 100 to 180 lines of Python and converges on a toy arithmetic task in 30 steps on a single GPU or on an M-series Mac via MPS. The project is released under the MIT license.

## What It Is

nanoRL covers SFT (supervised fine-tuning), DPO (direct preference optimization), GRPO (group relative policy optimization, as used in DeepSeek-R1), and PPO (proximal policy optimization with a separate critic transformer, the InstructGPT setup). The project is explicitly inspired by Andrej Karpathy's nanoGPT and shares the same didactic goal: make each algorithm readable end-to-end in a single file, without the scaffolding of production RLHF stacks.

## The Four Algorithms and Their Supervision Axis

The four files are organized along a single axis — what kind of supervision is available:

- **SFT** — needs full demonstrations `(prompt, target)`; one model loaded; masked cross-entropy loss.
- **DPO** — needs preference pairs `(prompt, chosen, rejected)`; two models (policy + frozen reference); no rollouts required.
- **GRPO** — needs only a reward function; two models (policy + reference); group-mean baseline replaces a value model.
- **PPO** — needs only a reward function; three models (policy, reference, critic transformer); GAE advantage estimation.

Reading them in order shows how each algorithm adds machinery to handle progressively weaker supervision signals.
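The two offline objectives at the start of that progression can be sketched without any framework. The snippet below is a minimal illustration, not the repository's code: it works on plain lists of log-probabilities rather than tensors, and the function names and the `beta=0.1` default are assumptions chosen for the example.

```python
import math

def masked_nll(token_logps, loss_mask):
    """SFT: mean negative log-likelihood over target tokens only (mask == 1)."""
    kept = [-lp for lp, m in zip(token_logps, loss_mask) if m]
    return sum(kept) / len(kept)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO for one preference pair, given summed sequence log-probs.

    beta is a hypothetical default, not a value taken from nanoRL.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Prompt tokens are masked out; only the last two (target) tokens are scored.
sft = masked_nll([-2.0, -1.5, -0.2, -0.4], [0, 0, 1, 1])

# When the policy equals the reference the margin is 0 and the loss is log(2).
start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising chosen and lowering rejected widens the margin and lowers the loss.
later = dpo_loss(-4.0, -9.0, -5.0, -5.0)
```

SFT needs only the policy's own log-probabilities, while DPO also needs the frozen reference's log-probabilities for the same two sequences, which is why the second file loads two models.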

## Toy Task and Convergence

All four scripts load `Qwen/Qwen2.5-0.5B-Instruct`, with extra copies serving as the reference and critic where an algorithm needs them. Each trains for 30 steps on a binary-reward single-digit arithmetic task (e.g., "What is 3 + 8?" → `<answer>11</answer>`) and prints loss, reward, and grad-norm per step. The README documents expected output for each algorithm, including the known DPO over-optimization quirk where rejected log-probs are driven toward −∞ even after the chosen log-probs have saturated.
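A binary reward for this task can be sketched in a few lines. The name `reward_fn` matches the hard-coded function the project mentions, but the signature, the regular expression, and the choice to score the first tagged answer are assumptions made for illustration.

```python
import re

# Matches an integer wrapped in <answer>...</answer>, allowing surrounding spaces.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>")

def reward_fn(completion: str, a: int, b: int) -> float:
    """Return 1.0 if the first tagged answer equals a + b, else 0.0.

    Hypothetical signature: the real script may pass the prompt or the
    expected answer in a different form.
    """
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # missing or malformed tags earn no reward
    return 1.0 if int(match.group(1)) == a + b else 0.0

good = reward_fn("<answer>11</answer>", 3, 8)
untagged = reward_fn("The answer is 11.", 3, 8)
wrong = reward_fn("<answer>12</answer>", 3, 8)
```

Because the reward checks the tag format as well as the number, a correct answer without the `<answer>` wrapper scores the same as a wrong one in this sketch.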

## Scaling Up: GRPO on GSM8K and the Autoresearch Loop

Three companion files scale GRPO from the toy task to GSM8K (grade-school math word problems with verifiable final-answer rewards):

- `gsm8k_grpo.py` — textbook GRPO with reference model, KL penalty, and PPO clip.
- `gsm8k_sft_grpo.py` — standard RLVR pipeline: SFT warm-up on gold solutions followed by GRPO, with three eval checkpoints.
- `gsm8k_grpo_autoresearch.py` — the output of an autonomous overnight experiment loop.

The autoresearch loop ran 82 tuning experiments unattended, following a protocol inspired by karpathy/autoresearch. The key finding: every change that survived the loop *removed* machinery. The surviving changes dropped the reference model and KL term, the PPO clip (which never fires at minibatch=1), and temperature annealing. What remained was plain REINFORCE with a group-relative baseline, statistically tied with the full textbook setup on GSM8K at this scale.
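What remains after those removals can be sketched as follows. This is an illustration on plain Python numbers, not the contents of `gsm8k_grpo_autoresearch.py`; the function names and `eps=0.2` are assumptions, and the baseline here subtracts only the group mean (dividing by the group standard deviation is a common variant that is not shown). The last function illustrates one reading of why the clip never fires: if each batch of rollouts receives a single update, the log-probs being optimized come from the same weights that sampled them, so the probability ratio is exactly 1.

```python
import math

def group_advantages(rewards):
    """Group-relative baseline: subtract the mean reward of the sampled group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def reinforce_loss(seq_logps, advantages):
    """Plain REINFORCE surrogate: -mean(advantage * log-prob), no clip, no KL term."""
    total = sum(a * lp for a, lp in zip(advantages, seq_logps))
    return -total / len(seq_logps)

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one sample (eps is a hypothetical default)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Four completions for one prompt, two of them correct.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
loss = reinforce_loss([-1.0, -2.0, -3.0, -1.0], adv)

# Same weights for sampling and scoring: ratio is 1, so clipping changes nothing.
on_policy = clipped_term(-3.2, -3.2, adv[0])
```

In a real training loop the log-probs would be differentiable tensors and `loss` would be backpropagated. Note also that a group whose completions all receive the same reward yields all-zero advantages and therefore no gradient signal.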

## What Is Intentionally Omitted

nanoRL explicitly skips production RLHF machinery found in TRL, OpenRLHF, veRL, or DeepSpeed-Chat:

- No trained reward model (toy uses a hard-coded `reward_fn`)
- No per-token KL penalty folded into the reward stream
- No distributed training (single process, single device)
- No vLLM rollouts (sequential `torch.multinomial`, slow but readable)
- No advantage whitening in PPO
- No reward-model-based critic initialization

Each omission is documented inline. The README frames adding them back as a tractable exercise.
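As one example of such an exercise, the sketch below pairs a framework-free version of GAE, which the PPO file uses, with the advantage whitening it leaves out. It is a hedged illustration rather than the project's code: the function names, the `gamma` and `lam` defaults, and the treatment of the value after the final step as 0 are all assumptions.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one finished episode.

    `values` holds one critic estimate per step; the value after the last
    step is taken to be 0. gamma and lam are hypothetical defaults.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def whiten(advantages, eps=1e-8):
    """Rescale advantages to zero mean and unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    return [(a - mean) / (var + eps) ** 0.5 for a in advantages]

# Reward arrives only at the final token, as with a sequence-level binary reward.
raw = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
scaled = whiten([1.0, 2.0, 3.0])
```

With `gamma=1.0` and `lam=1.0` the estimate reduces to the observed return minus the critic's value at each step, which makes the example easy to check by hand. Whitening would be applied to the output of `gae` before computing the policy loss.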

## Setup Path

Install dependencies with `uv sync` (recommended; `uv` is a fast Python package manager) or with `pip install -e .`. Each script is independent and runs directly, for example `uv run minimal_sft.py`. The first run downloads the Qwen2.5-0.5B-Instruct model (~1GB). The toy scripts converge in seconds, while GSM8K training takes minutes on a single GPU or M-series Mac.

## Features
- Single-file SFT implementation (~100-180 lines)
- Single-file DPO implementation with reference model
- Single-file GRPO implementation (DeepSeek-R1 style)
- Single-file PPO implementation with separate critic transformer
- Toy arithmetic task convergence in 30 steps on a single GPU
- M-series Mac support via MPS backend
- GSM8K scaling scripts for GRPO
- SFT+GRPO pipeline script for RLVR
- Autoresearch loop output from 82 overnight experiments
- MIT licensed and fully open source
- Compatible with Qwen2.5-0.5B-Instruct and Gemma-3-270m-it
- GAE advantage estimation in PPO
- Group-relative baseline in GRPO
- Inline documentation of all design decisions

## Integrations
Qwen2.5-0.5B-Instruct, Google Gemma-3-270m-it, GSM8K dataset, Hugging Face Transformers, PyTorch, uv package manager

## Platforms
macOS, API, CLI

## Pricing
Open Source

## Version
main

## Links
- Website: https://github.com/ethanhe42/nanoRL
- Documentation: https://github.com/ethanhe42/nanoRL
- Repository: https://github.com/ethanhe42/nanoRL
- EveryDev.ai: https://www.everydev.ai/tools/nanorl
