nanoRL
Minimal, single-file implementations of SFT, DPO, GRPO, and PPO for fine-tuning language models after pretraining, each ~100-180 lines and runnable on a single GPU.
At a Glance
Fully free and open source under the MIT license. No cost to use, modify, or distribute.
Listed Jun 2026
About nanoRL
nanoRL is an open-source educational library by Ethan He that provides minimal, self-contained implementations of the four most common post-pretraining fine-tuning algorithms for language models. Each file is ~100–180 lines of Python, converges on a toy arithmetic task in 30 steps on a single GPU or an M-series Mac via MPS, and is released under the MIT license.
What It Is
nanoRL covers SFT (supervised fine-tuning), DPO (direct preference optimization), GRPO (group relative policy optimization, as used in DeepSeek-R1), and PPO (proximal policy optimization with a separate critic transformer, the InstructGPT setup). The project is explicitly inspired by Andrej Karpathy's nanoGPT and shares the same didactic goal: make each algorithm readable end-to-end in a single file, without the scaffolding of production RLHF stacks.
The Four Algorithms and Their Supervision Axis
The four files are organized along a single axis — what kind of supervision is available:
- SFT: needs full demonstrations (prompt, target); one model loaded; masked cross-entropy loss.
- DPO: needs preference pairs (prompt, chosen, rejected); two models (policy + frozen reference); no rollouts required.
- GRPO: needs only a reward function; two models (policy + reference); group-mean baseline replaces a value model.
- PPO: needs only a reward function; three models (policy, reference, critic transformer); GAE advantage estimation.
Reading them in order shows how each algorithm adds machinery to handle progressively weaker supervision signals.
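To make the difference between the two reward-only algorithms concrete, here is a minimal pure-Python sketch of the two advantage estimators named in the list: a group-mean baseline (GRPO) and GAE over a learned value estimate (PPO). It is illustrative rather than taken from the nanoRL source. The function names, the default gamma and lam values, and the choice to skip standard-deviation normalization are all assumptions.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """GRPO-style advantages: each sample's reward minus the group mean.

    `rewards` holds one scalar per completion sampled for the SAME prompt.
    Assumption: no division by the group standard deviation.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    `values` has len(rewards) + 1 entries: the critic's estimate at every
    step plus a bootstrap value after the last step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum after it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Four completions for one prompt with binary rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# One three-token trajectory rewarded only at the end.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], lam=1.0))
```

The group baseline needs nothing beyond the sampled rewards, while GAE needs a per-step value estimate. That requirement is why the PPO setup carries a third model.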
Toy Task and Convergence
All four scripts load Qwen/Qwen2.5-0.5B-Instruct (or a copy for reference/critic roles), train for 30 steps on a binary-reward 1-digit arithmetic task (e.g., "What is 3 + 8?" → <answer>11</answer>), and print loss/reward/grad-norm per step. The README documents expected output for each algorithm, including the known DPO over-optimization quirk where rejected log-probs are driven toward −∞ even after chosen is saturated.
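The DPO quirk mentioned above follows directly from the shape of the loss. The sketch below uses the standard per-pair DPO loss in pure Python. It is an illustration, not the nanoRL code: the function name, beta=0.1, and the log-prob values are assumptions chosen for the example. It holds the chosen log-prob fixed near zero and lowers only the rejected log-prob.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss computed from summed sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers "chosen" over
    # "rejected" than the frozen reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical numbers: the chosen log-prob is already saturated near 0,
# and only the rejected log-prob keeps falling.
ref_chosen, ref_rejected = -2.0, -2.0
rejected_values = (-5.0, -20.0, -80.0)
losses = [dpo_loss(-0.01, r, ref_chosen, ref_rejected) for r in rejected_values]

for r, loss in zip(rejected_values, losses):
    print(f"rejected log-prob {r:7.1f} -> loss {loss:.4f}")
```

The loss depends only on the margin, so it keeps shrinking as the rejected log-prob falls, even when the chosen log-prob has no room left to improve.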
Scaling Up: GRPO on GSM8K and the Autoresearch Loop
Three companion files scale GRPO from the toy task to GSM8K (grade-school math word problems with verifiable final-answer rewards):
- gsm8k_grpo.py: textbook GRPO with reference model, KL penalty, and PPO clip.
- gsm8k_sft_grpo.py: the standard RLVR pipeline, an SFT warm-up on gold solutions followed by GRPO, with three eval checkpoints.
- gsm8k_grpo_autoresearch.py: the output of an autonomous overnight experiment loop.
The autoresearch loop ran 82 tuning experiments unattended, following a protocol inspired by karpathy/autoresearch. The key finding: every change that survived removed machinery. The loop dropped the reference model and KL term, the PPO clip (which never fires at minibatch=1), and temperature annealing. What remained was plain REINFORCE with a group-relative baseline, statistically tied with the full textbook setup on GSM8K at this scale.
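The simplified objective can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the contents of gsm8k_grpo_autoresearch.py: the function names and eps=0.2 are hypothetical. It also assumes that "minibatch=1" means each batch of rollouts is used for a single gradient step. On that reading, the policy being updated is the one that generated the samples, so the importance ratio is exactly 1.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def reinforce_group_loss(logps, rewards):
    """REINFORCE loss with a group-mean baseline (to be minimized).

    `logps` are summed log-probabilities of completions sampled for one
    prompt; `rewards` are their scalar rewards.
    """
    baseline = sum(rewards) / len(rewards)
    weighted = [(r - baseline) * lp for lp, r in zip(logps, rewards)]
    return -sum(weighted) / len(weighted)


# At ratio == 1 the clip is a no-op: the surrogate is just the advantage.
for adv in (-0.5, 0.5):
    print(adv, clipped_surrogate(1.0, adv))

# The clip only matters once the policy has drifted from the sampling policy.
print(clipped_surrogate(1.5, 1.0))

# Two completions for one prompt: one correct, one wrong.
print(reinforce_group_loss(logps=[-1.0, -2.0], rewards=[1.0, 0.0]))
```

Under that assumption the clipped and unclipped objectives coincide, which is consistent with removing the clip and leaving results unchanged.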
What Is Intentionally Omitted
nanoRL explicitly skips production RLHF machinery found in TRL, OpenRLHF, veRL, or DeepSpeed-Chat:
- No trained reward model (toy uses a hard-coded reward_fn)
- No per-token KL penalty folded into the reward stream
- No distributed training (single process, single device)
- No vLLM rollouts (sequential torch.multinomial, slow but readable)
- No advantage whitening in PPO
- No reward-model-based critic initialization
Each omission is documented inline. The README frames adding them back as a tractable exercise.
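As an illustration of what a hard-coded reward function can look like for the toy addition task, here is a minimal sketch. The name reward_fn comes from the list above. The signature, the regular expression, and the rule that a missing or malformed tag scores zero are assumptions, not the repository's actual code.

```python
import re

# Assumed answer format, matching the toy example: <answer>11</answer>
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>")


def reward_fn(completion: str, a: int, b: int) -> float:
    """Binary reward: 1.0 if the tagged answer equals a + b, otherwise 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # no parsable answer tag
    return 1.0 if int(match.group(1)) == a + b else 0.0


print(reward_fn("3 + 8 is <answer>11</answer>", 3, 8))
print(reward_fn("<answer>12</answer>", 3, 8))
print(reward_fn("The answer is 11.", 3, 8))
```

A verifiable reward like this needs no learned model, which is what lets GRPO and PPO train from a reward function alone.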
Setup Path
Installation requires Python. Install the dependencies with uv sync (recommended; uv is a fast Python package manager) or with pip install -e . from the repository root. Each script is independent and runs directly, for example: uv run minimal_sft.py. The first run downloads the Qwen2.5-0.5B-Instruct model (~1GB); the toy scripts converge in seconds, while GSM8K training takes minutes on a single GPU or M-series Mac.
Pricing
Open Source
- SFT implementation
- DPO implementation
- GRPO implementation
- PPO implementation
- GSM8K scaling scripts
Capabilities
Key Features
- Single-file SFT implementation (~100-180 lines)
- Single-file DPO implementation with reference model
- Single-file GRPO implementation (DeepSeek-R1 style)
- Single-file PPO implementation with separate critic transformer
- Toy arithmetic task convergence in 30 steps on a single GPU
- M-series Mac support via MPS backend
- GSM8K scaling scripts for GRPO
- SFT+GRPO pipeline script for RLVR
- Autoresearch loop output from 82 overnight experiments
- MIT licensed and fully open source
- Compatible with Qwen2.5-0.5B-Instruct and Gemma-3-270m-it
- GAE advantage estimation in PPO
- Group-relative baseline in GRPO
- Inline documentation of all design decisions