nanoRL

Name: nanoRL
Availability: OnlineOnly
Author: Ethan He

Minimal, single-file implementations of SFT, DPO, GRPO, and PPO for fine-tuning language models after pretraining, each ~100-180 lines and runnable on a single GPU.

At a Glance

Pricing

Open Source

Fully free and open source under the MIT license. No cost to use, modify, or distribute.

Available On

macOS

API

CLI

Developer: Ethan He. Ethan He builds minimal, educational implementations of mach…

Listed Jun 2026

About nanoRL

nanoRL is an open-source educational library by Ethan He that provides minimal, self-contained implementations of the four most common post-pretraining fine-tuning algorithms for language models. Each file is ~100–180 lines of Python, converges on a toy arithmetic task in 30 steps on a single GPU or an M-series Mac via MPS, and is released under the MIT license.

What It Is

nanoRL covers SFT (supervised fine-tuning), DPO (direct preference optimization), GRPO (group relative policy optimization, as used in DeepSeek-R1), and PPO (proximal policy optimization with a separate critic transformer, the InstructGPT setup). The project is explicitly inspired by Andrej Karpathy's nanoGPT and shares the same didactic goal: make each algorithm readable end-to-end in a single file, without the scaffolding of production RLHF stacks.

The Four Algorithms and Their Supervision Axis

The four files are organized along a single axis — what kind of supervision is available:

SFT — needs full demonstrations (prompt, target); one model loaded; masked cross-entropy loss.
DPO — needs preference pairs (prompt, chosen, rejected); two models (policy + frozen reference); no rollouts required.
GRPO — needs only a reward function; two models (policy + reference); group-mean baseline replaces a value model.
PPO — needs only a reward function; three models (policy, reference, critic transformer); GAE advantage estimation.

Reading them in order shows how each algorithm adds machinery to handle progressively weaker supervision signals.
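To make the first two rows of that list concrete, here is a minimal, framework-free sketch of the two supervised losses. It is not taken from the nanoRL source: the function names, the plain-list inputs, and the default beta=0.1 are assumptions chosen for illustration, and the real scripts presumably operate on PyTorch tensors instead.

```python
import math


def masked_cross_entropy(token_logprobs, loss_mask):
    """SFT-style loss: mean negative log-likelihood over target tokens only.

    token_logprobs: log-prob the model assigned to each actual next token.
    loss_mask: 1 for target (completion) tokens, 0 for prompt tokens.
    """
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs.

    pi_* come from the policy, ref_* from the frozen reference model.
    beta is an illustrative default, not a value taken from nanoRL.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a form that is stable for large |x|.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy separates chosen from rejected. Because the loss depends only on the margin, it can keep shrinking by pushing the rejected log-prob down, which is consistent with the over-optimization behavior the listing mentions for DPO.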

Toy Task and Convergence

All four scripts load Qwen/Qwen2.5-0.5B-Instruct (plus extra copies for the reference and critic roles where an algorithm needs them), train for 30 steps on a binary-reward 1-digit arithmetic task (e.g., "What is 3 + 8?" → <answer>11</answer>), and print loss/reward/grad-norm per step. The README documents expected output for each algorithm, including a known DPO over-optimization quirk: the log-probs of rejected responses keep being driven toward −∞ even after the chosen responses have saturated.
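A binary reward for this kind of task can be very small. The sketch below is an assumption about what such a function might look like, not the project's actual code: the signature, the regular expression, and the choice to score a missing tag as 0.0 are all hypothetical.

```python
import re

# Matches an integer wrapped in <answer>...</answer>, allowing surrounding spaces.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>")


def reward_fn(completion: str, a: int, b: int) -> float:
    """Return 1.0 if the tagged answer equals a + b, otherwise 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        # No well-formed answer tag: treat as incorrect.
        return 0.0
    return 1.0 if int(match.group(1)) == a + b else 0.0
```

For the prompt "What is 3 + 8?", a completion containing <answer>11</answer> would score 1.0 under this sketch, while a correct number without the tags would score 0.0.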

Scaling Up: GRPO on GSM8K and the Autoresearch Loop

Three companion files scale GRPO from the toy task to GSM8K (grade-school math word problems with verifiable final-answer rewards):

gsm8k_grpo.py — textbook GRPO with reference model, KL penalty, and PPO clip.
gsm8k_sft_grpo.py — standard RLVR pipeline: SFT warm-up on gold solutions followed by GRPO, with three eval checkpoints.
gsm8k_grpo_autoresearch.py — the output of an autonomous overnight experiment loop.

The autoresearch loop ran 82 tuning experiments unattended, following a protocol inspired by karpathy/autoresearch. The key finding was that every change that survived removed machinery: the reference model and KL term were dropped, the PPO clip was dropped (it never fires at minibatch=1), and temperature annealing was dropped. What remained was plain REINFORCE with a group-relative baseline, statistically tied with the full textbook setup on GSM8K at this scale.
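The "plain REINFORCE with a group-relative baseline" that the loop arrived at can be sketched in a few lines. This is an illustrative reconstruction, not the contents of gsm8k_grpo_autoresearch.py: the function names, the optional standard-deviation normalization, and the eps value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-6):
    """Advantage of each sampled completion relative to its own group.

    rewards: rewards for several completions sampled from the same prompt.
    The group mean acts as the baseline, so no value model is needed.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Optional scaling by the group's standard deviation (an assumption here).
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


def reinforce_loss(seq_logprobs, advantages):
    """REINFORCE surrogate with no clip and no KL term.

    seq_logprobs: summed log-prob of each completion under the current policy.
    Returns -mean(A_i * log pi(y_i | x)); minimizing it raises the likelihood
    of above-average completions and lowers that of below-average ones.
    """
    return -mean([a * lp for a, lp in zip(advantages, seq_logprobs)])
```

One property worth noticing: if every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.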

What Is Intentionally Omitted

nanoRL explicitly skips production RLHF machinery found in TRL, OpenRLHF, veRL, or DeepSpeed-Chat:

No trained reward model (toy uses a hard-coded reward_fn)
No per-token KL penalty folded into the reward stream
No distributed training (single process, single device)
No vLLM rollouts (sequential torch.multinomial, slow but readable)
No advantage whitening in PPO
No reward-model-based critic initialization

Each omission is documented inline. The README frames adding them back as a tractable exercise.
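As one example of what such an exercise might involve, the sketch below computes GAE advantages for a single episode and then optionally whitens them. It is a generic illustration, not code from nanoRL: the function names and the defaults gamma=1.0 and lam=0.95 are assumptions.

```python
from statistics import mean, pstdev


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one episode.

    rewards: per-step rewards, length T.
    values: critic estimates, length T + 1; the last entry is the bootstrap
            value (use 0.0 when the episode has terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later deltas.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def whiten(xs, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs)
    return [(x - mu) / (sigma + eps) for x in xs]
```

Whitening would be applied to the output of gae before it enters the PPO loss; whether it helps at this scale is an empirical question the listing does not address.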

Setup Path

Installation requires Python and one of two commands: uv sync (recommended, using the fast uv package manager) or, alternatively, pip install -e . as the fallback. Each script is independent and runs directly, for example: uv run minimal_sft.py. First runs download the Qwen2.5-0.5B-Instruct model (~1 GB); the toy scripts converge in seconds, while GSM8K training takes minutes on a single GPU or M-series Mac.

Pricing

Open Source (free, MIT license) includes:

SFT implementation
DPO implementation
GRPO implementation
PPO implementation
GSM8K scaling scripts

Capabilities

Key Features

Single-file SFT implementation (~100-180 lines)
Single-file DPO implementation with reference model
Single-file GRPO implementation (DeepSeek-R1 style)
Single-file PPO implementation with separate critic transformer
Toy arithmetic task convergence in 30 steps on a single GPU
M-series Mac support via MPS backend
GSM8K scaling scripts for GRPO
SFT+GRPO pipeline script for RLVR
Autoresearch loop output from 82 overnight experiments
MIT licensed and fully open source
Compatible with Qwen2.5-0.5B-Instruct and Gemma-3-270m-it
GAE advantage estimation in PPO
Group-relative baseline in GRPO
Inline documentation of all design decisions

Integrations

Qwen2.5-0.5B-Instruct

Google Gemma-3-270m-it

GSM8K dataset

Hugging Face Transformers

PyTorch

uv package manager

Topics

AI Development Libraries, Human-in-the-Loop Training, LLM Orchestration

Alternatives

rlm, Axolotl, Unsloth
