EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026
    nanoRL

    AI Development Libraries

    Minimal, single-file implementations of SFT, DPO, GRPO, and PPO for fine-tuning language models after pretraining, each ~100-180 lines and runnable on a single GPU.

    At a Glance

    Pricing
    Open Source

    Fully free and open source under the MIT license. No cost to use, modify, or distribute.

    Available On

    macOS
    API
    CLI

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    AI Development Libraries, Human-in-the-Loop Training, LLM Orchestration

    Alternatives

    rlm, Ludwig, agent-proxy-kit
    Developer
    Ethan He

    Listed Jun 2026

    About nanoRL

    nanoRL is an open-source educational library by Ethan He that provides minimal, self-contained implementations of the four most common post-pretraining fine-tuning algorithms for language models. Each file is ~100–180 lines of Python, converges on a toy arithmetic task in 30 steps on a single GPU or an M-series Mac via MPS, and is released under the MIT license.

    What It Is

    nanoRL covers SFT (supervised fine-tuning), DPO (direct preference optimization), GRPO (group relative policy optimization, as used in DeepSeek-R1), and PPO (proximal policy optimization with a separate critic transformer, the InstructGPT setup). The project is explicitly inspired by Andrej Karpathy's nanoGPT and shares the same didactic goal: make each algorithm readable end-to-end in a single file, without the scaffolding of production RLHF stacks.

    The Four Algorithms and Their Supervision Axis

    The four files are organized along a single axis — what kind of supervision is available:

    • SFT — needs full demonstrations (prompt, target); one model loaded; masked cross-entropy loss.
    • DPO — needs preference pairs (prompt, chosen, rejected); two models (policy + frozen reference); no rollouts required.
    • GRPO — needs only a reward function; two models (policy + reference); group-mean baseline replaces a value model.
    • PPO — needs only a reward function; three models (policy, reference, critic transformer); GAE advantage estimation.

    Reading them in order shows how each algorithm adds machinery to handle progressively weaker supervision signals.
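
    To make the DPO entry concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python rather than PyTorch. It is not taken from the nanoRL source: the function names, the argument names, and the beta value are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities from the policy and the frozen reference.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are summed log-probs of a completion given the prompt.
    Names and the default beta are hypothetical, not nanoRL's.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


# Policy identical to reference: margin is 0, so the loss is log(2).
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Lowering only the rejected log-prob still lowers the loss.
pushed = dpo_loss(-5.0, -9.0, -5.0, -7.0)
print(start, pushed)
```

    Because the loss depends only on the margin, it keeps falling when the rejected log-prob drops even if the chosen log-prob does not move, and no rollouts are needed at any point.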

    Toy Task and Convergence

    All four scripts load Qwen/Qwen2.5-0.5B-Instruct (with additional copies serving as the reference and critic where needed), train for 30 steps on a binary-reward single-digit arithmetic task (e.g., "What is 3 + 8?" → <answer>11</answer>), and print loss, reward, and grad-norm per step. The README documents expected output for each algorithm, including the known DPO over-optimization quirk where rejected log-probs are driven toward −∞ even after the chosen log-probs have saturated.
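
    A binary reward for this task could look like the sketch below. It is an assumption about the shape of such a function, not nanoRL's actual code: the signature, the regular expression, and the choice to score a missing tag as 0.0 are all illustrative.

```python
import re

# Matches an integer wrapped in <answer>...</answer>, tolerating inner whitespace.
ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>")


def reward_fn(completion: str, a: int, b: int) -> float:
    """Return 1.0 if the tagged answer equals a + b, else 0.0 (hypothetical signature)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        # No well-formed answer tag: treat as incorrect.
        return 0.0
    return 1.0 if int(match.group(1)) == a + b else 0.0


print(reward_fn("3 + 8 is <answer>11</answer>", 3, 8))
print(reward_fn("<answer>12</answer>", 3, 8))
print(reward_fn("The answer is 11", 3, 8))
```

    A verifiable reward of this kind is all that GRPO and PPO need as supervision, which is why the toy task requires no trained reward model.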

    Scaling Up: GRPO on GSM8K and the Autoresearch Loop

    Three companion files scale GRPO from the toy task to GSM8K (grade-school math word problems with verifiable final-answer rewards):

    • gsm8k_grpo.py — textbook GRPO with reference model, KL penalty, and PPO clip.
    • gsm8k_sft_grpo.py — standard RLVR pipeline: SFT warm-up on gold solutions followed by GRPO, with three eval checkpoints.
    • gsm8k_grpo_autoresearch.py — the output of an autonomous overnight experiment loop.

    The autoresearch loop ran 82 tuning experiments unattended, following a protocol inspired by karpathy/autoresearch. The key finding: every change that survived the loop removed machinery. The surviving changes dropped the reference model and KL term, dropped the PPO clip (which never fires at minibatch=1), and dropped temperature annealing. What remained was plain REINFORCE with a group-relative baseline, statistically tied with the full textbook setup on GSM8K at this scale.
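
    The simplified objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the contents of gsm8k_grpo_autoresearch.py: the function names are hypothetical, the baseline is the plain group mean (no division by the group standard deviation), and each log-prob is the summed log-probability of one sampled completion.

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group-mean reward from each completion's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def reinforce_loss(logps: list[float], rewards: list[float]) -> float:
    """REINFORCE loss for one prompt's group of sampled completions."""
    advantages = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advantages, logps)) / len(logps)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective term, shown only for comparison."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1.0, 0.0, 0.0, 1.0]  # binary rewards for four samples of one prompt
print(group_relative_advantages(rewards))
print(reinforce_loss([-1.0, -2.0], [1.0, 0.0]))
print(clipped_surrogate(1.0, 0.5))
```

    When the probability ratio is exactly 1, the clipped term equals the unclipped advantage, so the clip has no effect; that is the situation in which removing it changes nothing.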

    What Is Intentionally Omitted

    nanoRL explicitly skips production RLHF machinery found in TRL, OpenRLHF, veRL, or DeepSpeed-Chat:

    • No trained reward model (toy uses a hard-coded reward_fn)
    • No per-token KL penalty folded into the reward stream
    • No distributed training (single process, single device)
    • No vLLM rollouts (sequential torch.multinomial, slow but readable)
    • No advantage whitening in PPO
    • No reward-model-based critic initialization

    Each omission is documented inline. The README frames adding them back as a tractable exercise.
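
    As an example of how small one of these additions can be, the sketch below computes GAE advantages and then whitens them. It is a plain-Python illustration rather than nanoRL code: the function names, the default gamma and lam values, and the convention that values carries one extra bootstrap entry are assumptions.

```python
import math


def gae(rewards: list[float], values: list[float],
        gamma: float = 1.0, lam: float = 0.95) -> list[float]:
    """Generalized advantage estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value (0.0 when the trajectory ends at a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def whiten(xs: list[float], eps: float = 1e-8) -> list[float]:
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Reward arrives only on the final token; the critic predicts 0 everywhere.
adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv)
print(whiten([1.0, 2.0, 3.0]))
```

    Whitening would be applied to the advantages of a batch before the policy loss is computed; leaving it out keeps the PPO file shorter at the cost of advantage scales that vary from batch to batch.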

    Setup Path

    Installation requires Python. Install dependencies with either uv sync (recommended; uses the uv package manager) or pip install -e . from the repository root. Each script is independent and runs directly, for example: uv run minimal_sft.py. The first run downloads the Qwen2.5-0.5B-Instruct model (~1GB); the toy scripts converge in seconds, while GSM8K training takes minutes on a single GPU or M-series Mac.

    Pricing

    Open Source

    Fully free and open source under the MIT license. No cost to use, modify, or distribute.

    • SFT implementation
    • DPO implementation
    • GRPO implementation
    • PPO implementation
    • GSM8K scaling scripts

    Capabilities

    Key Features

    • Single-file SFT implementation (~100-180 lines)
    • Single-file DPO implementation with reference model
    • Single-file GRPO implementation (DeepSeek-R1 style)
    • Single-file PPO implementation with separate critic transformer
    • Toy arithmetic task convergence in 30 steps on a single GPU
    • M-series Mac support via MPS backend
    • GSM8K scaling scripts for GRPO
    • SFT+GRPO pipeline script for RLVR
    • Autoresearch loop output from 82 overnight experiments
    • MIT licensed and fully open source
    • Compatible with Qwen2.5-0.5B-Instruct and Gemma-3-270m-it
    • GAE advantage estimation in PPO
    • Group-relative baseline in GRPO
    • Inline documentation of all design decisions

    Integrations

    Qwen2.5-0.5B-Instruct
    Google Gemma-3-270m-it
    GSM8K dataset
    Hugging Face Transformers
    PyTorch
    uv package manager

    Developer

    Ethan He

    Ethan He builds minimal, educational implementations of machine learning algorithms and publishes them as open-source projects on GitHub. His work focuses on making complex reinforcement learning and fine-tuning techniques readable and accessible to practitioners. nanoRL follows the didactic tradition of nanoGPT, distilling production RLHF algorithms into single self-contained files.


    Similar Tools

    rlm

    A reinforcement learning library for training language models, providing tools and utilities for RL-based fine-tuning of LLMs.

    Ludwig

    Ludwig is a low-code, declarative deep learning framework for building custom AI models including LLMs and neural networks using YAML configuration files.

    agent-proxy-kit

    A lightweight TypeScript library that normalizes long-running agent and LLM streams into stable, provider-agnostic progress events for UI rendering.

    Related Topics

    AI Development Libraries

    Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

    198 tools

    Human-in-the-Loop Training

    Platforms that connect organizations with vetted human experts to annotate, label, evaluate, and align AI models, ensuring high-quality training datasets and accurate model evaluation through human judgment.

    29 tools

    LLM Orchestration

    Platforms and frameworks for designing, managing, and deploying complex LLM workflows with visual interfaces, allowing for the coordination of multiple AI models and services.

    139 tools