# Miles

> Enterprise-grade reinforcement learning framework for large-scale LLM and VLM post-training, featuring high-performance rollout, low-precision training, and production stability.

Miles is an open-source reinforcement learning framework built for enterprise-scale post-training of large language models (LLMs) and vision-language models (VLMs). It is a fork of the slime project, developed jointly by InfiXAI, Ant Group, the SGLang RL Team, and the Miles community. The project launched in November 2025, is actively maintained, and is released under the Apache License 2.0.

## What It Is

Miles sits at the intersection of research-grade RL and production-grade reliability. It integrates SGLang for high-throughput rollout and Megatron-LM for scalable distributed training, targeting the system-level challenges that cause instability and inefficiency when applying reinforcement learning to models at the 1TB+ scale. The framework is designed as a unified entry point for complex RL workloads, including multi-turn interaction, vision-language training, reasoning, coding agents, and multi-agent co-evolution.

## Core Technical Architecture

Miles addresses several fundamental problems in large-scale RL training through system-level innovations:

- **Unified FP8 Pipeline**: End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy between rollout and training, preventing RL collapse in large MoE models.
- **Rollout Routing Replay (R3)**: Records expert routing decisions during SGLang inference and replays them during Megatron training to ensure bit-wise expert alignment in MoE architectures like Qwen3 and DeepSeek-V3.
- **INT4 QAT Support**: Full-stack INT4 W4A16 (4-bit weights, 16-bit activations) Quantization-Aware Training pipeline, inspired by the Kimi K2-Thinking report. According to the project, it lets 1TB-scale models fit into single-machine VRAM (e.g., NVIDIA H200) and doubles rollout efficiency.
- **Zero-Copy Weight Sync**: Optimized weight refit via CUDA IPC zero-copy mapping, async tensor gathering, and bucketed flattening, reducing sync time by 50% compared to standard HTTP/RPC transfers (per project documentation).
- **Speculative RL Training**: Uses an Online SFT Draft Model that updates during RL to prevent policy drift, achieving 25%+ rollout speedup according to the project's own benchmarks.
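
The routing-replay idea in the list above can be illustrated with a minimal, dependency-free Python sketch. This is not Miles code: `top_k_experts`, `moe_forward`, and the toy logits are hypothetical names and values chosen for this example, and the experts are reduced to scalars. It shows how a tiny numerical difference between two engines can flip the selected expert, and how replaying the recorded selection removes that mismatch while the gate weights are still computed from the training engine's own logits.

```python
import math

def top_k_experts(logits, k):
    """Return indices of the k largest router logits (ties broken by index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def moe_forward(router_logits, expert_outputs, k, replay=None):
    """Mix expert outputs for a single token.

    router_logits:  per-expert scores computed by this engine.
    expert_outputs: per-expert scalar outputs (stand-ins for expert MLPs).
    replay:         optional expert indices recorded by another engine; when
                    given, they override this engine's own top-k selection.
    """
    chosen = list(replay) if replay is not None else top_k_experts(router_logits, k)
    # Only the discrete selection is replayed. Gate weights are a softmax
    # over this engine's own logits, restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    output = sum(e / total * expert_outputs[i] for e, i in zip(exps, chosen))
    return output, chosen

# Toy data: the two engines disagree slightly on the top two router logits.
rollout_logits = [2.00, 1.99, 0.10, -1.00]   # as seen by the inference engine
train_logits = [1.99, 2.00, 0.10, -1.00]     # as seen by the training engine
expert_outputs = [10.0, -10.0, 0.0, 0.0]

_, recorded = moe_forward(rollout_logits, expert_outputs, k=1)
_, free_choice = moe_forward(train_logits, expert_outputs, k=1)
_, replayed = moe_forward(train_logits, expert_outputs, k=1, replay=recorded)

print("rollout picked:", recorded)
print("training without replay picked:", free_choice)
print("training with replay picked:", replayed)
```

Without replay, the training pass routes the token to a different expert than the one that produced the sampled output; with replay, both passes use the same expert.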

## Model Support and Training Scenarios

Miles supports a broad range of state-of-the-art model families:

- **DeepSeek**: R1, V3, V3.2
- **Qwen**: 2, 2.5, 3
- **Llama**: 3, 3.1, 3.3, 4
- **Gemma**: 2, 3, 3N
- **GLM**: 4.5, 4.6, 4.7
- **MiniMax**: M2, M2.1
- **Others**: Mistral, Mixtral, Phi, gpt-oss, and any model supported by SGLang and Megatron

Training scenarios span multi-turn interaction, unified VLM/LLM workflows, reasoning and coding tasks, and multi-agent co-evolutionary frameworks such as MrlX.

## Setup Path

Miles recommends using its official Docker image for best performance and compatibility. It can also be installed from source via pip. Training is launched through a unified `train.py` entry point with command-line arguments for configuring cluster resources, training backends (Megatron/FSDP), SGLang inference optimization, and RL algorithmic hyperparameters. A detailed argument guide and Quick Start documentation are available in the repository's `docs/` directory.

## Update: Active Development Through Early 2026

The project has seen rapid iteration since its November 2025 launch. Notable recent additions include:
- **[2026/02]** Detailed command-line argument guide for Miles server configuration
- **[2026/01]** INT4 QAT pipeline for single-machine 1TB model training
- **[2026/01]** Unified VLM/LLM multi-turn training support
- **[2026/01]** MrlX multi-agent co-evolutionary framework integration
- **[2025/12]** Rollout Routing Replay (R3) for MoE RL stability
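
For readers unfamiliar with the W4A16 scheme behind the INT4 QAT entry, the following is a minimal sketch of symmetric per-group INT4 "fake quantization" in plain Python. It is an assumption-laden illustration rather than the Miles pipeline: the function name `fake_quant_int4`, the group size, and the scale choice are all hypothetical, and real implementations operate on tensors with packed 4-bit storage.

```python
def fake_quant_int4(weights, group_size=4):
    """Quantize weights to signed 4-bit codes per group, then dequantize.

    Each group shares one scale so that its largest magnitude maps to 7.
    Returns (codes, dequantized); codes lie in the INT4 range [-8, 7].
    """
    codes, dequantized = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # avoid division by zero
        for w in group:
            q = max(-8, min(7, round(w / scale)))    # round, then clamp to INT4
            codes.append(q)
            dequantized.append(q * scale)            # value the forward pass sees
    return codes, dequantized

# Toy weights (illustrative values only), split into two groups of four.
weights = [0.31, -0.82, 0.05, 0.44, 1.90, -0.07, 0.63, -1.21]
codes, dequantized = fake_quant_int4(weights)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))

print("codes:", codes)
print("max abs rounding error:", max_error)
```

In quantization-aware training, the forward pass would typically use the dequantized values in place of the original weights while gradients update a higher-precision copy (a straight-through estimator); that training loop is omitted here.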

The roadmap lists planned support for Diffusion RL, Omni RL, Diffusion LLM RL, and elastic resource scheduling. The repository had 1,378 stars and 220 forks as of late May 2026, per GitHub metadata.

## Features
- Unified FP8 end-to-end training and rollout pipeline
- INT4 Quantization-Aware Training (QAT) for 1TB+ models
- Rollout Routing Replay (R3) for MoE RL stability
- Zero-copy weight synchronization via CUDA IPC
- Speculative RL training with Online SFT Draft Model
- Multi-turn LLM and VLM training support
- Multi-agent co-evolutionary RL (MrlX)
- Truncated and Masked Importance Sampling (TIS/MIS)
- Partial rollout and over-sampling for long-tail RL
- Support for DeepSeek, Qwen, Llama, Gemma, GLM, MiniMax, Mistral, Phi
- SGLang integration for high-throughput rollout
- Megatron-LM integration for scalable distributed training
- FSDP training backend support
- Docker image for production deployment
- Detailed command-line argument configuration
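
As a rough illustration of the Truncated and Masked Importance Sampling entry, the sketch below shows one common formulation of per-token correction weights for the mismatch between the rollout engine's probabilities and the training engine's probabilities. The function names, the `cap` parameter, and the sample log-probabilities are hypothetical and chosen for this example; the exact variants and thresholds Miles uses may differ.

```python
import math

def importance_ratio(logp_train, logp_rollout):
    """Ratio of training-engine to rollout-engine probability for one token."""
    return math.exp(logp_train - logp_rollout)

def truncated_is_weight(logp_train, logp_rollout, cap=2.0):
    """Truncated IS: clip the ratio from above at `cap`."""
    return min(importance_ratio(logp_train, logp_rollout), cap)

def masked_is_weight(logp_train, logp_rollout, cap=2.0):
    """Masked IS: drop (zero out) tokens whose ratio exceeds `cap`."""
    ratio = importance_ratio(logp_train, logp_rollout)
    return ratio if ratio <= cap else 0.0

# Per-token (train, rollout) log-probabilities; illustrative values only.
token_logps = [(-1.0, -1.0), (-0.5, -0.7), (0.0, -5.0)]

for logp_train, logp_rollout in token_logps:
    tis = truncated_is_weight(logp_train, logp_rollout)
    mis = masked_is_weight(logp_train, logp_rollout)
    print(f"train={logp_train:+.2f} rollout={logp_rollout:+.2f} "
          f"TIS={tis:.3f} MIS={mis:.3f}")
```

Each weight would multiply the corresponding token's contribution to the policy-gradient loss. Truncation bounds the influence of badly mismatched tokens, while masking removes them entirely.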

## Integrations
SGLang, Megatron-LM, FSDP, FlashAttention-3, DeepGEMM, NVIDIA Transformer Engine, Docker, CUDA IPC, MrlX, slime

## Platforms
CLI, API, Developer SDK

## Pricing
Open Source

## Version
main

## Links
- Website: https://github.com/radixark/miles
- Documentation: https://www.radixark.com/miles/docs
- Repository: https://github.com/radixark/miles
- EveryDev.ai: https://www.everydev.ai/tools/miles-rl