# TMax

> An open-source research codebase for training, evaluating, and deploying simple yet powerful terminal-using LLM agents, covering data generation, SFT, and RL training pipelines.

TMax is an open-source project from AllenAI (Allen Institute for AI) focused on building simple, powerful terminal-using agents. Released under the Apache 2.0 license, the codebase covers the full lifecycle of terminal agent development: synthetic data generation, supervised fine-tuning (SFT), reinforcement learning (RL) training, and evaluation against benchmarks like Terminal-Bench and SWE-bench.

## What It Is

TMax is a research framework for training LLM-based agents that interact with a terminal (bash shell) to complete tasks. The project trains a series of models, referred to as the "tmax" series, and provides the tooling needed to reproduce or extend that work. It is accompanied by a paper on arXiv (2606.23321) and a blog post from the WAI organization. The codebase is written primarily in Python and uses `uv` for dependency management.

## Four-Stage Pipeline Architecture

The repository is organized around four distinct stages:

- **Data generation** (`rl_data/`): A scalable, diversity-aware pipeline that synthesizes terminal-agent tasks by sampling from structured compositional axes. Tasks are packaged as self-contained Apptainer/Docker environments with programmatic verifiers, attempted multiple times to measure pass@k, and published to Hugging Face Hub.
- **Agent** (`Vanillux2Agent/`): A direct LiteLLM agent built on the vanillux prompt harness (derived from mini-SWE-agent prompts), with a bash tool schema, submit marker, format-error recovery, and output truncation. It executes commands through Harbor's active environment.
- **Training** (`training/open-instruct/`): A fork of AllenAI's open-instruct repository with fixes for Qwen 3.5 and terminal-agent training. SFT and DPPO RL launch scripts for tmax models are provided under `training/open-instruct/scripts/tmax/`.
- **Evaluation** (`scripts/` + `beaker_configs/`): Shell/Slurm launchers and a Beaker pipeline that serves a model with vLLM and runs Harbor datasets against it.
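
To make the "compositional axes" idea in the data-generation stage concrete, here is a minimal sketch of diversity-aware sampling. It is not the TMax implementation: the axis names and values in `AXES`, the function `sample_task_specs`, and the greedy "least-used values first" rule are all assumptions chosen for illustration, since this summary does not list the real axes or the sampling strategy.

```python
import itertools
import random
from collections import Counter

# Hypothetical axes for illustration only; the real pipeline's axes are not
# described in this document.
AXES = {
    "domain": ["file-management", "text-processing", "networking"],
    "tool": ["grep", "awk", "tar"],
    "difficulty": ["easy", "medium", "hard"],
}


def sample_task_specs(axes, n, seed=0):
    """Draw n distinct axis combinations, greedily preferring under-used values."""
    rng = random.Random(seed)
    names = list(axes)
    remaining = list(itertools.product(*(axes[name] for name in names)))
    if n > len(remaining):
        raise ValueError("n exceeds the number of distinct combinations")
    rng.shuffle(remaining)  # randomizes tie-breaking between equally good combos

    usage = Counter()  # how often each (axis, value) pair has been chosen
    chosen = []
    for _ in range(n):
        # Pick the combination whose values have been used least so far.
        best = min(
            remaining,
            key=lambda combo: sum(usage[pair] for pair in zip(names, combo)),
        )
        remaining.remove(best)
        usage.update(zip(names, best))
        chosen.append(dict(zip(names, best)))
    return chosen


specs = sample_task_specs(AXES, n=9)
for spec in specs:
    print(spec)
```

Each resulting spec would then be handed to whatever step turns a combination into a concrete task and verifier. The greedy rule spreads samples across axis values but does not guarantee a perfectly balanced design.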

## Task Data and the Harbor Ecosystem

TMax ships a full **15k task corpus** in Harbor format, published on the Harbor registry as `tmax/TMax-15K-Harbor`. This corpus combines a legacy 10k set of self-contained tasks with 5k newer intricate multi-modal tasks. Every task includes a self-contained Harbor environment and a programmatic verifier, enabling any agent or model to be evaluated directly without regenerating data. The Harbor framework supports both local Docker and cloud-based Daytona sandbox execution.

## Requirements and Setup Path

Running TMax requires:
- `uv` for Python dependency management
- An LLM API key (e.g., `GEMINI_API_KEY`) or a local vLLM/Ollama/OpenAI-compatible endpoint
- `apptainer` on `PATH` for building and running task containers (data generation only)
- A Docker Hub login and personal access token for training at scale
- `HF_TOKEN` for Hugging Face upload and gated model access
- A container runtime (Docker or Daytona) for evaluating on the published Harbor dataset
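
The requirements above can be checked before starting a long run. The following is a small sketch, not a script shipped with TMax: the function name `check_prerequisites` is hypothetical, and treating `docker` on `PATH` as the evaluation runtime and `GEMINI_API_KEY` as the API key are assumptions taken from the examples in the list (a Daytona setup or a local endpoint would make those two warnings irrelevant).

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # Command-line tools named in the requirements list.
    if which("uv") is None:
        problems.append("uv not found on PATH")
    if which("apptainer") is None:
        problems.append("apptainer not found on PATH (data generation only)")
    if which("docker") is None:
        problems.append("docker not found on PATH (not needed if Daytona is used)")

    # Credentials supplied through environment variables.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set (fine with a local endpoint)")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.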

The quickstart involves running `uv sync`, then using provided shell scripts to generate tasks, solve them, analyze pass@k statistics, train models, and run evaluations.
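
For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator: given `n` attempts on a task of which `c` passed the verifier, pass@k is estimated as 1 - C(n-c, k) / C(n, k). This document does not say which estimator the TMax analysis scripts use, so treat this as background rather than a description of the repository's code; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n attempts with c successes.

    Equals the probability that at least one of k attempts, drawn without
    replacement from the n recorded attempts, is a success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: three tasks, eight attempts each, with 0, 2, and 8 successes.
results = [(8, 0), (8, 2), (8, 8)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=4))
```

Tasks with a pass rate strictly between 0 and 1 are often the most informative for RL training, which is one reason per-task pass@k statistics are worth inspecting before training.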

## Update: Initial Release

The repository was created in March 2026 and last updated in June 2026. The initial release includes the codebase, the models, and the accompanying arXiv paper ("Tmax: A simple recipe for terminal agents"). The authors include Hamish Ivison, Junjie Oscar Yin, Rulin Shao, Teng Xiao, Nathan Lambert, and Hannaneh Hajishirzi. Models and datasets are published on Hugging Face under the `allenai/tmax` collection.

## Features
- Terminal-using LLM agent training and evaluation
- Compositional synthetic task data generation pipeline
- Pass@k task solving with programmatic verifiers
- SFT and DPPO RL training via open-instruct fork
- Vanillux2Agent with bash tool schema and format-error recovery
- 15k Harbor task corpus with self-contained environments
- vLLM model serving integration
- Beaker and Slurm evaluation pipeline
- Daytona and Docker sandbox support
- Hugging Face Hub dataset publishing

## Integrations
Hugging Face Hub, vLLM, LiteLLM, Harbor framework, Daytona, Docker, Apptainer, Beaker, Slurm, open-instruct, Qwen 3.5, Terminal-Bench, SWE-bench

## Platforms
CLI, API, DEVELOPER_SDK

## Pricing
Open Source

## Version
initial release

## Links
- Website: https://github.com/hamishivi/tmax
- Documentation: https://github.com/hamishivi/tmax/blob/master/rl_data/README.md
- Repository: https://github.com/hamishivi/tmax
- EveryDev.ai: https://www.everydev.ai/tools/tmax