An open-source research codebase for training, evaluating, and deploying simple yet powerful terminal-using LLM agents, covering data generation, SFT, and RL training pipelines.
At a Glance
Fully free and open-source under Apache 2.0. Self-host and use the codebase, models, and datasets at no cost.
Listed Jun 2026
About TMax
TMax is an open-source project from AllenAI (Allen Institute for AI) focused on building simple, powerful terminal-using agents. Released under the Apache 2.0 license, the codebase covers the full lifecycle of terminal agent development: synthetic data generation, supervised fine-tuning (SFT), reinforcement learning (RL) training, and evaluation against benchmarks like Terminal-Bench and SWE-bench.
What It Is
TMax is a research framework for training LLM-based agents that interact with a terminal (bash shell) to complete tasks. The project trains a series of models, referred to as the "tmax" series, and provides all the tooling needed to reproduce or extend that work. It is accompanied by a paper on arXiv (2606.23321) and a blog post from the WAI organization. The codebase is written primarily in Python, with dependencies managed by uv.
Four-Stage Pipeline Architecture
The repository is organized around four distinct stages:
- Data generation (rl_data/): A scalable, diversity-aware pipeline that synthesizes terminal-agent tasks by sampling from structured compositional axes. Tasks are packaged as self-contained Apptainer/Docker environments with programmatic verifiers, then solved at pass@k and published to Hugging Face Hub.
- Agent (Vanillux2Agent/): A direct LiteLLM agent built on the vanillux prompt harness (derived from mini-SWE-agent prompts), with a bash tool schema, submit marker, format-error recovery, and output truncation. It executes commands through Harbor's active environment.
- Training (training/open-instruct/): A fork of AllenAI's open-instruct repository with fixes for Qwen 3.5 and terminal-agent training. SFT and DPPO RL launch scripts for tmax models are provided under training/open-instruct/scripts/tmax/.
- Evaluation (scripts/ + beaker_configs/): Shell/Slurm launchers and a Beaker pipeline that serves a model with vLLM and runs Harbor datasets against it.
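The agent stage names three behaviors (output truncation, format-error recovery, and a submit marker) that are easy to picture in code. The sketch below is an illustration only, not the Vanillux2Agent implementation: the marker string, the character limit, the function names, and the assumption that tool arguments arrive as a JSON object with a "command" field are all hypothetical.

```python
import json

# Hypothetical values chosen for illustration; TMax's actual marker and limit may differ.
SUBMIT_MARKER = "TASK_COMPLETE"
MAX_OUTPUT_CHARS = 2000


def truncate_output(text, limit=MAX_OUTPUT_CHARS):
    """Keep the head and tail of long command output and elide the middle."""
    if len(text) <= limit:
        return text
    half = limit // 2
    elided = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]
    return f"{head}\n... [{elided} characters elided] ...\n{tail}"


def parse_tool_call(raw_arguments):
    """Return (command, error); exactly one of the two is None.

    A malformed call yields an error message that can be sent back to the
    model so it can retry, instead of crashing the episode.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, "Format error: tool arguments are not valid JSON (" + exc.msg + ")."
    if not isinstance(args, dict) or not isinstance(args.get("command"), str):
        return None, 'Format error: expected a JSON object with a string "command" field.'
    return args["command"], None


def is_submission(command_output):
    """Treat output whose first line equals the marker as the final submission."""
    lines = command_output.strip().splitlines()
    return bool(lines) and lines[0].strip() == SUBMIT_MARKER
```

In a loop built on these helpers, a non-None error from parse_tool_call would be appended to the conversation as the tool result, and the episode would end once is_submission returns True.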
Task Data and the Harbor Ecosystem
TMax ships a full 15k task corpus in Harbor format, published on the Harbor registry as tmax/TMax-15K-Harbor. This corpus combines a legacy 10k set of self-contained tasks with 5k newer intricate multi-modal tasks. Every task includes a self-contained Harbor environment and a programmatic verifier, enabling any agent or model to be evaluated directly without regenerating data. The Harbor framework supports both local Docker and cloud-based Daytona sandbox execution.
Requirements and Setup Path
Running TMax requires:
- uv for Python dependency management
- An LLM API key (e.g., GEMINI_API_KEY) or a local vLLM/Ollama/OpenAI-compatible endpoint
- apptainer on PATH for building and running task containers (data generation only)
- A Dockerhub login and personal access token for training at scale
- HF_TOKEN for Hugging Face upload and gated model access
- A container runtime (Docker or Daytona) for evaluating on the published Harbor dataset
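A quick preflight check can confirm most of these requirements before any script is launched. The following is a minimal sketch, not part of the TMax repository: the function name is hypothetical, and OPENAI_BASE_URL is only an assumed stand-in for "a local OpenAI-compatible endpoint", since the listing does not say which variable TMax reads. Dockerhub credentials and Daytona access are not checked.

```python
import os
import shutil

# Assumption: this variable stands in for a local OpenAI-compatible endpoint.
LOCAL_ENDPOINT_VAR = "OPENAI_BASE_URL"


def check_requirements(env=None, which=shutil.which, data_generation=False):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if which("uv") is None:
        problems.append("uv is not on PATH")
    if not (env.get("GEMINI_API_KEY") or env.get(LOCAL_ENDPOINT_VAR)):
        problems.append("no LLM API key or local endpoint is configured")
    # apptainer is only needed when generating data.
    if data_generation and which("apptainer") is None:
        problems.append("apptainer is not on PATH")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    # Docker is one of the two listed runtimes; Daytona is the alternative.
    if which("docker") is None:
        problems.append("docker is not on PATH")
    return problems
```

Calling check_requirements(data_generation=True) and printing each returned string gives a short list of what still needs to be installed or exported.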
The quickstart involves running uv sync, then using provided shell scripts to generate tasks, solve them, analyze pass@k statistics, train models, and run evaluations.
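For readers unfamiliar with pass@k, the sketch below shows one common way to compute it: the unbiased estimator popularized by code-generation evaluations, where n attempts are sampled per task and c of them pass the verifier. Whether TMax's analysis scripts use this exact estimator is an assumption; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k sampled attempts passes.

    n: total attempts sampled for the task
    c: number of those attempts that passed the verifier
    k: budget of attempts being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # 1 minus the chance that all k drawn attempts come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 4 attempts of which 2 passed, pass_at_k(4, 2, 2) is 1 - 1/6, since only one of the six possible pairs consists of two failures.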
Update: Initial Release
The repository was created in March 2026 and last updated in June 2026, with the initial release of the codebase, models, and the accompanying arXiv paper ("Tmax: A simple recipe for terminal agents"). The authors include Hamish Ivison, Junjie Oscar Yin, Rulin Shao, Teng Xiao, Nathan Lambert, and Hannaneh Hajishirzi. Models and datasets are published on Hugging Face under the allenai/tmax collection.
Pricing
Open Source
- Full codebase access under Apache 2.0
- Data generation pipeline
- SFT and RL training scripts
- Evaluation pipeline (Terminal-Bench, SWE-bench)
- 15k Harbor task corpus
Capabilities
Key Features
- Terminal-using LLM agent training and evaluation
- Compositional synthetic task data generation pipeline
- Pass@k task solving with programmatic verifiers
- SFT and DPPO RL training via open-instruct fork
- Vanillux2Agent with bash tool schema and format-error recovery
- 15k Harbor task corpus with self-contained environments
- vLLM model serving integration
- Beaker and Slurm evaluation pipeline
- Daytona and Docker sandbox support
- Hugging Face Hub dataset publishing
