Toolathlon
Toolathlon is an open-source benchmark for evaluating language agents on diverse, realistic, and long-horizon tool-use tasks across 32 software applications and 604 tools.
At a Glance
Fully free and open-source benchmark available on GitHub. Includes public evaluation service, self-hosted setup, and all benchmark tasks.
Engagement
Available On
Alternatives
Listed May 2026
About Toolathlon
Toolathlon is a research benchmark developed by the HKUST NLP group to assess how well language agents can use tools in realistic, multi-step workflows. It was accepted at ICLR 2026 and covers 32 software applications, 604 tools, and 108 manually sourced or crafted tasks. The project is hosted on GitHub and provides both a self-hosted evaluation path and a ready-to-use public evaluation service.
What It Is
Toolathlon (formally "The Tool Decathlon") is an execution-based benchmark designed to stress-test language agents on long-horizon, multi-application tasks. Unlike narrow benchmarks that test single-tool calls, each Toolathlon task requires approximately 20 interaction turns on average and spans multiple applications simultaneously. Tasks are strictly verifiable through dedicated evaluation scripts, making results reproducible and comparable across models.
Benchmark Scope and Task Design
The benchmark spans a wide range of software environments, from everyday platforms such as Google Calendar and Notion to professional tools like WooCommerce, Kubernetes, and BigQuery. The 108 tasks are grouped into thematic categories including Campus & Study, Tech & Dev, Finance & Market, Office & Business, and Shopping & E-commerce. Example tasks include grading homework submissions by downloading them from email and running them against Canvas, deploying a Kubernetes PR preview, and syncing warehouse inventory to a WooCommerce store.
- 32 software applications covered
- 604 tools available to agents
- ~20 interaction turns required per task on average
- Evaluation is execution-based with dedicated verification scripts
Leaderboard and Model Coverage
The project website publishes a live leaderboard. According to the leaderboard data, top-performing models as of mid-2026 include Gemini-3.5-Flash (Pass@1: 56.5%), GPT-5.5-xhigh (55.6%), DeepSeek-V4-Pro Max (52.8%), and Claude-Opus-4.7 (52.8%). Both proprietary and open-source models are tracked. Trajectory data for evaluated models is published on Hugging Face at hkust-nlp/Toolathlon-Trajectories.
Deployment and Evaluation Paths
Toolathlon supports four evaluation modes:
- Public evaluation service: A hosted server where MCP accounts are pre-configured; users only need an OpenAI-compatible API endpoint.
- Self-hosted: Full local setup using Docker/Podman, uv, and deployed application containers.
- Dedicated service: Available by contacting the authors for high-volume users.
- API-endpoint testing: Authors can run evaluation on behalf of users given an API endpoint.
The benchmark also supports a decoupled agent loop mode, where the task environment runs in a container but the agent scaffold runs on the host. Supported agent frameworks include toolathlon_default (based on the OpenAI Agents SDK) and claude_agent_sdk. OpenHands integration is available on a separate branch.
Update: ICLR 2026 Acceptance and Recent Activity
The repository was created in October 2025 and last updated in May 2026. The project was accepted at ICLR 2026. Recent news entries from the repository include trajectory data for four new models (gemini-3-pro, claude-4.5-opus, gpt-5.1, deepseek-v3.2-thinking) added in December 2025, a public evaluation service launched in November 2025, and a new documentation page for common issues and update logs set up in December 2025. The repository had 363 stars and 40 forks as of the last recorded update.
Community Discussions
Be the first to start a conversation about Toolathlon
Share your experience with Toolathlon, ask questions, or help others learn from your insights.
Pricing
Open Source
Fully free and open-source benchmark available on GitHub. Includes public evaluation service, self-hosted setup, and all benchmark tasks.
- Access to all 108 benchmark tasks
- Public evaluation service
- Self-hosted evaluation option
- Trajectory visualization tool
- Hugging Face trajectory dataset
Capabilities
Key Features
- 600+ diverse tools across 32 software applications
- 108 manually sourced or crafted long-horizon tasks
- Execution-based evaluation with dedicated verification scripts
- Public evaluation service (no setup required)
- Self-hosted evaluation via Docker/Podman
- Decoupled agent loop supporting multiple agent frameworks
- Parallel task execution with container isolation
- Trajectory visualization tool
- Live leaderboard with Pass@1, Pass@3 metrics
- Hugging Face trajectory dataset
- Multi-instance configuration support
- OpenHands compatibility branch
