Toolathlon

Name: Toolathlon
Availability: Online only
Author: HKUST NLP

Toolathlon is an open-source benchmark for evaluating language agents on diverse, realistic, and long-horizon tool-use tasks across 32 software applications and 604 tools.

At a Glance

Pricing

Open Source

Fully free and open-source benchmark available on GitHub. Includes public evaluation service, self-hosted setup, and all benchmark tasks.

Available On

macOS

Linux

Web

API

CLI

HKUST NLP is the natural language processing research group…

Listed May 2026

About Toolathlon

Toolathlon is a research benchmark developed by the HKUST NLP group to assess how well language agents can use tools in realistic, multi-step workflows. It was accepted at ICLR 2026 and covers 32 software applications, 604 tools, and 108 manually sourced or crafted tasks. The project is hosted on GitHub and provides both a self-hosted evaluation path and a ready-to-use public evaluation service.

What It Is

Toolathlon (formally "The Tool Decathlon") is an execution-based benchmark designed to stress-test language agents on long-horizon, multi-application tasks. Unlike narrow benchmarks that test single-tool calls, each Toolathlon task takes about 20 interaction turns on average and spans multiple applications. Tasks are strictly verifiable through dedicated evaluation scripts, making results reproducible and comparable across models.

Benchmark Scope and Task Design

The benchmark spans a wide range of software environments, from everyday platforms such as Google Calendar and Notion to professional tools like WooCommerce, Kubernetes, and BigQuery. The 108 tasks are grouped into thematic categories including Campus & Study, Tech & Dev, Finance & Market, Office & Business, and Shopping & E-commerce. Example tasks include grading homework submissions by downloading them from email and running them against Canvas, deploying a Kubernetes PR preview, and syncing warehouse inventory to a WooCommerce store.

32 software applications covered
604 tools available to agents
~20 interaction turns required per task on average
Evaluation is execution-based with dedicated verification scripts
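Execution-based evaluation means a task is scored by inspecting the state the agent leaves behind, not by reading its transcript. The Python sketch below illustrates the idea with the inventory-sync example. The `Task` class, `check_inventory_synced`, and the SKU data are hypothetical illustrations, not Toolathlon's actual verification code.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical final state of a store: product SKU -> stock quantity.
StoreState = Dict[str, int]


@dataclass
class Task:
    """A task paired with a script-like checker over the final state."""
    name: str
    check: Callable[[StoreState], bool]


def check_inventory_synced(store: StoreState) -> bool:
    """Pass only if the store matches the (made-up) warehouse counts exactly."""
    warehouse = {"sku-1": 5, "sku-2": 0}
    return store == warehouse


def evaluate(task: Task, final_state: StoreState) -> bool:
    """Score a run by its end state; the agent's messages are never consulted."""
    return task.check(final_state)


sync_task = Task(name="sync-inventory", check=check_inventory_synced)
```

Because the checker looks only at the end state, two agents that reach the same correct state through different tool calls receive the same score.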

Leaderboard and Model Coverage

The project website publishes a live leaderboard. According to the leaderboard data, top-performing models as of mid-2026 include Gemini-3.5-Flash (Pass@1: 56.5%), GPT-5.5-xhigh (55.6%), DeepSeek-V4-Pro Max (52.8%), and Claude-Opus-4.7 (52.8%). Both proprietary and open-source models are tracked. Trajectory data for evaluated models is published on Hugging Face at hkust-nlp/Toolathlon-Trajectories.
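The listing does not say how the leaderboard computes its Pass@k figures. One common definition, sketched below in Python, is the unbiased pass@k estimator widely used in code-generation evaluation, averaged over tasks. The function names and the sample results are assumptions for illustration only.

```python
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs, drawn from n runs with c successes, passes."""
    if k > n:
        raise ValueError("k cannot exceed the number of runs n")
    if n - c < k:
        # Fewer than k failures exist, so any k runs include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Average pass@k over tasks; results maps task name -> per-run outcomes."""
    scores = [pass_at_k(len(runs), sum(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Made-up outcomes for two tasks, three runs each.
sample = {
    "task-a": [True, False, False],
    "task-b": [False, False, False],
}
```

With the made-up `sample` data, `benchmark_pass_at_k(sample, 1)` averages 1/3 and 0, while `benchmark_pass_at_k(sample, 3)` averages 1 and 0.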

Deployment and Evaluation Paths

Toolathlon supports four evaluation modes:

Public evaluation service: A hosted server where MCP accounts are pre-configured; users only need an OpenAI-compatible API endpoint.
Self-hosted: Full local setup using Docker/Podman, uv, and deployed application containers.
Dedicated service: Available by contacting the authors for high-volume users.
API-endpoint testing: Authors can run evaluation on behalf of users given an API endpoint.

The benchmark also supports a decoupled agent loop mode, where the task environment runs in a container but the agent scaffold runs on the host. Supported agent frameworks include toolathlon_default (based on the OpenAI Agents SDK) and claude_agent_sdk. OpenHands integration is available on a separate branch.
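The decoupled mode can be pictured as a loop on the host that talks to the environment only through tool calls. The Python sketch below is a toy illustration of that separation. `ToyEnvironment`, `scripted_policy`, and `run_agent` are hypothetical stand-ins, and everything runs in one process here, whereas the real environment sits behind a container boundary.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A tool call is a (tool name, arguments) pair; None means the agent is done.
Action = Optional[Tuple[str, Dict[str, str]]]


class ToyEnvironment:
    """Stands in for the containerized task environment (hypothetical)."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def call_tool(self, name: str, args: Dict[str, str]) -> str:
        if name == "add_note":
            self.notes.append(args["text"])
            return "ok"
        return f"unknown tool: {name}"


def scripted_policy(history: List[str]) -> Action:
    """Stands in for the model: make one tool call, then stop."""
    if not history:
        return ("add_note", {"text": "hello"})
    return None


def run_agent(env: ToyEnvironment,
              policy: Callable[[List[str]], Action],
              max_turns: int = 20) -> List[str]:
    """Host-side loop: ask the policy for an action, forward it, record the result."""
    history: List[str] = []
    for _ in range(max_turns):
        action = policy(history)
        if action is None:
            break
        name, args = action
        history.append(env.call_tool(name, args))
    return history
```

Keeping the loop outside the environment is what would let different scaffolds be swapped in without rebuilding the task containers. In this sketch only `policy` and `run_agent` would change.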

Update: ICLR 2026 Acceptance and Recent Activity

The repository was created in October 2025 and last updated in May 2026. The project was accepted at ICLR 2026. Recent news entries from the repository include a public evaluation service launched in November 2025, followed in December 2025 by trajectory data for four new models (gemini-3-pro, claude-4.5-opus, gpt-5.1, deepseek-v3.2-thinking) and a new documentation page for common issues and update logs. The repository had 363 stars and 40 forks as of the last recorded update.

Pricing

Open Source

Access to all 108 benchmark tasks
Public evaluation service
Self-hosted evaluation option
Trajectory visualization tool
Hugging Face trajectory dataset

Capabilities

Key Features

604 diverse tools across 32 software applications
108 manually sourced or crafted long-horizon tasks
Execution-based evaluation with dedicated verification scripts
Public evaluation service (no setup required)
Self-hosted evaluation via Docker/Podman
Decoupled agent loop supporting multiple agent frameworks
Parallel task execution with container isolation
Trajectory visualization tool
Live leaderboard with Pass@1 and Pass@3 metrics
Hugging Face trajectory dataset
Multi-instance configuration support
OpenHands compatibility branch

Integrations

Google Calendar

Notion

WooCommerce

Kubernetes

BigQuery

Canvas LMS

MinIO

OpenAI Agents SDK

Claude Agent SDK

OpenHands

vLLM

SGLang

OpenRouter

Anthropic API

Docker

Podman

Hugging Face

API Available
