
    AgentBench

    LLM Evaluations

    AgentBench is an open-source benchmark framework for evaluating LLMs as autonomous agents across 8 diverse environments including OS, database, web, and knowledge graph tasks.


At a Glance

Pricing: Open Source (Apache License 2.0)
Available On: macOS, API, SDK, CLI
Resources: Website · Docs · GitHub · llms.txt
Topics: LLM Evaluations · Agent Frameworks · Autonomous Systems
Alternatives: LangChain · Agent Reading Test · PandaProbe
Developer: THUDM
Listed: May 2026

    About AgentBench

AgentBench is the first comprehensive benchmark designed to evaluate large language models (LLMs) as autonomous agents across a diverse spectrum of environments. Published at ICLR'24, it comprises 8 distinct task environments — Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing — to rigorously assess LLM agent capabilities. The framework supports fully containerized deployment via Docker Compose and integrates with AgentRL for end-to-end multitask, multi-turn LLM agent reinforcement learning.
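The interaction pattern behind all eight environments is the same multi-turn loop: the environment emits an observation, the model replies with an action, and the episode ends with a success signal and a reward. The sketch below is illustrative only; `Environment`, `call_llm`, and `run_episode` are hypothetical stand-ins, not AgentBench's actual API.

```python
# Illustrative only -- not AgentBench's real API. `Environment` and
# `call_llm` are hypothetical stand-ins for the benchmark's Docker task
# workers and model client.
from dataclasses import dataclass

@dataclass
class Environment:
    """Toy task: the agent must reply with the single word 'done'."""
    max_turns: int = 5

    def observe(self) -> str:
        return "Reply with the single word: done"

    def step(self, action: str) -> tuple[bool, float]:
        # Real environments execute shell commands, SQL queries, web
        # actions, etc., and score the outcome.
        ok = action.strip().lower() == "done"
        return ok, 1.0 if ok else 0.0

def call_llm(messages: list[dict]) -> str:
    # Stand-in for a chat-completion call; always "answers" correctly.
    return "done"

def run_episode(env: Environment) -> float:
    messages = [{"role": "system", "content": "You are an autonomous agent."}]
    for _ in range(env.max_turns):
        messages.append({"role": "user", "content": env.observe()})
        action = call_llm(messages)
        messages.append({"role": "assistant", "content": action})
        finished, reward = env.step(action)
        if finished:
            return reward
    return 0.0  # ran out of turns

print(run_episode(Environment()))  # 1.0
```

Real task workers replace the toy `step` with shell execution, SQL, or browser actions, and `call_llm` with a chat-completion call (see the Integrations sketch further down).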

    Key Features:

    • 8 Diverse Evaluation Environments — Tests agents across OS interaction, database querying, knowledge graph traversal, web shopping, web browsing, card games, lateral thinking puzzles, and house-holding tasks for comprehensive coverage.
    • AgentBench FC (Function Calling) — The latest version integrates function-calling style prompts and fully containerized deployment via Docker Compose, built on the AgentRL framework for multi-turn RL training.
    • Leaderboard — A public leaderboard tracks and compares performance of proprietary and open LLMs (GPT-4, Claude, open-source models) across all task environments.
    • Docker-Based Task Workers — Each task environment runs in isolated Docker containers, enabling reproducible and scalable benchmarking with configurable concurrency.
    • Extensible Architecture — Researchers can add new tasks following the Extension Guide, making it easy to expand the benchmark to new agent scenarios.
    • VisualAgentBench Integration — Companion benchmark for evaluating visual foundation agents across embodied, GUI, and visual design environments using large multimodal models.
    • Quick Start with Presets — Lite presets allow evaluation on laptops with limited RAM; full presets support high-concurrency multi-worker deployments.
    • Open Source under Apache 2.0 — Freely available to use, modify, and distribute; community contributions and result submissions are welcomed via Google Groups and Slack.

    To get started, clone the repository, set up a Python 3.9 conda environment, install dependencies, pull the required Docker images, configure your LLM API key, and launch task workers and the assigner using the provided scripts.
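As a rough driver for those steps, here is a sketch in Python. The repository URL is real; the `src.start_task` and `src.assigner` entry points follow the project README at the time of writing, but verify the exact commands, Docker image names, and config paths against the repo before running anything.

```python
# Rough driver for the quick-start steps. The repository URL is real; the
# `src.start_task` / `src.assigner` entry points follow the project README
# at the time of writing -- verify against the repo before running.
import subprocess

def run(cmd: list[str], **kwargs) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

run(["git", "clone", "https://github.com/THUDM/AgentBench.git"])
run(["conda", "create", "-y", "-n", "agent-bench", "python=3.9"])
run(["conda", "run", "-n", "agent-bench", "pip", "install",
     "-r", "requirements.txt"], cwd="AgentBench")
# Next: pull the prebuilt Docker task images (names vary per environment;
# see the docs) and put your LLM API key in the agent config. Then launch
# the task workers and the assigner:
run(["conda", "run", "-n", "agent-bench", "python", "-m",
     "src.start_task", "-a"], cwd="AgentBench")
run(["conda", "run", "-n", "agent-bench", "python", "-m",
     "src.assigner"], cwd="AgentBench")
```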



    Pricing

Open Source

    Fully free and open-source under Apache License 2.0. Free to use, modify, and distribute.

    • 8 diverse agent evaluation environments
    • AgentBench FC (Function Calling) support
    • Docker-based containerized deployment
    • Public leaderboard access
    • AgentRL integration

Capabilities

    • 8 diverse agent evaluation environments
    • Function-calling benchmark (AgentBench FC)
    • Docker-based containerized task workers
    • Public leaderboard for LLM comparison
    • Multi-turn interaction evaluation
    • AgentRL integration for RL training
    • VisualAgentBench for multimodal agents
    • Extensible task framework
    • Dev and Test dataset splits
    • Lite preset for low-resource machines

    Integrations

    OpenAI GPT (gpt-3.5-turbo, gpt-4)
    Docker
    Docker Compose
    Redis
    MySQL
    AgentRL
    ALFWorld
    WebShop
    Mind2Web
    Freebase
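The OpenAI integration boils down to pointing the benchmark at a chat-completion backend. Below is a minimal sketch using the v1-style `openai` Python SDK; the function name `call_llm` matches this page's earlier stand-in, not AgentBench's config format.

```python
# Hypothetical chat-completion backend for the loop sketch above, using
# the v1-style `openai` SDK. Reads OPENAI_API_KEY from the environment;
# `call_llm` is this page's stand-in name, not AgentBench's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def call_llm(messages: list[dict], model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content or ""
```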


    Developer

    THUDM

    THUDM (Tsinghua University Data Mining group) builds large-scale AI research tools and benchmarks, including AgentBench and AgentRL. The group develops open-source frameworks for evaluating and training LLM-based agents across diverse real-world environments. Their work spans language model evaluation, reinforcement learning for agents, and multimodal AI systems, with publications at top venues like ICLR.

Website · GitHub · X / Twitter

    Similar Tools


    LangChain

    LangChain provides LangSmith, an agent engineering platform, and open source frameworks (LangChain, LangGraph, deepagents) to help developers observe, evaluate, and deploy AI agents in production.


    Agent Reading Test

    A benchmark that tests how well AI coding agents can read web content, surfacing silent failure modes like truncation, CSS burial, SPA shells, and broken markdown parsing.


    PandaProbe

    Open source agent engineering platform providing traces, evals, metrics, and live monitoring to debug and improve AI agents.


    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.
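As a concrete illustration of the LLM-as-a-judge pattern mentioned above, here is a minimal sketch; the rubric wording and score parsing are invented for the example, and `call_llm` is the hypothetical backend from the sketches earlier on this page.

```python
# Minimal LLM-as-a-judge sketch (pattern illustration only). The rubric
# and score parsing are invented; `call_llm` is the hypothetical
# chat-completion backend from the sketches above.
def judge(question: str, answer: str) -> int:
    prompt = (
        "Rate the answer to the question on a 1-5 scale for correctness "
        "and relevance. Reply with only the integer.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    raw = call_llm([{"role": "user", "content": prompt}])
    digits = [c for c in raw if c.isdigit()]
    return int(digits[0]) if digits else 1

print(judge("What is 2 + 2?", "4"))  # e.g. 5
```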

    65 tools

    Agent Frameworks

    Tools and platforms for building and deploying custom AI agents.

    260 tools

    Autonomous Systems

    AI agents that can perform complex tasks with minimal human guidance.

    184 tools