AgentBench
AgentBench is an open-source benchmark framework for evaluating LLMs as autonomous agents across 8 diverse environments including OS, database, web, and knowledge graph tasks.
About AgentBench
AgentBench is the first comprehensive benchmark designed to evaluate large language models (LLMs) as autonomous agents across a diverse spectrum of environments. Published at ICLR'24, it comprises 8 distinct task environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing. The framework supports fully containerized deployment via Docker Compose and integrates with AgentRL for end-to-end multi-task, multi-turn LLM agent reinforcement learning.
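Since deployment is containerized, bringing the benchmark up follows the usual Docker Compose workflow. The sketch below uses only standard Docker CLI commands and assumes a compose file at the repository root; the actual entry point and service layout are defined in the AgentBench README, so treat this as illustrative rather than the project's exact invocation.

```bash
# Illustrative Docker Compose workflow; assumes a compose file at the repository root.
docker compose up -d       # start the containerized task environments in the background
docker compose ps          # check that the task-environment containers are running
docker compose logs -f     # follow container logs during an evaluation run
docker compose down        # tear the environments down when finished
```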
Key Features:
- 8 Diverse Evaluation Environments — Tests agents across OS interaction, database querying, knowledge graph traversal, web shopping, web browsing, card games, lateral thinking puzzles, and house-holding tasks for comprehensive coverage.
- AgentBench FC (Function Calling) — The latest version integrates function-calling style prompts and fully containerized deployment via Docker Compose, built on the AgentRL framework for multi-turn RL training.
- Leaderboard — A public leaderboard tracks and compares performance of proprietary and open LLMs (GPT-4, Claude, open-source models) across all task environments.
- Docker-Based Task Workers — Each task environment runs in isolated Docker containers, enabling reproducible and scalable benchmarking with configurable concurrency.
- Extensible Architecture — Researchers can add new tasks following the Extension Guide, making it easy to expand the benchmark to new agent scenarios.
- VisualAgentBench Integration — Companion benchmark for evaluating visual foundation agents across embodied, GUI, and visual design environments using large multimodal models.
- Quick Start with Presets — Lite presets allow evaluation on laptops with limited RAM; full presets support high-concurrency multi-worker deployments.
- Open Source under Apache 2.0 — Freely available to use, modify, and distribute; community contributions and result submissions are welcomed via Google Groups and Slack.
To get started, clone the repository, set up a Python 3.9 conda environment, install dependencies, pull the required Docker images, configure your LLM API key, and launch task workers and the assigner using the provided scripts.
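A minimal sketch of that flow, assuming an OpenAI-style API key, is shown below. The script names, config paths, and image names are paraphrased from the project's quick-start and may not match the current repository exactly; defer to the official README for the authoritative commands.

```bash
# Sketch of the quick-start flow; exact script and config names may differ from the repo.
git clone https://github.com/THUDM/AgentBench.git
cd AgentBench

# Create and activate a Python 3.9 environment, then install dependencies.
conda create -n agent-bench python=3.9 -y
conda activate agent-bench
pip install -r requirements.txt

# Pull the task-worker Docker images listed in the README (image names omitted here).

# Provide the LLM API key for the agent under evaluation (assumed OpenAI-style here).
export OPENAI_API_KEY="sk-..."

# Start the task workers, then run the assigner from a second shell.
python -m src.start_task -a
python -m src.assigner
```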
Pricing
Open Source
Fully free and open-source under Apache License 2.0. Free to use, modify, and distribute.
- 8 diverse agent evaluation environments
- AgentBench FC (Function Calling) support
- Docker-based containerized deployment
- Public leaderboard access
- AgentRL integration
Capabilities
Key Features
- 8 diverse agent evaluation environments
- Function-calling benchmark (AgentBench FC)
- Docker-based containerized task workers
- Public leaderboard for LLM comparison
- Multi-turn interaction evaluation
- AgentRL integration for RL training
- VisualAgentBench for multimodal agents
- Extensible task framework
- Dev and Test dataset splits
- Lite preset for low-resource machines
