# AgentBench

> AgentBench is an open-source benchmark framework for evaluating LLMs as autonomous agents across 8 diverse environments including OS, database, web, and knowledge graph tasks.

AgentBench is the first comprehensive benchmark designed to evaluate large language models (LLMs) as autonomous agents across a diverse spectrum of environments. Published at ICLR'24, it covers 8 task environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing. The framework supports fully containerized deployment via Docker Compose and integrates with AgentRL for end-to-end multitask, multi-turn LLM agent reinforcement learning.

**Key Features:**

- **8 Diverse Evaluation Environments** — *Tests agents across OS interaction, database querying, knowledge graph traversal, web shopping, web browsing, card games, lateral thinking puzzles, and house-holding tasks for comprehensive coverage.*
- **AgentBench FC (Function Calling)** — *The latest version integrates function-calling style prompts and fully containerized deployment via Docker Compose, built on the AgentRL framework for multi-turn RL training.*
- **Leaderboard** — *A public leaderboard tracks and compares performance of proprietary and open LLMs (GPT-4, Claude, open-source models) across all task environments.*
- **Docker-Based Task Workers** — *Each task environment runs in isolated Docker containers, enabling reproducible and scalable benchmarking with configurable concurrency.*
- **Extensible Architecture** — *Researchers can add new tasks following the Extension Guide, making it easy to expand the benchmark to new agent scenarios.*
- **VisualAgentBench Integration** — *Companion benchmark for evaluating visual foundation agents across embodied, GUI, and visual design environments using large multimodal models.*
- **Quick Start with Presets** — *Lite presets allow evaluation on laptops with limited RAM; full presets support high-concurrency multi-worker deployments.*
- **Open Source under Apache 2.0** — *Freely available to use, modify, and distribute; community contributions and result submissions are welcomed via Google Groups and Slack.*

To get started, clone the repository, set up a Python 3.9 conda environment, install dependencies, pull the required Docker images, configure your LLM API key, and launch task workers and the assigner using the provided scripts.
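Before working through these steps, it can help to confirm that the tools they rely on are present. The following is a minimal sketch, not part of AgentBench: the `preflight` and `missing` helpers are hypothetical, and the tool list (`git`, `conda`, `docker`) is an assumption inferred from the steps above. Run it inside the activated conda environment so the Python version check reflects that environment.

```python
import shutil
import sys

# Tools implied by the setup steps above; the exact list is an assumption.
REQUIRED_TOOLS = ("git", "conda", "docker")
# Python version named in the setup steps.
TARGET_PYTHON = (3, 9)


def preflight(tools=REQUIRED_TOOLS, which=shutil.which, version=None):
    """Return a dict mapping each check name to True (ok) or False (failed)."""
    if version is None:
        version = sys.version_info[:2]
    # A tool counts as available if it can be found on PATH.
    report = {tool: which(tool) is not None for tool in tools}
    # The interpreter check passes only on an exact major.minor match.
    report["python"] = tuple(version) == TARGET_PYTHON
    return report


def missing(report):
    """List the checks that failed, in sorted order."""
    return sorted(name for name, ok in report.items() if not ok)


if __name__ == "__main__":
    result = preflight()
    for name, ok in sorted(result.items()):
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real system. The actual install and launch commands are documented in the repository and are deliberately not reproduced here.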

## Features
- 8 diverse agent evaluation environments
- Function-calling benchmark (AgentBench FC)
- Docker-based containerized task workers
- Public leaderboard for LLM comparison
- Multi-turn interaction evaluation
- AgentRL integration for RL training
- VisualAgentBench for multimodal agents
- Extensible task framework
- Dev and Test dataset splits
- Lite preset for low-resource machines
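
To illustrate what multi-turn interaction evaluation means in general, here is a toy sketch. It does not use AgentBench's API or task format: `GuessEnv`, `bisect_agent`, `run_episode`, and `success_rate` are all hypothetical names, and a simple bisection routine stands in for an LLM. The idea shown is only the loop itself: the agent acts, the environment responds, and the episode is scored on whether the goal is reached within a turn budget.

```python
from dataclasses import dataclass


@dataclass
class GuessEnv:
    """Toy environment (hypothetical): the agent must find a hidden number."""
    target: int
    low: int = 0
    high: int = 15

    def reset(self):
        return f"Guess a number between {self.low} and {self.high}."

    def step(self, guess):
        """Return (observation, done) for one agent action."""
        if guess == self.target:
            return "correct", True
        return ("higher" if guess < self.target else "lower"), False


def bisect_agent(history, low, high):
    """Stand-in for an LLM: narrows the range using past feedback."""
    for guess, feedback in history:
        if feedback == "higher":
            low = max(low, guess + 1)
        elif feedback == "lower":
            high = min(high, guess - 1)
    return (low + high) // 2


def run_episode(env, agent, max_turns):
    """Run one multi-turn episode; return True if solved within max_turns."""
    env.reset()
    history = []
    for _ in range(max_turns):
        action = agent(history, env.low, env.high)
        observation, done = env.step(action)
        history.append((action, observation))
        if done:
            return True
    return False


def success_rate(targets, max_turns):
    """Fraction of episodes solved within the turn budget."""
    results = [run_episode(GuessEnv(t), bisect_agent, max_turns) for t in targets]
    return sum(results) / len(results)
```

Calling `success_rate(range(16), 5)` runs one episode per hidden number. Tightening `max_turns` lowers the score, which is the kind of sensitivity a turn-limited evaluation is meant to expose.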

## Integrations
OpenAI GPT (gpt-3.5-turbo, gpt-4), Docker, Docker Compose, Redis, MySQL, AgentRL, ALFWorld, WebShop, Mind2Web, Freebase

## Platforms
macOS, API, Developer SDK, CLI

## Pricing
Open Source

## Version
v0.2

## Links
- Website: https://github.com/THUDM/AgentBench
- Documentation: https://github.com/THUDM/AgentBench/blob/main/docs/Introduction_en.md
- Repository: https://github.com/THUDM/AgentBench
- EveryDev.ai: https://www.everydev.ai/tools/agentbench
