
    AgentBench

    LLM Evaluations

    AgentBench is an open-source benchmark framework for evaluating LLMs as autonomous agents across 8 diverse environments including OS, database, web, and knowledge graph tasks.


At a Glance

Pricing: Open Source (Apache License 2.0)
Available On: macOS, API, SDK, CLI
Resources: Website · Docs · GitHub · llms.txt
Topics: LLM Evaluations · Agent Frameworks · Autonomous Systems
Alternatives: LangChain · Agent Reading Test · PandaProbe
Developer: THUDM
Listed: May 2026

    About AgentBench

AgentBench is the first comprehensive benchmark designed to evaluate large language models (LLMs) as autonomous agents across a diverse spectrum of environments. Published at ICLR'24, it comprises 8 distinct task environments — Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing — to rigorously assess LLM agent capabilities. The framework supports fully containerized deployment via Docker Compose and integrates with AgentRL for end-to-end multitask, multi-turn LLM agent reinforcement learning.
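The interaction pattern behind all eight environments is the same multi-turn loop: the environment emits an observation, the model replies with an action, and the episode ends with a success signal and a reward. The sketch below is illustrative only; `Environment`, `call_llm`, and `run_episode` are hypothetical stand-ins, not AgentBench's actual API.

```python
# Illustrative only -- not AgentBench's real API. `Environment` and
# `call_llm` are hypothetical stand-ins for the benchmark's Docker task
# workers and model client.
from dataclasses import dataclass

@dataclass
class Environment:
    """Toy task: the agent must reply with the single word 'done'."""
    max_turns: int = 5

    def observe(self) -> str:
        return "Reply with the single word: done"

    def step(self, action: str) -> tuple[bool, float]:
        # Real environments execute shell commands, SQL queries, web
        # actions, etc., and score the outcome.
        ok = action.strip().lower() == "done"
        return ok, 1.0 if ok else 0.0

def call_llm(messages: list[dict]) -> str:
    # Stand-in for a chat-completion call; always "answers" correctly.
    return "done"

def run_episode(env: Environment) -> float:
    messages = [{"role": "system", "content": "You are an autonomous agent."}]
    for _ in range(env.max_turns):
        messages.append({"role": "user", "content": env.observe()})
        action = call_llm(messages)
        messages.append({"role": "assistant", "content": action})
        finished, reward = env.step(action)
        if finished:
            return reward
    return 0.0  # ran out of turns

print(run_episode(Environment()))  # 1.0
```

Real task workers replace the toy `step` with shell execution, SQL, or browser actions, and `call_llm` with a chat-completion call (see the Integrations sketch further down).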

    Key Features:

    • 8 Diverse Evaluation Environments — Tests agents across OS interaction, database querying, knowledge graph traversal, web shopping, web browsing, card games, lateral thinking puzzles, and house-holding tasks for comprehensive coverage.
    • AgentBench FC (Function Calling) — The latest version integrates function-calling style prompts and fully containerized deployment via Docker Compose, built on the AgentRL framework for multi-turn RL training.
    • Leaderboard — A public leaderboard tracks and compares performance of proprietary and open LLMs (GPT-4, Claude, open-source models) across all task environments.
    • Docker-Based Task Workers — Each task environment runs in isolated Docker containers, enabling reproducible and scalable benchmarking with configurable concurrency.
    • Extensible Architecture — Researchers can add new tasks following the Extension Guide, making it easy to expand the benchmark to new agent scenarios.
    • VisualAgentBench Integration — Companion benchmark for evaluating visual foundation agents across embodied, GUI, and visual design environments using large multimodal models.
    • Quick Start with Presets — Lite presets allow evaluation on laptops with limited RAM; full presets support high-concurrency multi-worker deployments.
    • Open Source under Apache 2.0 — Freely available to use, modify, and distribute; community contributions and result submissions are welcomed via Google Groups and Slack.

    To get started, clone the repository, set up a Python 3.9 conda environment, install dependencies, pull the required Docker images, configure your LLM API key, and launch task workers and the assigner using the provided scripts.
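As a rough driver for those steps, here is a sketch in Python. The repository URL is real; the `src.start_task` and `src.assigner` entry points follow the project README at the time of writing, but verify the exact commands, Docker image names, and config paths against the repo before running anything.

```python
# Rough driver for the quick-start steps. The repository URL is real; the
# `src.start_task` / `src.assigner` entry points follow the project README
# at the time of writing -- verify against the repo before running.
import subprocess

def run(cmd: list[str], **kwargs) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

run(["git", "clone", "https://github.com/THUDM/AgentBench.git"])
run(["conda", "create", "-y", "-n", "agent-bench", "python=3.9"])
run(["conda", "run", "-n", "agent-bench", "pip", "install",
     "-r", "requirements.txt"], cwd="AgentBench")
# Next: pull the prebuilt Docker task images (names vary per environment;
# see the docs) and put your LLM API key in the agent config. Then launch
# the task workers and the assigner:
run(["conda", "run", "-n", "agent-bench", "python", "-m",
     "src.start_task", "-a"], cwd="AgentBench")
run(["conda", "run", "-n", "agent-bench", "python", "-m",
     "src.assigner"], cwd="AgentBench")
```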



    Pricing

Open Source

    Fully free and open-source under Apache License 2.0. Free to use, modify, and distribute.

    • 8 diverse agent evaluation environments
    • AgentBench FC (Function Calling) support
    • Docker-based containerized deployment
    • Public leaderboard access
    • AgentRL integration

Capabilities

    • 8 diverse agent evaluation environments
    • Function-calling benchmark (AgentBench FC)
    • Docker-based containerized task workers
    • Public leaderboard for LLM comparison
    • Multi-turn interaction evaluation
    • AgentRL integration for RL training
    • VisualAgentBench for multimodal agents
    • Extensible task framework
    • Dev and Test dataset splits
    • Lite preset for low-resource machines

    Integrations

    OpenAI GPT (gpt-3.5-turbo, gpt-4)
    Docker
    Docker Compose
    Redis
    MySQL
    AgentRL
    ALFWorld
    WebShop
    Mind2Web
    Freebase
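The OpenAI integration boils down to pointing the benchmark at a chat-completion backend. Below is a minimal sketch using the v1-style `openai` Python SDK; the function name `call_llm` matches this page's earlier stand-in, not AgentBench's config format.

```python
# Hypothetical chat-completion backend for the loop sketch above, using
# the v1-style `openai` SDK. Reads OPENAI_API_KEY from the environment;
# `call_llm` is this page's stand-in name, not AgentBench's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def call_llm(messages: list[dict], model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content or ""
```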


    Developer

    THUDM

    THUDM (Tsinghua University Data Mining group) builds large-scale AI research tools and benchmarks, including AgentBench and AgentRL. The group develops open-source frameworks for evaluating and training LLM-based agents across diverse real-world environments. Their work spans language model evaluation, reinforcement learning for agents, and multimodal AI systems, with publications at top venues like ICLR.

Website · GitHub · X / Twitter

    Similar Tools


    LangChain

    LangChain provides LangSmith, an agent engineering platform, and open source frameworks (LangChain, LangGraph, deepagents) to help developers observe, evaluate, and deploy AI agents in production.


    Agent Reading Test

    A benchmark that tests how well AI coding agents can read web content, surfacing silent failure modes like truncation, CSS burial, SPA shells, and broken markdown parsing.


    PandaProbe

    Open source agent engineering platform providing traces, evals, metrics, and live monitoring to debug and improve AI agents.


    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.
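As a concrete illustration of the LLM-as-a-judge pattern mentioned above, here is a minimal sketch; the rubric wording and score parsing are invented for the example, and `call_llm` is the hypothetical backend from the sketches earlier on this page.

```python
# Minimal LLM-as-a-judge sketch (pattern illustration only). The rubric
# and score parsing are invented; `call_llm` is the hypothetical
# chat-completion backend from the sketches above.
def judge(question: str, answer: str) -> int:
    prompt = (
        "Rate the answer to the question on a 1-5 scale for correctness "
        "and relevance. Reply with only the integer.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    raw = call_llm([{"role": "user", "content": prompt}])
    digits = [c for c in raw if c.isdigit()]
    return int(digits[0]) if digits else 1

print(judge("What is 2 + 2?", "4"))  # e.g. 5
```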

    65 tools

    Agent Frameworks

    Tools and platforms for building and deploying custom AI agents.

    260 tools

    Autonomous Systems

    AI agents that can perform complex tasks with minimal human guidance.

    184 tools