EveryDev.ai
    terminal-bench

    LLM Evaluations

    Terminal-Bench is an open-source benchmark suite for evaluating AI agents' ability to complete complex tasks in terminal environments, built on the Harbor framework.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under Apache License 2.0. Use, modify, and distribute freely.

    Available On

    Linux
    Web
    API
    CLI

    Resources

    Website
    Docs
    GitHub
    llms.txt

    Topics

    LLM Evaluations
    Autonomous Systems
    Agent Frameworks

    Alternatives

    AgentBench
    Agentic Harness Engineering
    Halluminate

    Developer

    Harbor Framework Team
    San Francisco, CA
    Est. 2024
    $100M raised

    Listed Jun 2026

    About terminal-bench

    Terminal-Bench is a collection of Harbor-native benchmarks designed to help agent developers quantify how well AI agents perform complex tasks in terminal environments. It is described as a Stanford × Laude collaboration and is freely available as open-source software under the Apache 2.0 license. The project provides both a growing task dataset and an evaluation harness (Harbor) for running agents against those tasks in sandboxed Docker environments.

    What It Is

    Terminal-Bench sits in the AI agent evaluation category, specifically targeting terminal-use agents — systems that interact with computers through a command-line interface. The benchmark measures task resolution rate: whether an agent can successfully complete a given terminal task from start to finish. Tasks are hand-crafted, human-verified, and each ships with a dedicated Docker environment, a reference solution, and automated test cases. The evaluation harness, Harbor, is the official runner and is itself open-source under Apache 2.0.
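
    The resolution-rate metric described above can be illustrated with a small sketch. It assumes that a task counts as resolved only when every one of its automated test cases passes; the TrialResult class and its field names are hypothetical and are not Harbor's actual result schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one agent attempt at one task (hypothetical schema)."""
    task_id: str
    tests_passed: int
    tests_total: int

    @property
    def resolved(self) -> bool:
        # Assumption: a task is resolved only if every test case passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def resolution_rate(results: list[TrialResult]) -> float:
    """Fraction of trials in which the task was fully resolved."""
    if not results:
        raise ValueError("no trial results to score")
    return sum(r.resolved for r in results) / len(results)


# Made-up example data: two of the four tasks pass all of their tests.
results = [
    TrialResult("task-a", tests_passed=4, tests_total=4),
    TrialResult("task-b", tests_passed=3, tests_total=4),
    TrialResult("task-c", tests_passed=2, tests_total=2),
    TrialResult("task-d", tests_passed=0, tests_total=5),
]
print(f"{resolution_rate(results):.0%}")
```

    Under this all-or-nothing assumption, a trial that passes three of four tests contributes nothing to the score, which is why partial progress does not show up in a resolution rate.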

    Benchmark Versions and Task Coverage

    Terminal-Bench spans several benchmark versions, some released and some still in development:

    • Terminal-Bench 1.0 — the original release with 80 tasks testing terminal task completion
    • Terminal-Bench 2.0 — 89 high-quality tasks spanning software engineering, machine learning, security, data science, and more; currently the primary leaderboard version
    • Terminal-Bench 2.1 — an improved version of 2.0, inspired by Z.ai's Terminal-Bench 2.0 Verified
    • Terminal-Bench 3.0 — in development; described as the next frontier benchmark
    • Terminal-Bench Science — in development; a domain-specific benchmark for scientific computing
    • Terminal-Bench Challenges — active; long-running single-task benchmarks covering inference engine code golf, Rust compiler speedup, and WASM rendering

    Task categories include system administration, security, data science, model training, coding, file operations, and scientific workflows.

    How the Evaluation Harness Works

    The Harbor framework orchestrates agent evaluations by spinning up multi-container Docker environments, logging agent actions, and verifying container state after each task attempt. It supports three agent integration modes:

    • Container installation — the agent is installed directly into the task environment (quickest path)
    • Direct integrations — agents with a Python interface (like the built-in Terminus agent) are integrated directly for full logging and API access
    • MCP Server — the harness exposes a tmux session to the agent under evaluation, enabling easy integration of MCP clients like Goose

    Harbor also supports massively parallel evaluations through cloud providers including Daytona, Modal, LangSmith, Blaxel, and Novita Sandbox. Third-party benchmarks such as SWE-Bench and Aider Polyglot are also supported via the harbor datasets list command.
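
    The lifecycle described in this section (start an environment, let the agent act, verify the resulting state, and run many such trials in parallel) might be sketched as below. This is an illustration only: run_trial, Trial, and evaluate are hypothetical names, no containers are started, and the pass/fail rule is a placeholder rather than anything Harbor does.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Trial:
    """Record of one task attempt (hypothetical, not Harbor's data model)."""
    task_id: str
    log: list[str] = field(default_factory=list)
    passed: bool = False


def run_trial(task_id: str) -> Trial:
    """Stand-in for one sandboxed attempt; a real harness would use containers."""
    trial = Trial(task_id)
    trial.log.append("environment started")      # would start the task's containers
    trial.log.append("agent ran")                # would record each agent action
    trial.passed = not task_id.endswith("hard")  # placeholder for running the tests
    trial.log.append("state verified")
    return trial


def evaluate(task_ids: list[str], workers: int = 4) -> list[Trial]:
    # Trials are independent of each other, so they can run concurrently.
    # Executor.map() returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_trial, task_ids))


trials = evaluate(["fix-build", "train-model-hard", "rotate-logs"])
print([(t.task_id, t.passed) for t in trials])
```

    Because each trial owns its environment and log, the only shared step is collecting results, which is what makes this kind of evaluation straightforward to spread across many workers or cloud sandboxes.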

    The Terminus Reference Agent

    Because some terminal agents do not support arbitrary language models, the team built Terminus, an intentionally minimal agent that gives the model no tools other than a tmux pane. The model interacts with the environment by sending keystrokes to that pane, a design meant to avoid biasing performance toward any particular model. Terminus serves as a neutral test-bed for comparing model performance across the leaderboard.
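
    A keystroke-only agent loop in the spirit of Terminus could look roughly like the sketch below. It is not the Terminus implementation: Pane, agent_loop, and the model callable are hypothetical names, and the only real interfaces used are the tmux subcommands send-keys and capture-pane. The demo swaps in a fake runner so the sketch runs without tmux installed.

```python
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], str]


def run_tmux(args: Sequence[str]) -> str:
    """Run a tmux subcommand and return its stdout (requires tmux installed)."""
    done = subprocess.run(["tmux", *args], capture_output=True, text=True, check=True)
    return done.stdout


class Pane:
    """Thin wrapper around one tmux pane; the runner is injectable for testing."""

    def __init__(self, target: str, runner: Runner = run_tmux) -> None:
        self.target = target
        self.runner = runner

    def send(self, keys: str) -> None:
        # `tmux send-keys` types the text into the pane; "Enter" submits the line.
        self.runner(["send-keys", "-t", self.target, keys, "Enter"])

    def read(self) -> str:
        # `tmux capture-pane -p` prints the visible pane contents to stdout.
        return self.runner(["capture-pane", "-p", "-t", self.target])


def agent_loop(pane: Pane, model: Callable[[str], Optional[str]], max_steps: int = 10) -> int:
    """Show the screen to the model and type what it returns; stop on None."""
    for step in range(max_steps):
        keys = model(pane.read())
        if keys is None:
            return step
        pane.send(keys)
    return max_steps


# Demo with a fake runner and a scripted "model" (both made up for illustration).
sent = []


def fake_runner(args: Sequence[str]) -> str:
    sent.append(list(args))
    return "$ " if args[0] == "capture-pane" else ""


script = iter(["echo hello", None])
steps = agent_loop(Pane("demo:0.0", fake_runner), lambda screen: next(script))
print(steps, [a for a in sent if a[0] == "send-keys"])
```

    Keeping the interface to "read the screen, send keys" means any model that can produce text can be evaluated the same way, which is the neutrality the section describes.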

    Update: v0.15.0 and Active Development

    The Harbor repository (the official harness for Terminal-Bench) reached v0.15.0 as of June 19, 2026, with the repository last updated June 20, 2026. The GitHub repository shows 2,594 stars and 1,179 forks. Terminal-Bench 3.0 and Terminal-Bench Science are both listed as actively in development, with community contributions invited via Discord and GitHub. The roadmap includes training infrastructure for RL and rollout generation, VLM-as-a-judge support, and adapters for additional benchmarks including MLE-Bench, SWE-Lancer, and RE-Bench.

    Pricing

    Open Source

    Fully free and open-source under Apache License 2.0. Use, modify, and distribute freely.

    • Full access to Terminal-Bench benchmark suite
    • Harbor evaluation harness
    • Docker-based sandboxed task environments
    • Public leaderboard access
    • Community Discord support

    Capabilities

    Key Features

    • Hand-crafted, human-verified terminal tasks
    • Dedicated Docker environment per task
    • Automated test cases for solution verification
    • Public leaderboard with task resolution rates
    • Multiple benchmark versions (1.0, 2.0, 2.1, 3.0 in progress)
    • Harbor evaluation harness for orchestrating agents
    • Terminus reference agent for neutral model comparison
    • MCP server integration for agent evaluation
    • Cloud provider support (Daytona, Modal, LangSmith, Blaxel, Novita Sandbox)
    • Parallel agent evaluations
    • Third-party benchmark support (SWE-Bench, Aider Polyglot)
    • Task registry with browsable task details
    • RL rollout generation support
    • Terminal-Bench Challenges for long-running single tasks
    • Terminal-Bench Science for scientific computing (in development)

    Integrations

    Docker
    Claude Code
    OpenHands
    Codex CLI
    Goose (MCP client)
    Daytona
    Modal
    LangSmith
    Blaxel
    Novita Sandbox
    SWE-Bench
    Aider Polyglot
    AppWorld
    Anthropic API
    OpenAI API
    Google Gemini API
    Developer

    Harbor Framework Team

    The Harbor Framework Team builds open-source infrastructure for evaluating and optimizing AI agents and language models. Their flagship products are Terminal-Bench, a benchmark suite for terminal-use agents, and Harbor, the evaluation harness that powers it. The project is described as a Stanford × Laude collaboration, with contributors including researchers and engineers from those institutions. Harbor supports parallel cloud-based evaluations and RL environment generation, making it a general-purpose platform for agent developers.

    Founded 2024
    San Francisco, CA
    $100M raised
    20 employees

    Used by

    Stanford University
    AfterQuery
    Tensorlake
    Snorkel AI

    Similar Tools

    AgentBench

    AgentBench is an open-source benchmark framework for evaluating LLMs as autonomous agents across 8 diverse environments including OS, database, web, and knowledge graph tasks.

    Agentic Harness Engineering

    An open-source observability system that automatically evolves coding-agent harnesses—system prompts, tools, middleware, skills, and memory—without changing the base model.

    Halluminate

    Halluminate provides highly realistic RL environments for financial services to train frontier AI models on economically valuable workflows in investment banking, private equity, and more.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    95 tools

    Autonomous Systems

    AI agents that can perform complex tasks with minimal human guidance.

    300 tools

    Agent Frameworks

    Tools and platforms for building and deploying custom AI agents.

    439 tools