WebArena
A standalone, self-hostable web environment for building and evaluating autonomous web agents on realistic tasks.
At a Glance
About WebArena
WebArena is an open-source benchmark environment for building and evaluating autonomous web agents, published as a research project by the web-arena-x organization. It provides a self-hostable suite of realistic websites — including a shopping site, Reddit clone, GitLab instance, map, and Wikipedia mirror — against which agents can be tested on 812 end-to-end tasks. The project was introduced in a paper presented at NeurIPS 2024 (Oral) and is available on GitHub under the Apache License 2.0.
What It Is
WebArena is a research-grade evaluation harness that simulates a realistic web browsing environment for autonomous agents. Rather than relying on live websites, it packages self-contained Docker-based web applications that agents can interact with through a Playwright-driven browser interface. The environment exposes observations as accessibility trees or HTML and accepts structured actions (clicks, typing, navigation), making it compatible with LLM-based agents that reason about web content.
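To make the structured-action idea concrete, here is a minimal sketch of turning an agent's text output into a typed action. The grammar (`click [id]`, `type [id] [text]`, `goto [url]`) and the names `Action` and `parse_action` are illustrative assumptions, not WebArena's own parser, which may use a different format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical action grammar for illustration only; WebArena's own
# action format and parser may differ.
_ACTION_RE = re.compile(r"^(click|type|goto)((?:\s+\[[^\]]*\])*)\s*$")
_ARG_RE = re.compile(r"\[([^\]]*)\]")

# Number of bracketed arguments each action kind expects.
_ARITY = {"click": 1, "type": 2, "goto": 1}


@dataclass(frozen=True)
class Action:
    kind: str                   # "click", "type", or "goto"
    target: str                 # element id or URL
    text: Optional[str] = None  # only set for "type"


def parse_action(raw: str) -> Action:
    """Parse a string such as 'type [42] [hello]' into an Action."""
    match = _ACTION_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {raw!r}")
    kind = match.group(1)
    args = _ARG_RE.findall(match.group(2))
    if len(args) != _ARITY[kind]:
        raise ValueError(
            f"{kind} expects {_ARITY[kind]} argument(s), got {len(args)}"
        )
    return Action(kind, args[0], args[1] if kind == "type" else None)
```

Rejecting malformed strings with an exception, rather than guessing, keeps an agent's formatting mistakes visible during evaluation.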
Architecture and Setup
The environment is built on Python 3.10+ and uses Playwright for browser automation. Each test example is defined by a JSON config file, and the full benchmark consists of 812 such examples. Researchers spin up the included Docker images for each website, configure environment variables pointing to each service, and then run agents against the local stack. The repository includes:
- Docker resources and an Amazon Machine Image (AMI) with all websites pre-installed
- Auto-login cookie generation for all bundled websites
- A ScriptBrowserEnv class with an OpenAI Gym-style API (reset, step)
- Baseline prompt-based agents using Chain-of-Thought and ReAct-style reasoning
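Before running agents, it can help to verify that the local stack is configured. The sketch below checks that one environment variable per site is set and loads a single task config. The variable names and the JSON fields (`task_id`, `start_url`, `intent`) are assumptions made for illustration; confirm both against the repository's setup instructions and config files.

```python
import json
import os
from typing import Mapping

# Assumed variable names, one per bundled site; confirm against the
# repository's setup instructions before relying on them.
REQUIRED_VARS = (
    "SHOPPING", "SHOPPING_ADMIN", "REDDIT",
    "GITLAB", "MAP", "WIKIPEDIA", "HOMEPAGE",
)


def missing_site_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_task(path: str) -> dict:
    """Load one task config and check that a few assumed fields exist."""
    with open(path, encoding="utf-8") as fh:
        task = json.load(fh)
    for key in ("task_id", "start_url", "intent"):  # hypothetical schema
        if key not in task:
            raise KeyError(f"{path}: missing field {key!r}")
    return task
```

A launcher script could call `missing_site_vars()` first and exit with a clear message if the returned list is non-empty, instead of failing later inside the browser.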
Benchmark Scope and Related Projects
The webarena.dev project page describes WebArena as part of a broader suite of autonomous web agent benchmarks:
- WebArena — the original realistic web environment (NeurIPS 2024 Oral)
- WebArena-Infinity — continuous and scalable evaluation in evolving environments
- VisualWebArena — multimodal agents on visual web tasks (ACL 2024)
- TheAgentCompany — LLM agents on consequential real-world tasks in a simulated company (ICML 2025)
The web navigation infrastructure has also been extended by AgentLab (ServiceNow), which adds parallel experiment support via BrowserGym, integration of multiple benchmarks, and a unified leaderboard.
Update: v0.2.0 and December 2024 Enhancements
The latest tagged release is v0.2.0 (October 2023), which stabilized the annotation dataset after a full re-examination and bug-fix pass; the repository notes that no major annotation updates are expected beyond this version. Human annotator trajectories for approximately 170 tasks were released in December 2023 for reference. In December 2024, the maintainers began recommending AgentLab as the framework for running experiments, citing its parallel execution, unified leaderboard reporting, and improved edge-case handling. A public leaderboard is maintained via Google Sheets.
Who It Is For
WebArena targets AI researchers and practitioners building or evaluating LLM-based web agents. It is particularly suited for teams working on browser automation, agent reasoning, and multi-step task completion in realistic web contexts. The Gym-style Python API lowers the barrier for integrating new agent architectures, and the modular prompt constructor design makes it straightforward to swap in custom prompting strategies.
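The following sketch illustrates why a reset/step interface and a swappable prompt constructor make new agents easy to plug in. It does not use WebArena itself: `FakeEnv`, `run_episode`, and the two-value return from `step` are simplified, hypothetical stand-ins, and the real environment's `step` likely returns more information in the usual Gym style.

```python
from typing import Callable

# (intent, observation) -> prompt; swap this to change prompting strategy.
PromptConstructor = Callable[[str, str], str]
# prompt -> action string; in practice this would wrap an LLM call.
Policy = Callable[[str], str]


class FakeEnv:
    """Stand-in environment so the loop runs without a browser."""

    def __init__(self, pages: list[str]) -> None:
        self._pages = pages
        self._i = 0

    def reset(self) -> str:
        self._i = 0
        return self._pages[0]

    def step(self, action: str) -> tuple[str, bool]:
        # "stop" ends the episode; any other action advances one page.
        if action == "stop":
            return self._pages[self._i], True
        self._i = min(self._i + 1, len(self._pages) - 1)
        return self._pages[self._i], False


def run_episode(
    env: FakeEnv,
    build_prompt: PromptConstructor,
    policy: Policy,
    intent: str,
    max_steps: int = 10,
) -> list[str]:
    """Run one episode and return the actions taken, in order."""
    obs = env.reset()
    actions: list[str] = []
    for _ in range(max_steps):
        action = policy(build_prompt(intent, obs))
        actions.append(action)
        obs, done = env.step(action)
        if done:
            break
    return actions
```

Because the loop only depends on `reset`, `step`, and two callables, a different prompting strategy or agent can be tried by passing different functions, with no change to the loop.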
Pricing
Open Source
Fully free and open-source under Apache License 2.0. Self-host the full benchmark environment.
- Apache License 2.0
- Full source code on GitHub
- 812 evaluation tasks
- Docker-based self-hosted websites
- Python API
Capabilities
Key Features
- Self-hostable web environment with Docker-based websites
- 812 end-to-end evaluation tasks
- OpenAI Gym-style Python API (reset/step)
- Accessibility tree and HTML observation spaces
- Playwright-based browser automation
- Auto-login cookie generation for bundled sites
- Baseline Chain-of-Thought and ReAct agents
- Amazon Machine Image with pre-installed websites
- Public leaderboard via Google Sheets
- Human annotator trajectory recordings
- Zeno integration for result analysis
- Modular prompt constructor design
