ExploitBench
ExploitBench measures how far AI agents can climb the exploitation ladder, from reaching vulnerable code to achieving arbitrary code execution, using a five-tier grading system against real CVEs.
At a Glance
About ExploitBench
ExploitBench is an open-source AI security benchmark created by Seunghyun Lee and Prof. David Brumley at Carnegie Mellon University. It evaluates AI agent capability across the full exploitation pipeline — not just whether a bug can be triggered, but how far an agent can progress toward arbitrary code execution. The project is publicly available on GitHub under the MIT License and publishes live leaderboard results at exploitbench.ai.
What It Is
ExploitBench is a benchmark framework for measuring AI agent exploitation capability against real-world vulnerabilities. Unlike prior benchmarks that score a binary pass/fail on whether an exploit works, ExploitBench grades each of 16 distinct capabilities organized into five tiers — from reaching vulnerable code (T5) up through crash reproduction (T4), target-specific primitives (T3), generic memory primitives (T2), and full arbitrary code execution (T1). The first published benchmark, v8-bench, targets V8 — the JavaScript and WebAssembly engine inside Chrome, Edge, Node.js, and Cloudflare Workers — and runs against production V8 with the V8 security sandbox enabled.
The Five-Tier Exploitation Ladder
The benchmark's core design is a hierarchical capability model that makes partial results measurable:
- T1 – Full control: Control-flow hijack with arbitrary code execution (ACE), proven by a per-round shellcode/ROP payload.
- T2 – Generic primitives: Arbitrary read/write and information leaks outside the V8 sandbox boundary.
- T3 – Target primitives: V8-specific primitives (
addrof,fakeobj,caged_read,caged_write) that turn a bug into reusable exploit building blocks inside the sandbox. - T4 – Reproduction: Crash, sanitizer report, or differential behavior confirming the bug was reached — the level targeted by prior benchmarks such as CyberGym, CyBench, and SEC-bench Pro.
- T5 – Coverage: Reaching the patched function or line without a crash signal.
Every tier is graded mechanically by a deterministic verifier built into V8's standalone shell (d8), with no LLM-as-judge and no human review in the loop.
Architecture and Setup Path
ExploitBench drives any model exposed via direct provider API (Anthropic native SDK, OpenAI via LiteLLM, Gemini, OpenRouter) or an OpenAI-compatible gateway. Evaluation environments run inside Docker containers that expose an MCP server interface; the agent calls setup(), exec(), read_file(), write_file(), list_directory(), and grade() to drive the episode end-to-end. Pre-built V8 evaluation images (~65–70 GB each) are published to GitHub Container Registry and pulled on first use. The benchmark config is a YAML file specifying models, environments, seeds, turn budgets, and token budgets. Results are stored in a local SQLite database and can be exported as JSON, CSV, or Markdown.
AutoNudge and Evaluation Methodology
The benchmark supports an optional AutoNudge mechanism that automatically reminds a stalled or quitting model to grade its progress and continue working, with no human in the loop. Results are published both with and without AutoNudge enabled to allow comparison. The leaderboard on exploitbench.ai reports mean capability score (out of a max of 16) across all 41 V8 CVEs in v8-bench. The site notes that Claude Mythos Preview and GPT-5.5 achieve full arbitrary code execution on production V8 with the security sandbox enabled across multiple CVEs.
Current Status: v8-bench Launch
The repository was created in May 2026 and the v8-bench benchmark — the first ExploitBench release — launched alongside the public website. The GitHub README documents milestone status: multi-model V8 benchmarking via LiteLLM (M1) and the public results site (M2) are shipped; engineering foundation work (M3) including the rlenv-mcp adapter and capability taxonomy is in progress; detect/exploit/patch tasks for open-source images (M4) are pending. The project is MIT-licensed with 193 stars and 11 forks as of the last recorded update. Academic researchers and model providers can contact the team at contact@exploitbench.ai for replication support or to have new models added to the leaderboard.
Community Discussions
Be the first to start a conversation about ExploitBench
Share your experience with ExploitBench, ask questions, or help others learn from your insights.
Pricing
Open Source
Fully open-source under MIT License. Free to use, modify, and distribute.
- Full benchmark framework source code
- CLI with all benchmark, audit, and aggregate commands
- Multi-model support (Anthropic, OpenAI, Gemini, OpenRouter)
- Docker-based V8 evaluation environments
- Pre-built images on GitHub Container Registry
Capabilities
Key Features
- Five-tier exploitation ladder grading (T1–T5)
- 16 distinct capability checks per CVE
- Deterministic verifier built into V8's d8 shell (no LLM-as-judge)
- 41 V8 CVE environments in v8-bench
- AutoNudge mechanism for stalled agents
- Multi-model support via Anthropic SDK, LiteLLM, OpenAI-compatible gateways
- Docker-based isolated evaluation environments with MCP server interface
- Pre-built V8 evaluation images on GitHub Container Registry
- YAML/JSON benchmark configuration
- SQLite results database with JSON/CSV/Markdown export
- FastAPI read backend for local DB querying
- Audit bundle generation with SHA256 manifests
- Cost tracking per episode
- Resume and retry-failed episode support
- CLI with benchmark, aggregate, audit, summary, doctor commands
