ExploitBench

Name: ExploitBench
Availability: Online only
Author: ExploitBench

ExploitBench measures how far AI agents can climb the exploitation ladder, from reaching vulnerable code to achieving arbitrary code execution, using a five-tier grading system against real CVEs.

At a Glance

Pricing

Open Source

Fully open-source under MIT License. Free to use, modify, and distribute.

Available On: CLI, API, Web

ExploitBench, Pittsburgh, PA, Est. 2024

Listed May 2026

About ExploitBench

ExploitBench is an open-source AI security benchmark created by Seunghyun Lee and Prof. David Brumley at Carnegie Mellon University. It evaluates AI agent capability across the full exploitation pipeline — not just whether a bug can be triggered, but how far an agent can progress toward arbitrary code execution. The project is publicly available on GitHub under the MIT License and publishes live leaderboard results at exploitbench.ai.

What It Is

ExploitBench is a benchmark framework for measuring AI agent exploitation capability against real-world vulnerabilities. Unlike prior benchmarks that score a binary pass/fail on whether an exploit works, ExploitBench grades each of 16 distinct capabilities organized into five tiers — from reaching vulnerable code (T5) up through crash reproduction (T4), target-specific primitives (T3), generic memory primitives (T2), and full arbitrary code execution (T1). The first published benchmark, v8-bench, targets V8 — the JavaScript and WebAssembly engine inside Chrome, Edge, Node.js, and Cloudflare Workers — and runs against production V8 with the V8 security sandbox enabled.

The Five-Tier Exploitation Ladder

The benchmark's core design is a hierarchical capability model that makes partial results measurable:

T1 – Full control: Control-flow hijack with arbitrary code execution (ACE), proven by a per-round shellcode/ROP payload.
T2 – Generic primitives: Arbitrary read/write and information leaks outside the V8 sandbox boundary.
T3 – Target primitives: V8-specific primitives (addrof, fakeobj, caged_read, caged_write) that turn a bug into reusable exploit building blocks inside the sandbox.
T4 – Reproduction: Crash, sanitizer report, or differential behavior confirming the bug was reached — the level targeted by prior benchmarks such as CyberGym, CyBench, and SEC-bench Pro.
T5 – Coverage: Reaching the patched function or line without a crash signal.

Every tier is graded mechanically by a deterministic verifier built into V8's standalone shell (d8), with no LLM-as-judge and no human review in the loop.
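The ladder above can be modeled as an ordered enumeration, which makes partial progress easy to report. The following is a minimal sketch, not ExploitBench's own code: the listing names only the four T3 primitives, so every other capability name in the map is an invented placeholder, and the full 16-capability taxonomy is not reproduced here.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Tiers from the ladder above; a lower number means a stronger result."""
    FULL_CONTROL = 1        # T1: arbitrary code execution
    GENERIC_PRIMITIVES = 2  # T2: read/write and leaks outside the sandbox
    TARGET_PRIMITIVES = 3   # T3: addrof, fakeobj, caged_read, caged_write
    REPRODUCTION = 4        # T4: crash, sanitizer report, differential behavior
    COVERAGE = 5            # T5: patched function or line reached


# Hypothetical capability-to-tier map. Only the four T3 names come from the
# listing; every other key is an invented placeholder.
CAPABILITY_TIER = {
    "ace": Tier.FULL_CONTROL,
    "arbitrary_read": Tier.GENERIC_PRIMITIVES,
    "arbitrary_write": Tier.GENERIC_PRIMITIVES,
    "addrof": Tier.TARGET_PRIMITIVES,
    "fakeobj": Tier.TARGET_PRIMITIVES,
    "caged_read": Tier.TARGET_PRIMITIVES,
    "caged_write": Tier.TARGET_PRIMITIVES,
    "crash": Tier.REPRODUCTION,
    "coverage": Tier.COVERAGE,
}


def highest_tier(passed):
    """Return the strongest tier among the passed capabilities, or None."""
    tiers = [CAPABILITY_TIER[name] for name in passed]
    return min(tiers) if tiers else None


def capability_score(passed):
    """Count distinct passed capabilities (partial credit, not pass/fail)."""
    return len(set(passed))
```

With this sketch, an agent that reaches the patched code, reproduces the crash, and builds addrof would report Tier.TARGET_PRIMITIVES as its highest tier and a capability score of 3, where a binary benchmark would record only a failed exploit.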

Architecture and Setup Path

ExploitBench drives any model reachable through a direct provider API (the Anthropic native SDK, OpenAI and Gemini via LiteLLM, OpenRouter) or through an OpenAI-compatible gateway. Evaluation environments run inside Docker containers that expose an MCP server interface; the agent calls setup(), exec(), read_file(), write_file(), list_directory(), and grade() to drive the episode end-to-end. Pre-built V8 evaluation images (roughly 65–70 GB each) are published to GitHub Container Registry and pulled on first use. The benchmark config is a YAML file specifying models, environments, seeds, turn budgets, and token budgets. Results are stored in a local SQLite database and can be exported as JSON, CSV, or Markdown.
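To make the episode flow concrete, here is a rough sketch of a driver loop over the six tool names listed above. It is an illustration under stated assumptions, not the project's implementation: the argument and return types are guesses, and StubEnvironment, run_episode, and the policy callable are hypothetical names standing in for the Docker-hosted MCP server and the model.

```python
from typing import Protocol


class Environment(Protocol):
    """The six tool names from the listing; signatures here are guesses."""
    def setup(self) -> str: ...
    def exec(self, command: str) -> str: ...
    def read_file(self, path: str) -> str: ...
    def write_file(self, path: str, content: str) -> None: ...
    def list_directory(self, path: str) -> list[str]: ...
    def grade(self) -> int: ...


class StubEnvironment:
    """In-memory stand-in so the loop runs without Docker or a model."""
    def __init__(self):
        self.files = {}

    def setup(self):
        return "Task: write exploit.js"

    def exec(self, command):
        return f"ran: {command}"

    def read_file(self, path):
        return self.files[path]

    def write_file(self, path, content):
        self.files[path] = content

    def list_directory(self, path):
        return sorted(self.files)

    def grade(self):
        # Toy rule: one point once an exploit file exists.
        return 1 if "exploit.js" in self.files else 0


def run_episode(env: Environment, policy, max_turns: int) -> int:
    """Call one tool per turn until the policy stops or the turn budget ends."""
    observation = env.setup()
    best = 0
    for _ in range(max_turns):
        action = policy(observation)  # (tool_name, args) or None to stop
        if action is None:
            break
        tool, args = action
        observation = getattr(env, tool)(*args)
        if tool == "grade":
            best = max(best, observation)
    return best
```

A scripted policy that writes exploit.js and then calls grade() returns a best score of 1 against the stub. In a real run the policy would be a model call and the turn budget would come from the YAML config.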

AutoNudge and Evaluation Methodology

The benchmark supports an optional AutoNudge mechanism that automatically reminds a stalled or quitting model to grade its progress and continue working, with no human in the loop. Results are published both with and without AutoNudge so the two can be compared. The leaderboard on exploitbench.ai reports the mean capability score (out of a maximum of 16) across all 41 V8 CVEs in v8-bench. According to the site, Claude Mythos Preview and GPT-5.5 achieve full arbitrary code execution on production V8, with the security sandbox enabled, across multiple CVEs.
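The leaderboard metric is a plain average of per-CVE capability counts. The sketch below shows that arithmetic, assuming each CVE contributes an integer score between 0 and 16. The function name, the CVE keys, and the scores are invented for illustration and are not real results.

```python
from statistics import mean

MAX_CAPABILITIES = 16  # per-CVE maximum stated in the listing


def mean_capability_score(per_cve_scores):
    """Average the per-CVE capability counts for one model."""
    for cve, score in per_cve_scores.items():
        if not 0 <= score <= MAX_CAPABILITIES:
            raise ValueError(f"{cve}: score {score} outside 0..{MAX_CAPABILITIES}")
    return mean(per_cve_scores.values())


# Invented scores for three placeholder environments, not real results.
without_nudge = {"cve-a": 4, "cve-b": 0, "cve-c": 11}
with_nudge = {"cve-a": 6, "cve-b": 2, "cve-c": 13}
nudge_gain = mean_capability_score(with_nudge) - mean_capability_score(without_nudge)
```

Here the placeholder means are 5 without AutoNudge and 7 with it, so nudge_gain is 2. Computing both figures from the same set of CVEs is what makes the with and without comparison meaningful.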

Current Status: v8-bench Launch

The repository was created in May 2026 and the v8-bench benchmark — the first ExploitBench release — launched alongside the public website. The GitHub README documents milestone status: multi-model V8 benchmarking via LiteLLM (M1) and the public results site (M2) are shipped; engineering foundation work (M3) including the rlenv-mcp adapter and capability taxonomy is in progress; detect/exploit/patch tasks for open-source images (M4) are pending. The project is MIT-licensed with 193 stars and 11 forks as of the last recorded update. Academic researchers and model providers can contact the team at contact@exploitbench.ai for replication support or to have new models added to the leaderboard.

Pricing

The open-source release includes:

Full benchmark framework source code
CLI with all benchmark, audit, and aggregate commands
Multi-model support (Anthropic, OpenAI, Gemini, OpenRouter)
Docker-based V8 evaluation environments
Pre-built images on GitHub Container Registry

Capabilities

Key Features

Five-tier exploitation ladder grading (T1–T5)
16 distinct capability checks per CVE
Deterministic verifier built into V8's d8 shell (no LLM-as-judge)
41 V8 CVE environments in v8-bench
AutoNudge mechanism for stalled agents
Multi-model support via Anthropic SDK, LiteLLM, OpenAI-compatible gateways
Docker-based isolated evaluation environments with MCP server interface
Pre-built V8 evaluation images on GitHub Container Registry
YAML/JSON benchmark configuration
SQLite results database with JSON/CSV/Markdown export
FastAPI read backend for local DB querying
Audit bundle generation with SHA256 manifests
Cost tracking per episode
Resume and retry-failed episode support
CLI with benchmark, aggregate, audit, summary, doctor commands
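As an illustration of the SQLite-to-JSON/CSV export named in the feature list, the sketch below reads rows from a results table and serializes them with the Python standard library. The episodes table, its columns, and the export_results function are assumptions made for this example; the listing does not describe ExploitBench's actual schema or export command.

```python
import csv
import io
import json
import sqlite3


def export_results(conn):
    """Return (json_text, csv_text) for every row in the episodes table."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT model, cve, score, cost_usd FROM episodes ORDER BY model, cve"
    ).fetchall()
    records = [dict(row) for row in rows]

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["model", "cve", "score", "cost_usd"])
    writer.writeheader()
    writer.writerows(records)
    return json.dumps(records, indent=2), buffer.getvalue()


# Hypothetical schema and rows, held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (model TEXT, cve TEXT, score INTEGER, cost_usd REAL)")
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?, ?)",
    [("model-x", "cve-a", 4, 1.25), ("model-x", "cve-b", 0, 0.5)],
)
json_text, csv_text = export_results(conn)
```

Keeping the query in one place and deriving both formats from the same list of records means the JSON and CSV outputs cannot drift apart.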

Integrations

Anthropic Claude (native SDK)

OpenAI GPT (via LiteLLM)

Google Gemini (via LiteLLM)

OpenRouter

vLLM

Ollama

Docker

GitHub Container Registry (GHCR)

MCP (Model Context Protocol)

Claude Code

Codex CLI

