EveryDev.ai
    ExploitBench

    LLM Evaluations
    Featured

    ExploitBench measures how far AI agents can climb the exploitation ladder, from reaching vulnerable code to achieving arbitrary code execution, using a five-tier grading system against real CVEs.

    At a Glance

    Pricing
    Open Source

    Fully open-source under MIT License. Free to use, modify, and distribute.

    Available On

    CLI
    API
    Web

    Resources

    Website
    Docs
    GitHub
    llms.txt

    Topics

    LLM Evaluations
    Security Testing
    Agent Harness

    Alternatives

    Verifiers
    Giskard
    InferenceBench
    Developer
    ExploitBench
    Pittsburgh, PA
    Est. 2024

    Listed May 2026

    About ExploitBench

    ExploitBench is an open-source AI security benchmark created by Seunghyun Lee and Prof. David Brumley at Carnegie Mellon University. It evaluates AI agent capability across the full exploitation pipeline — not just whether a bug can be triggered, but how far an agent can progress toward arbitrary code execution. The project is publicly available on GitHub under the MIT License and publishes live leaderboard results at exploitbench.ai.

    What It Is

    ExploitBench is a benchmark framework for measuring AI agent exploitation capability against real-world vulnerabilities. Unlike prior benchmarks that score a binary pass/fail on whether an exploit works, ExploitBench grades each of 16 distinct capabilities organized into five tiers — from reaching vulnerable code (T5) up through crash reproduction (T4), target-specific primitives (T3), generic memory primitives (T2), and full arbitrary code execution (T1). The first published benchmark, v8-bench, targets V8 — the JavaScript and WebAssembly engine inside Chrome, Edge, Node.js, and Cloudflare Workers — and runs against production V8 with the V8 security sandbox enabled.

    The Five-Tier Exploitation Ladder

    The benchmark's core design is a hierarchical capability model that makes partial results measurable:

    • T1 – Full control: Control-flow hijack with arbitrary code execution (ACE), proven by a per-round shellcode/ROP payload.
    • T2 – Generic primitives: Arbitrary read/write and information leaks outside the V8 sandbox boundary.
    • T3 – Target primitives: V8-specific primitives (addrof, fakeobj, caged_read, caged_write) that turn a bug into reusable exploit building blocks inside the sandbox.
    • T4 – Reproduction: Crash, sanitizer report, or differential behavior confirming the bug was reached — the level targeted by prior benchmarks such as CyberGym, CyBench, and SEC-bench Pro.
    • T5 – Coverage: Reaching the patched function or line without a crash signal.

    Every tier is graded mechanically by a deterministic verifier built into V8's standalone shell (d8), with no LLM-as-judge and no human review in the loop.
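
The ladder above can be pictured as an ordered enumeration. The following is a minimal sketch, not ExploitBench's own data model: the `Tier` enum, its member names, and the `highest_tier` helper are hypothetical names chosen for illustration, and summarizing a run by its deepest verified tier is an assumption rather than something the listing states.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Tiers as described in the listing; a lower number means deeper exploitation."""

    T1_FULL_CONTROL = 1        # control-flow hijack with arbitrary code execution
    T2_GENERIC_PRIMITIVES = 2  # arbitrary read/write and leaks outside the V8 sandbox
    T3_TARGET_PRIMITIVES = 3   # addrof, fakeobj, caged_read, caged_write
    T4_REPRODUCTION = 4        # crash, sanitizer report, or differential behavior
    T5_COVERAGE = 5            # patched function or line reached


def highest_tier(verified: set[Tier]) -> Optional[Tier]:
    """Return the deepest tier in the verified set, or None if the set is empty."""
    # Hypothetical summary rule: T1 is the top of the ladder, so take the minimum.
    return min(verified, default=None)
```

For example, a run that only reached the patched code and reproduced the crash would be summarized as `highest_tier({Tier.T5_COVERAGE, Tier.T4_REPRODUCTION})`, which evaluates to `Tier.T4_REPRODUCTION`.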

    Architecture and Setup Path

    ExploitBench drives any model exposed via direct provider API (Anthropic native SDK, OpenAI via LiteLLM, Gemini, OpenRouter) or an OpenAI-compatible gateway. Evaluation environments run inside Docker containers that expose an MCP server interface; the agent calls setup(), exec(), read_file(), write_file(), list_directory(), and grade() to drive the episode end-to-end. Pre-built V8 evaluation images (~65–70 GB each) are published to GitHub Container Registry and pulled on first use. The benchmark config is a YAML file specifying models, environments, seeds, turn budgets, and token budgets. Results are stored in a local SQLite database and can be exported as JSON, CSV, or Markdown.
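
To make the tool-driven episode concrete, here is a rough sketch of the call pattern. Only the six tool names come from the listing; every signature, the `Environment` protocol, the in-memory `FakeEnvironment`, the `run_episode` loop, and the `write_poc` agent step are assumptions invented for illustration. The real environments run in Docker behind an MCP server, which this sketch does not attempt to reproduce.

```python
from __future__ import annotations

from typing import Callable, Protocol


class Environment(Protocol):
    """Tool names are from the listing; all signatures here are assumed."""

    def setup(self) -> str: ...
    def exec(self, command: str) -> str: ...
    def read_file(self, path: str) -> str: ...
    def write_file(self, path: str, content: str) -> None: ...
    def list_directory(self, path: str) -> list[str]: ...
    def grade(self) -> int: ...


class FakeEnvironment:
    """In-memory stand-in so the loop can run without Docker or MCP."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def setup(self) -> str:
        return "task description (placeholder)"

    def exec(self, command: str) -> str:
        return f"ran: {command}"

    def read_file(self, path: str) -> str:
        return self.files[path]

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    def list_directory(self, path: str) -> list[str]:
        return sorted(p for p in self.files if p.startswith(path))

    def grade(self) -> int:
        # Placeholder scoring: one point once any file has been written.
        return 1 if self.files else 0


def run_episode(
    env: Environment,
    act: Callable[[Environment, int], None],
    max_turns: int,
) -> int:
    """Call setup once, let the agent act for max_turns turns, then grade."""
    env.setup()
    for turn in range(max_turns):
        act(env, turn)
    return env.grade()


def write_poc(env: Environment, turn: int) -> None:
    # Hypothetical agent step: write a proof-of-concept file on the first turn.
    if turn == 0:
        env.write_file("/work/poc.js", "// placeholder")
```

Calling `run_episode(FakeEnvironment(), write_poc, max_turns=3)` walks the setup, act, grade sequence once. A turn budget is modeled here as a simple loop bound; token budgets, which the listing also mentions, are left out of the sketch.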

    AutoNudge and Evaluation Methodology

    The benchmark supports an optional AutoNudge mechanism that automatically reminds a stalled or quitting model to grade its progress and continue working, with no human in the loop. Results are published both with and without AutoNudge so the two can be compared. The leaderboard on exploitbench.ai reports the mean capability score (maximum 16) across all 41 V8 CVEs in v8-bench. The site notes that Claude Mythos Preview and GPT-5.5 achieve full arbitrary code execution on production V8 with the security sandbox enabled across multiple CVEs.
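
As an illustration of the two ideas in this paragraph, the sketch below averages per-CVE capability counts and shows one possible nudge trigger. The function names, the `stall_limit` parameter and its default, and the trigger rule itself are assumptions; the listing does not describe how AutoNudge decides that a model has stalled. The CVE keys in the usage example are placeholders, not real identifiers.

```python
from __future__ import annotations

from statistics import mean

MAX_CAPABILITIES = 16  # per-CVE maximum stated in the listing


def mean_capability_score(per_cve_scores: dict[str, int]) -> float:
    """Average the per-CVE capability counts; each must lie in 0..16."""
    for cve, score in per_cve_scores.items():
        if not 0 <= score <= MAX_CAPABILITIES:
            raise ValueError(f"{cve}: score {score} outside 0..{MAX_CAPABILITIES}")
    return mean(per_cve_scores.values())


def should_nudge(
    turns_since_grade: int,
    model_wants_to_stop: bool,
    stall_limit: int = 5,  # hypothetical threshold
) -> bool:
    """Hypothetical trigger: nudge when the model quits or has not graded recently."""
    return model_wants_to_stop or turns_since_grade >= stall_limit
```

With two placeholder entries, `mean_capability_score({"CVE-A": 4, "CVE-B": 8})` gives 6. On the real leaderboard the average would run over all 41 CVEs in v8-bench.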

    Current Status: v8-bench Launch

    The repository was created in May 2026 and the v8-bench benchmark — the first ExploitBench release — launched alongside the public website. The GitHub README documents milestone status: multi-model V8 benchmarking via LiteLLM (M1) and the public results site (M2) are shipped; engineering foundation work (M3) including the rlenv-mcp adapter and capability taxonomy is in progress; detect/exploit/patch tasks for open-source images (M4) are pending. The project is MIT-licensed with 193 stars and 11 forks as of the last recorded update. Academic researchers and model providers can contact the team at contact@exploitbench.ai for replication support or to have new models added to the leaderboard.

    Pricing

    Open Source

    Fully open-source under MIT License. Free to use, modify, and distribute.

    • Full benchmark framework source code
    • CLI with all benchmark, audit, and aggregate commands
    • Multi-model support (Anthropic, OpenAI, Gemini, OpenRouter)
    • Docker-based V8 evaluation environments
    • Pre-built images on GitHub Container Registry

    Capabilities

    Key Features

    • Five-tier exploitation ladder grading (T1–T5)
    • 16 distinct capability checks per CVE
    • Deterministic verifier built into V8's d8 shell (no LLM-as-judge)
    • 41 V8 CVE environments in v8-bench
    • AutoNudge mechanism for stalled agents
    • Multi-model support via Anthropic SDK, LiteLLM, OpenAI-compatible gateways
    • Docker-based isolated evaluation environments with MCP server interface
    • Pre-built V8 evaluation images on GitHub Container Registry
    • YAML/JSON benchmark configuration
    • SQLite results database with JSON/CSV/Markdown export
    • FastAPI read backend for local DB querying
    • Audit bundle generation with SHA256 manifests
    • Cost tracking per episode
    • Resume and retry-failed episode support
    • CLI with benchmark, aggregate, audit, summary, doctor commands
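
The "audit bundle generation with SHA256 manifests" item can be illustrated with a short, generic sketch of what a SHA-256 manifest is. This is not ExploitBench's bundle format: `build_manifest`, `verify_manifest`, and the path-to-digest dictionary layout are hypothetical, and only the standard-library `hashlib` and `pathlib` calls are real APIs.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def build_manifest(bundle_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to bundle_dir) to its SHA-256 hex digest."""
    manifest: dict[str, str] = {}
    for path in sorted(bundle_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(bundle_dir).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def verify_manifest(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries whose file is missing or whose digest has changed."""
    current = build_manifest(bundle_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

An empty list from `verify_manifest` means every file recorded in the manifest is still present with the same digest. Files added to the directory after the manifest was built are not reported by this sketch.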

    Integrations

    Anthropic Claude (native SDK)
    OpenAI GPT (via LiteLLM)
    Google Gemini (via LiteLLM)
    OpenRouter
    vLLM
    Ollama
    Docker
    GitHub Container Registry (GHCR)
    MCP (Model Context Protocol)
    Claude Code
    Codex CLI

    Developer

    ExploitBench Team

    ExploitBench builds open-source benchmarks for measuring AI agent exploitation capability against real-world vulnerabilities. The project is led by Seunghyun Lee and Prof. David Brumley from Carnegie Mellon University. It publishes live leaderboard results and per-CVE drilldowns at exploitbench.ai, and releases evaluation environments as Docker images on GitHub Container Registry. The team supports academic researchers and model providers seeking to replicate experiments or add new models to the benchmark.

    Founded 2024
    Pittsburgh, PA
    15 employees

    Used by

    Anthropic
    Bugcrowd
    Carnegie Mellon University

    Similar Tools

    Verifiers

    An open-source Python library by Prime Intellect for creating environments to train and evaluate LLMs using reinforcement learning.

    Giskard

    Automated testing platform for LLM agents that detects hallucinations, security vulnerabilities, and quality issues through continuous red teaming.

    InferenceBench

    An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    87 tools

    Security Testing

    Tools for automated security testing and penetration testing.

    11 tools

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    87 tools