EveryDev.ai
    ProgramBench

    LLM Evaluations

    A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT License. Install via pip or uvx.

    Available On

    API
    VS Code
    CLI

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    LLM Evaluations, Agent Harness, AI Coding Assistants

    Alternatives

    harness-kit, mdarena, WebArena
    Developer
    Meta FAIR (facebookresearch)
    Menlo Park, CA
    Est. 2013
    $16B raised

    Listed May 2026

    About ProgramBench

    ProgramBench is an open-source benchmark from Meta Superintelligence Labs, Stanford University, and Harvard University that asks a deceptively hard question: can language models rebuild programs from scratch? Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior — with no source code, no decompilation, and no internet access.

    What It Is

    ProgramBench is a software engineering evaluation benchmark designed to measure the full-stack architectural and implementation capabilities of AI coding agents. Unlike most coding benchmarks that provide method signatures, class skeletons, or product requirement documents, ProgramBench gives agents no structural hints whatsoever. The agent must choose a programming language, design the architecture, write all source code, and produce a build script entirely on its own. A candidate solution passes only if it clears all behavioral tests for a given task.

    Task Design and Scope

    The benchmark comprises 200 tasks drawn from real open-source repositories, spanning a wide range of complexity:

    • Small terminal utilities: tools like jq, ripgrep, fzf, bat, and zoxide
    • Mid-size projects: tools like pandoc, typst, tree-sitter, and DuckDB
    • Massive software projects: the PHP compiler, FFmpeg, SQLite, and GROMACS

    The test suite is generated via agent-driven fuzzing and comprises more than 248,000 total behavioral tests across all 200 tasks. All reference executables pass the test suites, confirming the benchmark is solvable by design.
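    A behavioral test of this kind can be pictured as a differential check: run the reference executable and the rebuilt candidate on the same input and compare what is observable from the outside. The sketch below is only an illustration under that assumption. The names `Observation`, `observe`, and `behaves_the_same` are hypothetical, and comparing exit code and stdout is an assumed criterion, not ProgramBench's documented test format.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one invocation (hypothetical structure)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record only its exit code and standard output."""
    result = subprocess.run(
        command,
        input=stdin,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=False,
    )
    return Observation(result.returncode, result.stdout)


def behaves_the_same(reference: list[str], candidate: list[str],
                     args: list[str], stdin: bytes = b"") -> bool:
    """True when both programs agree on exit code and stdout for one input."""
    return observe(reference + args, stdin) == observe(candidate + args, stdin)


if __name__ == "__main__":
    # Stand-ins for a reference binary and a rebuilt candidate (both hypothetical).
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(behaves_the_same(reference, candidate, [], b"hello\n"))
```

    The two stand-in programs are written differently but produce the same output, which is the point of a behavioral test: only observable behavior is compared, never the implementation.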

    Anti-Cheating Architecture

    ProgramBench takes substantial precautions to prevent shortcuts. Agents run in sandboxed containers with no internet access and no decompilation tools. The binary carries execute-only permissions, so anything that needs to read its bytes (objdump, strings, hexdump, or a disassembler) fails. The paper reports that in early trials without these restrictions, models found shortcuts such as cloning source repositories from GitHub or downloading code through package managers. The benchmark also includes a different-language ablation, which forces models to implement in a language other than the original's, to measure and control for memorization effects.
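    The execute-only idea can be illustrated with ordinary POSIX permission bits. This is a minimal sketch, assuming a POSIX system and an unprivileged user. The helper names `make_execute_only` and `can_read` are hypothetical, and the sketch does not claim to reproduce how ProgramBench's containers are actually configured.

```python
import os
import stat
import tempfile


def make_execute_only(path: str) -> int:
    """Strip read and write bits, leaving execute for user, group, and other."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)  # mode 0o111
    return stat.S_IMODE(os.stat(path).st_mode)


def can_read(path: str) -> bool:
    """Report whether the current user can open the file for reading."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except PermissionError:
        return False


if __name__ == "__main__":
    # A placeholder file standing in for a compiled binary (hypothetical).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"placeholder bytes")
        path = tmp.name
    try:
        print(oct(make_execute_only(path)))
        # Expected to be False for an unprivileged user; a privileged user
        # such as root can typically bypass permission bits.
        print(can_read(path))
    finally:
        os.remove(path)
```

    Tools that read a file's contents need the read bit, which is why inspection fails once only the execute bits remain. A sandbox that relies on this must therefore also run the agent as a non-root user.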

    Leaderboard and Current Scores

    The leaderboard is evaluated using mini-SWE-agent, chosen because it is widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and is deliberately minimal in scaffolding. As of the May 11, 2026 update, the top-performing model (GPT 5.5 at xhigh compute) fully resolves only 0.5% of the 200 tasks, which amounts to a single task, and "almost resolves" (≥95% of tests passing) 13.5%. Most evaluated models, including Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5 mini, fully resolve 0% of instances. The benchmark deliberately includes tasks of varying difficulty to distinguish model capability from scaffold design.
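    The two leaderboard metrics reduce to simple counting over per-task test results. The sketch below is a hedged illustration: `TaskResult` and `summarize` are hypothetical names, the four task results are made-up toy data rather than leaderboard data, and counting fully resolved tasks inside "almost resolved" is an assumption read from the ≥95% threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test counts for one task (hypothetical structure)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def summarize(results: list[TaskResult], almost_threshold: float = 0.95) -> dict[str, float]:
    """Percentage of tasks fully resolved and almost resolved."""
    if not results:
        raise ValueError("need at least one task result")
    count = len(results)
    # Fully resolved: every behavioral test for the task passes.
    resolved = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    # Almost resolved: pass rate at or above the threshold (assumed to
    # include fully resolved tasks as well).
    almost = sum(1 for r in results if r.pass_rate >= almost_threshold)
    return {
        "resolved_pct": 100 * resolved / count,
        "almost_resolved_pct": 100 * almost / count,
    }


if __name__ == "__main__":
    # Toy data only: four invented tasks, not real leaderboard results.
    toy = [TaskResult(10, 10), TaskResult(97, 100), TaskResult(40, 100), TaskResult(1, 2)]
    print(summarize(toy))
```

    Because a task counts as resolved only when every test passes, a model can pass most tests on most tasks and still score near zero on the headline metric. With 200 tasks, each fully resolved task moves that metric by 0.5 percentage points.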

    Update: v1.0.2

    The project reached v1.0.2 on May 11, 2026, shortly after its initial release on May 3, 2026. The accompanying paper (arXiv:2605.03546) by John Yang, Kilian Lieret, and co-authors from Meta Superintelligence Labs, Stanford, and Harvard provides detailed ablations on inference settings, cheating prevention, and metric design. A public submission portal for the leaderboard is listed as coming soon. The repository is licensed under the MIT License and hosted under the facebookresearch GitHub organization.

    Why It Matters

    ProgramBench targets a capability gap that prior benchmarks abstract away: free-form software architecture. Rather than filling in blanks, agents must make every design decision — what abstractions to introduce, how to decompose functionality across modules, and what interfaces to expose. The benchmark's authors argue that headline scores from harness-tuned, curated task sets can substantially overstate real agent capability, and deliberately avoid per-task harness tuning to provide a more honest signal. The extremely low current scores are presented as evidence of inadequate model capabilities rather than benchmark design flaws.

    Pricing

    Open Source

    • 200 program reconstruction tasks
    • 248,000+ behavioral tests
    • Public leaderboard access
    • HuggingFace dataset access
    • pip and uvx installation

    Capabilities

    Key Features

    • 200 real-world program reconstruction tasks
    • 248,000+ behavioral tests via agent-driven fuzzing
    • Sandboxed execution with no internet access
    • No decompilation allowed (execute-only binary permissions)
    • Public leaderboard with resolved and almost-resolved metrics
    • Extended results with per-task and per-model breakdowns
    • Different-language ablation to control for memorization
    • Installable via pip or uvx
    • HuggingFace dataset of test cases
    • Tasks range from small CLI tools to massive compilers

    Integrations

    mini-SWE-agent
    HuggingFace Datasets
    uv / uvx
    pip
    OpenAI GPT models
    Anthropic Claude models
    Google Gemini models

    Developer

    Meta FAIR (facebookresearch)

    Meta FAIR (Fundamental AI Research) publishes open-source AI research tools, datasets, and benchmarks. The facebookresearch GitHub organization hosts projects spanning NLP, computer vision, and software engineering evaluation. ProgramBench is a collaboration between Meta Superintelligence Labs, Stanford University, and Harvard University, led by researchers including John Yang, Kilian Lieret, and Ofir Press.

    Founded 2013
    Menlo Park, CA
    $16B raised
    70,000 employees

    Used by

    Reliance Industries
    AWS
    NVIDIA
    Microsoft

    Similar Tools

    harness-kit

    A Python toolkit for building and evaluating AI agent harnesses, enabling structured testing and benchmarking of LLM-based agents.

    mdarena

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.

    WebArena

    A standalone, self-hostable web environment for building and evaluating autonomous web agents on realistic tasks.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    77 tools

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    79 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    457 tools