# ProgramBench

> A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

ProgramBench is an open-source benchmark from Meta Superintelligence Labs, Stanford University, and Harvard University that asks a deceptively hard question: can language models rebuild programs from scratch? Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior — with no source code, no decompilation, and no internet access.

## What It Is

ProgramBench is a software engineering benchmark that measures how well AI coding agents can design and implement a program end to end. Unlike most coding benchmarks, which provide method signatures, class skeletons, or product requirement documents, ProgramBench gives agents no structural hints whatsoever. The agent must choose a programming language, design the architecture, write all source code, and produce a build script entirely on its own. A candidate solution counts as passing a task only if it passes every behavioral test for that task.

## Task Design and Scope

The benchmark comprises 200 tasks drawn from real open-source repositories, spanning a wide range of complexity:

- **Small terminal utilities**: tools like `jq`, `ripgrep`, `fzf`, `bat`, and `zoxide`
- **Mid-size projects**: tools like `pandoc`, `typst`, `tree-sitter`, and `DuckDB`
- **Massive software projects**: the PHP interpreter, FFmpeg, SQLite, and GROMACS

The test suite is generated via agent-driven fuzzing and comprises more than 248,000 total behavioral tests across all 200 tasks. All reference executables pass the test suites, confirming the benchmark is solvable by design.
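The listing does not describe the format of an individual behavioral test, so the following is only a minimal sketch of the general idea: run a reference program and a candidate on the same input and compare what can be observed from outside. The `Observation` type, the `behaves_the_same` helper, and the choice to compare only stdout and exit code are assumptions made for illustration, not ProgramBench's actual harness. Python one-liners stand in for the reference binary and the rebuilt program.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one run (hypothetical shape)."""
    stdout: bytes
    exit_code: int


def observe(command: list[str], stdin: bytes = b"") -> Observation:
    """Run a command with the given stdin and record stdout and exit code."""
    result = subprocess.run(command, input=stdin, capture_output=True, timeout=30)
    return Observation(stdout=result.stdout, exit_code=result.returncode)


def behaves_the_same(reference: list[str], candidate: list[str], stdin: bytes = b"") -> bool:
    """True if the candidate matches the reference on this single input."""
    return observe(reference, stdin) == observe(candidate, stdin)


# Stand-ins: two differently written programs that upper-case stdin,
# plus a "broken" candidate that echoes stdin unchanged.
reference_cmd = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
candidate_cmd = [sys.executable, "-c",
                 "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
broken_cmd = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read())"]

matches = behaves_the_same(reference_cmd, candidate_cmd, b"hello\nworld\n")
mismatch = behaves_the_same(reference_cmd, broken_cmd, b"hello\n")
```

Comparing only observable behavior is what lets a candidate written in any language, with any internal structure, be judged against the same reference.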

## Anti-Cheating Architecture

ProgramBench takes substantial precautions to prevent shortcuts. Agents run in sandboxed containers with no internet access and no decompilation tools, and the reference binary carries execute-only permissions. The paper reports that in early trials without these restrictions, models found shortcuts such as cloning source repositories from GitHub or downloading code through package managers. Because the binary can be executed but not read, tools that need to read its bytes, such as `objdump`, `strings`, `hexdump`, or a disassembler, all fail. The benchmark also includes a different-language ablation, which forces models to implement the program in a language other than the original's, to measure and control for memorization effects.
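How the benchmark sets these permissions is not spelled out here. As a rough illustration of what "execute-only" means on a POSIX system, the sketch below strips the read and write bits from a file and checks the resulting mode. The helper names `make_execute_only` and `is_execute_only` are hypothetical, and a placeholder temp file stands in for a compiled binary.

```python
import os
import stat
import tempfile

EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH  # 0o111
READ_WRITE_BITS = 0o666


def make_execute_only(path: str) -> None:
    """Leave only the execute bits set for user, group, and others."""
    os.chmod(path, EXECUTE_BITS)


def is_execute_only(path: str) -> bool:
    """True if the file has execute bits set and no read or write bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & EXECUTE_BITS) != 0 and (mode & READ_WRITE_BITS) == 0


# Placeholder file standing in for a compiled binary (assumption for the demo).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"placeholder bytes standing in for a compiled binary")
    demo_path = handle.name

before = is_execute_only(demo_path)  # temp files start out readable by the owner
make_execute_only(demo_path)
after = is_execute_only(demo_path)
os.remove(demo_path)
```

Mode bits only constrain unprivileged users; on POSIX systems a privileged user can typically read a file regardless of its mode, so a setup like this presumes the agent process runs without such privileges.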

## Leaderboard and Current Scores

The leaderboard is evaluated using mini-SWE-agent, chosen because it is widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and is deliberately minimal in scaffolding. As of the May 11, 2026 update, the top-performing model (GPT 5.5 at xhigh compute) fully resolves only 0.5% of the 200 tasks, with 13.5% "almost resolved" (≥95% of tests passing). Most evaluated models, including Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5 mini, fully resolve 0% of tasks. The benchmark deliberately includes tasks of varying difficulty to distinguish model capability from scaffold design.
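The two leaderboard metrics can be sketched from per-task pass counts. This is an illustration under stated assumptions rather than the official scoring code: the `TaskResult` shape and function names are hypothetical, and the sketch reads "≥95% of tests passing" literally, so a fully resolved task also counts toward the almost-resolved rate (the listing does not say whether the leaderboard does the same). The per-task numbers are synthetic, chosen only to show that 0.5% and 13.5% of 200 tasks correspond to 1 and 27 tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task: str    # hypothetical task identifier
    passed: int  # behavioral tests passed
    total: int   # behavioral tests in the task's suite


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every behavioral test passes."""
    return result.total > 0 and result.passed == result.total


def is_almost_resolved(result: TaskResult) -> bool:
    """At least 95% of tests pass (integer arithmetic avoids float rounding)."""
    return result.total > 0 and result.passed * 100 >= 95 * result.total


def leaderboard_rates(results: list[TaskResult]) -> dict[str, float]:
    """Percentage of tasks in each category."""
    count = len(results)
    return {
        "resolved": 100 * sum(is_resolved(r) for r in results) / count,
        "almost_resolved": 100 * sum(is_almost_resolved(r) for r in results) / count,
    }


# Synthetic data: 1 task fully passing, 26 tasks at 96%, 173 tasks at 50%.
synthetic = (
    [TaskResult("task-000", 100, 100)]
    + [TaskResult(f"task-{i:03d}", 96, 100) for i in range(1, 27)]
    + [TaskResult(f"task-{i:03d}", 50, 100) for i in range(27, 200)]
)
rates = leaderboard_rates(synthetic)
print(rates)
```

The all-or-nothing resolved criterion explains why the two numbers can differ so sharply: a single failing test out of thousands keeps a task out of the resolved column.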

## Update: v1.0.2

The project reached v1.0.2 on May 11, 2026, shortly after its initial release on May 3, 2026. The accompanying paper (arXiv:2605.03546) by John Yang, Kilian Lieret, and co-authors from Meta Superintelligence Labs, Stanford, and Harvard provides detailed ablations on inference settings, cheating prevention, and metric design. A public submission portal for the leaderboard is listed as coming soon. The repository is licensed under the MIT License and hosted under the `facebookresearch` GitHub organization.

## Why It Matters

ProgramBench targets a capability gap that prior benchmarks abstract away: free-form software architecture. Rather than filling in blanks, agents must make every design decision — what abstractions to introduce, how to decompose functionality across modules, and what interfaces to expose. The benchmark's authors argue that headline scores from harness-tuned, curated task sets can substantially overstate real agent capability, and deliberately avoid per-task harness tuning to provide a more honest signal. The extremely low current scores are presented as evidence of inadequate model capabilities rather than benchmark design flaws.

## Features
- 200 real-world program reconstruction tasks
- 248,000+ behavioral tests via agent-driven fuzzing
- Sandboxed execution with no internet access
- No decompilation allowed (execute-only binary permissions)
- Public leaderboard with resolved and almost-resolved metrics
- Extended results with per-task and per-model breakdowns
- Different-language ablation to control for memorization
- Installable via pip or uvx
- HuggingFace dataset of test cases
- Tasks range from small CLI tools to massive compilers

## Integrations
mini-SWE-agent, HuggingFace Datasets, uv / uvx, pip, OpenAI GPT models, Anthropic Claude models, Google Gemini models

## Platforms
API, VS Code extension, CLI

## Pricing
Open Source

## Version
v1.0.2

## Links
- Website: https://programbench.com
- Documentation: https://github.com/facebookresearch/ProgramBench/blob/main/docs/README.md
- Repository: https://github.com/facebookresearch/ProgramBench
- EveryDev.ai: https://www.everydev.ai/tools/programbench
