ProgramBench
A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.
About ProgramBench
ProgramBench is an open-source benchmark from Meta Superintelligence Labs, Stanford University, and Harvard University that asks a deceptively simple question: can language models rebuild programs from scratch? Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior, with no source code, no decompilation, and no internet access.
What It Is
ProgramBench is a software engineering evaluation benchmark designed to measure the full-stack architectural and implementation capabilities of AI coding agents. Unlike most coding benchmarks that provide method signatures, class skeletons, or product requirement documents, ProgramBench gives agents no structural hints whatsoever. The agent must choose a programming language, design the architecture, write all source code, and produce a build script entirely on its own. A candidate solution passes only if it clears all behavioral tests for a given task.
Task Design and Scope
The benchmark comprises 200 tasks drawn from real open-source repositories, spanning a wide range of complexity:
- Small terminal utilities: tools like jq, ripgrep, fzf, bat, and zoxide
- Mid-size projects: tools like pandoc, typst, tree-sitter, and DuckDB
- Massive software projects: the PHP compiler, FFmpeg, SQLite, and GROMACS
The test suite is generated via agent-driven fuzzing and comprises more than 248,000 total behavioral tests across all 200 tasks. All reference executables pass the test suites, confirming the benchmark is solvable by design.
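A behavioral test of this kind can be pictured as running the reference executable and a candidate rebuild on the same input and comparing what each one does. The sketch below is only an illustration under stated assumptions: the source does not say which observable outputs the real tests compare, so the choice of exit code plus standard output is a guess, and the helper names `observe` and `behaves_the_same` are hypothetical. Python one-liners stand in for the two programs.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    """Externally visible behavior of one program run (assumed fields)."""
    exit_code: int
    stdout: bytes


def observe(cmd: Sequence[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command with the given stdin and record exit code and stdout."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.returncode, proc.stdout)


def behaves_the_same(reference_cmd: Sequence[str],
                     candidate_cmd: Sequence[str],
                     stdin: bytes = b"") -> bool:
    """True if both commands produce the same observation for this input."""
    return observe(reference_cmd, stdin) == observe(candidate_cmd, stdin)


if __name__ == "__main__":
    # Stand-ins for a reference binary and a rebuilt candidate (hypothetical).
    reference = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(behaves_the_same(reference, candidate, b"hello\n"))
```

Comparing only externally observable behavior is what allows a candidate to be written in any language with any internal architecture, since nothing about the source layout enters the check.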
Anti-Cheating Architecture
ProgramBench takes substantial precautions to prevent shortcuts. Agents run in sandboxed containers with no internet access and no decompilation tools. The binary is granted execute-only permissions, so operations that need to read its bytes (objdump, strings, hexdump, or running a disassembler) all fail. The paper reports that in early trials without these restrictions, models found shortcuts such as cloning source repositories from GitHub or downloading code through package managers. The benchmark also includes a different-language ablation, which forces models to implement the program in a different language from the original, to measure and control for memorization effects.
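The execute-only idea relies on a general POSIX mechanism: a native binary whose mode is `--x--x--x` can still be run, but an unprivileged user cannot open it for reading, which is what tools that inspect its bytes need to do. The sketch below illustrates that mechanism only; how ProgramBench itself sets up permissions inside its containers is not described in the source, and the file name and helper functions are hypothetical. Note that a root user bypasses the read check, so `can_read` may report `True` when run as root.

```python
import os
import stat
import tempfile


def make_execute_only(path: str) -> int:
    """Set mode --x--x--x on path and return the resulting permission bits."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


def can_read(path: str) -> bool:
    """Report whether the current user can open the file for reading."""
    try:
        with open(path, "rb"):
            return True
    except PermissionError:
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder file standing in for a reference binary (hypothetical).
        target = os.path.join(tmp, "reference_binary")
        with open(target, "wb") as fh:
            fh.write(b"placeholder bytes")
        mode = make_execute_only(target)
        print(oct(mode), can_read(target))
```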
Leaderboard and Current Scores
The leaderboard is evaluated using mini-SWE-agent, chosen because it is widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in scaffolding. As of the May 11, 2026 update, the top-performing model (GPT 5.5 at xhigh compute) achieves only 0.5% fully resolved instances across 200 tasks, with 13.5% "almost resolved" (≥95% of tests passing). Most evaluated models — including Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5 mini — score 0% on fully resolved instances. The benchmark deliberately includes tasks of varying difficulty to distinguish model capability from scaffold design.
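With 200 tasks, 0.5% corresponds to a single fully resolved task and 13.5% to 27 tasks. The two leaderboard metrics can be sketched as below, assuming per-task counts of passed and total tests are available. The function name, the input shape, and the decision to count fully resolved tasks inside the "almost resolved" bucket are assumptions made for illustration; the source states only the all-tests-pass rule and the ≥95% threshold.

```python
from typing import Dict, Tuple


def summarize(results: Dict[str, Tuple[int, int]],
              almost_threshold: float = 0.95) -> Dict[str, float]:
    """Compute resolved and almost-resolved percentages.

    results maps a task name to (tests_passed, tests_total).
    A task is resolved only if every test passes. It is almost resolved
    if its pass rate is at least almost_threshold (assumption: this
    bucket also includes fully resolved tasks).
    """
    if not results:
        raise ValueError("results must contain at least one task")
    resolved = 0
    almost = 0
    for passed, total in results.values():
        if total <= 0:
            continue  # a task without tests cannot count as resolved
        if passed == total:
            resolved += 1
        if passed / total >= almost_threshold:
            almost += 1
    n = len(results)
    return {
        "resolved_pct": 100.0 * resolved / n,
        "almost_resolved_pct": 100.0 * almost / n,
    }


if __name__ == "__main__":
    # Made-up counts for four hypothetical tasks, not real leaderboard data.
    demo = {
        "task_a": (10, 10),
        "task_b": (96, 100),
        "task_c": (94, 100),
        "task_d": (0, 5),
    }
    print(summarize(demo))
```

The strict all-tests rule explains the gap between the two numbers: a rebuild that passes 99% of a task's tests still counts as unresolved.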
Update: v1.0.2
The project reached v1.0.2 on May 11, 2026, shortly after its initial release on May 3, 2026. The accompanying paper (arXiv:2605.03546) by John Yang, Kilian Lieret, and co-authors from Meta Superintelligence Labs, Stanford, and Harvard provides detailed ablations on inference settings, cheating prevention, and metric design. A public submission portal for the leaderboard is listed as coming soon. The repository is licensed under the MIT License and hosted under the facebookresearch GitHub organization.
Why It Matters
ProgramBench targets a capability gap that prior benchmarks abstract away: free-form software architecture. Rather than filling in blanks, agents must make every design decision: what abstractions to introduce, how to decompose functionality across modules, and what interfaces to expose. The benchmark's authors argue that headline scores from harness-tuned, curated task sets can substantially overstate real agent capability, and they deliberately avoid per-task harness tuning to provide a more honest signal. The authors present the extremely low current scores as evidence of inadequate model capabilities rather than of benchmark design flaws.
Pricing
Open Source
Fully free and open-source under the MIT License. Install via pip or uvx.
- 200 program reconstruction tasks
- 248,000+ behavioral tests
- Public leaderboard access
- HuggingFace dataset access
- pip and uvx installation
Capabilities
Key Features
- 200 real-world program reconstruction tasks
- 248,000+ behavioral tests via agent-driven fuzzing
- Sandboxed execution with no internet access
- No decompilation allowed (execute-only binary permissions)
- Public leaderboard with resolved and almost-resolved metrics
- Extended results with per-task and per-model breakdowns
- Different-language ablation to control for memorization
- Installable via pip or uvx
- HuggingFace dataset of test cases
- Tasks range from small CLI tools to massive compilers
