ProgramBench

Name: ProgramBench
Availability: OnlineOnly
Author: Meta AI

A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

At a Glance

Pricing: Open Source. Fully free and open-source under the MIT License. Install via pip or uvx.
Available On: API, VS Code, CLI

Meta AI, Menlo Park, Est. 2004, $2.3B raised

Updated Jun 2026

About ProgramBench

ProgramBench is an open-source benchmark from Meta Superintelligence Labs, Stanford University, and Harvard University that asks a deceptively hard question: can language models rebuild programs from scratch? Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior — with no source code, no decompilation, and no internet access.

What It Is

ProgramBench is a software engineering evaluation benchmark designed to measure the full-stack architectural and implementation capabilities of AI coding agents. Unlike most coding benchmarks that provide method signatures, class skeletons, or product requirement documents, ProgramBench gives agents no structural hints whatsoever. The agent must choose a programming language, design the architecture, write all source code, and produce a build script entirely on its own. A candidate solution passes only if it clears all behavioral tests for a given task.

Task Design and Scope

The benchmark comprises 200 tasks drawn from real open-source repositories, spanning a wide range of complexity:

Small terminal utilities: tools like jq, ripgrep, fzf, bat, and zoxide
Mid-size projects: tools like pandoc, typst, tree-sitter, and DuckDB
Massive software projects: the PHP interpreter, FFmpeg, SQLite, and GROMACS

The test suite is generated via agent-driven fuzzing and comprises more than 248,000 total behavioral tests across all 200 tasks. All reference executables pass the test suites, confirming the benchmark is solvable by design.
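To make "behavioral test" concrete, here is a minimal Python sketch of one way such a check could work: run the reference executable and a candidate on the same input, then compare only what is observable from outside. The names `Observation`, `observe`, and `behaves_the_same`, the choice to compare exit code and stdout, and the timeout are all assumptions for illustration. The source does not describe ProgramBench's actual test format or comparison rules.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one program run (hypothetical shape)."""
    exit_code: int
    stdout: bytes
    stderr: bytes


def observe(command: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and capture its exit code, stdout, and stderr."""
    proc = subprocess.run(command, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.returncode, proc.stdout, proc.stderr)


def behaves_the_same(reference: list[str], candidate: list[str],
                     args: list[str], stdin: bytes = b"") -> bool:
    """Assumed pass criterion: same exit code and same stdout for one input."""
    ref = observe(reference + args, stdin)
    cand = observe(candidate + args, stdin)
    return (ref.exit_code, ref.stdout) == (cand.exit_code, cand.stdout)


if __name__ == "__main__":
    # Stand-ins for a reference binary and a rebuilt candidate: two
    # differently written programs that both upper-case their input.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]
    print(behaves_the_same(reference, candidate, [], b"hello\n"))
```

The two stand-in programs are implemented differently but are indistinguishable on this input, which is the point of a black-box test: only behavior is compared, never the code that produces it.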

Anti-Cheating Architecture

ProgramBench takes substantial precautions to prevent shortcuts. Agents run in sandboxed containers with no internet access and no decompilation tools. The binary itself carries execute-only permissions, so anything that needs to read its bytes (objdump, strings, hexdump, or a disassembler) fails. The paper reports that in early trials without these restrictions, models took shortcuts such as cloning source repositories from GitHub or downloading code through package managers. The benchmark also includes a different-language ablation, which forces models to implement in a language other than the original's, to measure and control for memorization effects.
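The execute-only idea can be illustrated with ordinary POSIX permission bits. The sketch below is an assumption about how such a restriction could be set up and checked, not a description of ProgramBench's sandbox; the helper names are hypothetical.

```python
import os
import stat
import tempfile

# Permission bits --x--x--x (octal 0o111): execute allowed, read and write denied.
EXECUTE_ONLY = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def make_execute_only(path: str) -> None:
    """Drop the read and write bits, keeping only the execute bits."""
    os.chmod(path, EXECUTE_ONLY)


def is_execute_only(path: str) -> bool:
    """True when the file's permission bits are exactly --x--x--x."""
    return stat.S_IMODE(os.stat(path).st_mode) == EXECUTE_ONLY


def can_read(path: str) -> bool:
    """Try to open the file for reading, as strings or hexdump would need to."""
    try:
        with open(path, "rb"):
            return True
    except PermissionError:
        return False


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    make_execute_only(path)
    print("execute-only:", is_execute_only(path))
    print("readable by this user:", can_read(path))
    os.remove(path)
```

Two caveats apply to any scheme built on mode bits alone. A root user bypasses the read check, and a file's owner can simply chmod the bits back, so the agent would have to run as an unprivileged user that does not own the binary. The approach also only suits compiled executables: a script needs to be readable by its interpreter.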

Leaderboard and Current Scores

The leaderboard is evaluated using mini-SWE-agent, chosen because it is widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in scaffolding. As of the May 11, 2026 update, the top-performing model (GPT 5.5 at xhigh compute) fully resolves only 0.5% of the 200 tasks, which is a single task, with 13.5% "almost resolved" (≥95% of tests passing). Most evaluated models, including Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5 mini, score 0% on fully resolved instances. The benchmark deliberately includes tasks of varying difficulty to distinguish model capability from scaffold design.
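The two leaderboard metrics follow directly from per-task pass counts. The Python sketch below shows the arithmetic under stated assumptions: `TaskResult` and `summarize` are hypothetical names, the four results are synthetic rather than leaderboard data, and "almost resolved" is counted inclusively (any task at or above 95%, fully resolved tasks included), which the source does not confirm. For scale, 0.5% of 200 tasks is 1 task and 13.5% is 27 tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Per-task outcome: how many behavioral tests the candidate passed."""
    task: str
    passed: int
    total: int


def summarize(results: list[TaskResult], almost_threshold_pct: int = 95) -> dict[str, float]:
    """Return resolved and almost-resolved rates as percentages of all tasks."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    # Resolved: every behavioral test for the task passes.
    resolved = sum(r.passed == r.total for r in results)
    # Almost resolved: at least the threshold share of tests passes.
    # Integer arithmetic avoids floating-point edge cases at exactly 95%.
    almost = sum(r.passed * 100 >= almost_threshold_pct * r.total for r in results)
    return {
        "resolved_pct": 100 * resolved / n,
        "almost_resolved_pct": 100 * almost / n,
    }


if __name__ == "__main__":
    # Synthetic results for illustration only.
    demo = [
        TaskResult("task-a", 10, 10),
        TaskResult("task-b", 96, 100),
        TaskResult("task-c", 94, 100),
        TaskResult("task-d", 0, 5),
    ]
    print(summarize(demo))
```

The all-or-nothing resolved criterion is what makes the headline number so strict: a candidate passing 99.9% of a task's tests still counts as unresolved.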

Update: v1.0.2

The project reached v1.0.2 on May 11, 2026, shortly after its initial release on May 3, 2026. The accompanying paper (arXiv:2605.03546) by John Yang, Kilian Lieret, and co-authors from Meta Superintelligence Labs, Stanford, and Harvard provides detailed ablations on inference settings, cheating prevention, and metric design. A public submission portal for the leaderboard is listed as coming soon. The repository is licensed under the MIT License and hosted under the facebookresearch GitHub organization.

Why It Matters

ProgramBench targets a capability gap that prior benchmarks abstract away: free-form software architecture. Rather than filling in blanks, agents must make every design decision — what abstractions to introduce, how to decompose functionality across modules, and what interfaces to expose. The benchmark's authors argue that headline scores from harness-tuned, curated task sets can substantially overstate real agent capability, and deliberately avoid per-task harness tuning to provide a more honest signal. The extremely low current scores are presented as evidence of inadequate model capabilities rather than benchmark design flaws.

Pricing

Open Source: fully free under the MIT License. The release includes:

200 program reconstruction tasks
248,000+ behavioral tests
Public leaderboard access
HuggingFace dataset access
pip and uvx installation

Capabilities

Key Features

200 real-world program reconstruction tasks
248,000+ behavioral tests via agent-driven fuzzing
Sandboxed execution with no internet access
No decompilation allowed (execute-only binary permissions)
Public leaderboard with resolved and almost-resolved metrics
Extended results with per-task and per-model breakdowns
Different-language ablation to control for memorization
Installable via pip or uvx
HuggingFace dataset of test cases
Tasks range from small CLI tools to massive compilers

Integrations

mini-SWE-agent
HuggingFace Datasets
uv / uvx
pip
OpenAI GPT models
Anthropic Claude models
Google Gemini models

