SWE-bench
A benchmark for evaluating large language models on real-world GitHub issues: models must generate patches that resolve the software problems the issues describe.
At a Glance
Fully free and open-source under the MIT License. Use, modify, and distribute freely.
Listed May 2026
About SWE-bench
SWE-bench is an open-source benchmark created by researchers at Princeton and Stanford to measure how well large language models can resolve real-world software engineering issues collected from GitHub. Given a codebase and an issue description, a language model must generate a patch that fixes the problem, which makes it one of the most concrete and reproducible evaluations of AI coding capability available. The project was accepted as an oral presentation at ICLR 2024 and has since expanded into a family of related benchmarks and tools.
What It Is
SWE-bench frames software engineering as a task: given a repository and a GitHub issue, can a model produce a working patch? The benchmark draws from real issues filed against popular Python projects, making it substantially harder than synthetic coding tasks. The evaluation harness runs candidate patches inside Docker containers to verify correctness in a reproducible environment. The leaderboard at swebench.com tracks resolved-percentage scores across hundreds of model and agent combinations.
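As a rough illustration of the task's inputs and outputs, the sketch below builds one prediction record and computes a resolved percentage. The three record keys (instance_id, model_name_or_path, model_patch) follow the JSONL predictions format the swebench harness reads; the helper names, the placeholder instance ID, and the choice to divide by the full split size are assumptions made for this example, not part of the project's code.

```python
import json


def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one prediction record with the keys the harness expects."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,  # unified diff produced by the model
    }


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def percent_resolved(resolved_ids, total_instances: int) -> float:
    """Percentage of instances resolved.

    Assumption: the denominator is every instance in the split, so an
    instance with no submitted patch counts as unresolved.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    return 100.0 * len(set(resolved_ids)) / total_instances


if __name__ == "__main__":
    # "example__repo-1" is a placeholder ID, not a real benchmark instance.
    record = make_prediction("example__repo-1", "my-model", "diff --git a/x b/x\n")
    print(to_jsonl([record]), end="")
    print(percent_resolved({"example__repo-1"}, total_instances=300))
```

A file written this way would be passed to the evaluation harness as the predictions path; the resolved set itself comes from the harness's report, not from this helper.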
Benchmark Variants
The SWE-bench family has grown to cover several evaluation scenarios:
- SWE-bench Full: the original test set of 2,294 real GitHub issues
- SWE-bench Lite: a curated 300-instance subset designed for less costly evaluation
- SWE-bench Verified: 500 instances confirmed solvable by human software engineers, developed in collaboration with OpenAI Preparedness
- SWE-bench Multimodal: 517 instances that include visual elements such as screenshots and diagrams, accepted at ICLR 2025
- SWE-bench Multilingual: 300 tasks spanning 9 programming languages
Architecture and Evaluation Setup
Evaluation runs entirely inside Docker containers, which the project switched to in June 2024 for reproducibility. The recommended hardware is an x86_64 machine with at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores. Cloud-based evaluation is also supported via Modal or the companion sb-cli tool that runs evaluations automatically on AWS. The Python package is installable via pip (swebench) and the datasets are hosted on Hugging Face under the princeton-nlp and SWE-bench organizations.
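The hardware recommendation above can be turned into a quick preflight check before a long evaluation run. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the thresholds are the figures quoted above, and the RAM probe relies on os.sysconf, which is not available on every platform (the check is skipped when it is missing).

```python
import os
import platform
import shutil

# Recommended minimums quoted in the text above.
MIN_FREE_DISK_GB = 120
MIN_RAM_GB = 16
MIN_CPU_CORES = 8


def find_shortfalls(arch, free_disk_gb, ram_gb, cpu_cores):
    """Return a list of human-readable problems; empty means the host looks fine."""
    problems = []
    if arch.lower() not in ("x86_64", "amd64"):
        problems.append(f"architecture is {arch}, expected x86_64")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_disk_gb:.0f} GB free disk, want {MIN_FREE_DISK_GB}")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"only {ram_gb:.0f} GB RAM, want {MIN_RAM_GB}")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"only {cpu_cores} CPU cores, want {MIN_CPU_CORES}")
    return problems


def probe_host(path="."):
    """Collect the host's figures; ram_gb is None when it cannot be determined."""
    try:
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # e.g. Windows, where os.sysconf does not exist
    return {
        "arch": platform.machine(),
        "free_disk_gb": shutil.disk_usage(path).free / 1e9,
        "ram_gb": ram_gb,
        "cpu_cores": os.cpu_count() or 0,
    }


if __name__ == "__main__":
    issues = find_shortfalls(**probe_host())
    for line in issues or ["host meets the recommended minimums"]:
        print(line)
```

Free disk space is measured on the filesystem that holds the given path, so the check is only meaningful if that is the filesystem where Docker stores its images.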
Companion Models and Datasets
The repository ships pre-processed retrieval datasets (BM25 at 13K, 27K, 40K, and 50K token budgets) and fine-tuned SWE-Llama models (7B and 13B, with and without PEFT adapters) to support research into both inference and training. The related SWE-smith toolkit, announced in May 2025, provides a dedicated pipeline for generating synthetic software engineering training data and was used to train SWE-agent-LM-32B, which the project page describes as the open-weight state-of-the-art on SWE-bench Verified as of April 2025.
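To make the idea of BM25 retrieval under a token budget concrete, here is a small self-contained sketch: it scores files against an issue text with a textbook BM25 formula, then keeps the best-ranked files until the budget is used up. This is an illustration only, not the project's retrieval code. Whitespace splitting stands in for a real model tokenizer, the k1 and b values are common defaults, and stopping at the first file that does not fit is an assumed packing rule.

```python
import math
from collections import Counter


def tokenize(text):
    """Crude whitespace tokenizer; a stand-in for a real model tokenizer."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (name -> text) against `query` with BM25."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return {}
    avg_len = (sum(len(toks) for toks in tokenized.values()) / n_docs) or 1.0
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))  # count each term once per document
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[name] = score
    return scores


def select_within_budget(query, docs, budget_tokens):
    """Keep top-ranked documents, in rank order, until the token budget is hit."""
    scores = bm25_scores(query, docs)
    ranked = sorted(docs, key=lambda name: scores[name], reverse=True)
    chosen, used = [], 0
    for name in ranked:
        cost = len(tokenize(docs[name]))
        # Stop at the first irrelevant file or the first file that overflows.
        if scores[name] <= 0 or used + cost > budget_tokens:
            break
        chosen.append(name)
        used += cost
    return chosen


if __name__ == "__main__":
    files = {
        "a.py": "parse date string timezone",
        "b.py": "render html template",
        "c.py": "date parse helper utilities for timezone conversion and formatting",
    }
    print(select_within_budget("parse timezone date", files, budget_tokens=13))
```

A larger budget admits more lower-ranked files, which is the trade-off the 13K to 50K dataset variants expose: more context for the model, at the cost of longer prompts.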
Update: Multimodal Integration and Leaderboard Activity (2025-2026)
In January 2025, SWE-bench Multimodal was integrated into the main repository; its test split is kept private, and submissions are evaluated through sb-cli. The leaderboard is actively updated: as of early 2026, the top entries on SWE-bench Verified exceed 76% resolved, and submissions from Anthropic, Google, OpenAI, DeepSeek, and open-weight models are all represented. The project acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
Pricing
Open Source
- MIT License
- Full benchmark datasets on Hugging Face
- Docker-based evaluation harness
- SWE-bench Lite, Verified, Multimodal, and Multilingual variants
- Pre-processed retrieval datasets
Capabilities
Key Features
- Real-world GitHub issue benchmark
- Docker-based reproducible evaluation harness
- SWE-bench Verified (500 human-confirmed solvable instances)
- SWE-bench Lite (300-instance subset for cost-efficient evaluation)
- SWE-bench Multimodal (visual software engineering tasks)
- SWE-bench Multilingual (9 programming languages)
- Public leaderboard with % Resolved metric
- Cloud evaluation via Modal and sb-cli (AWS)
- Pre-processed BM25 retrieval datasets
- Fine-tuned SWE-Llama 7B and 13B models
- Hugging Face dataset integration
- Custom data collection pipeline for new repositories
- Inference support for local and API-based models
