SWE-bench

Name: SWE-bench
Availability: OnlineOnly
Author: SWE-bench

A benchmark for evaluating large language models on real-world GitHub issues, tasking models to generate patches that resolve described software problems.

At a Glance

Pricing

Open Source

Fully free and open-source under the MIT License. Use, modify, and distribute freely.

Available On

macOS

Linux

API

SDK

CLI

SWE-bench | Princeton, NJ | Est. 2023

Listed May 2026

About SWE-bench

SWE-bench is an open-source benchmark created by researchers at Princeton and Stanford to measure how well large language models resolve real-world software engineering issues collected from GitHub. Given a codebase and an issue description, a language model must generate a patch that fixes the problem, which makes it one of the more concrete and reproducible evaluations of AI coding capability available. The SWE-bench paper was accepted as an oral presentation at ICLR 2024, and the project has since expanded into a family of related benchmarks and tools.

What It Is

SWE-bench frames software engineering as a concrete task: given a repository and a GitHub issue, can a model produce a working patch? The benchmark draws on real issues filed against popular Python projects, which makes it substantially harder than synthetic coding tasks. The evaluation harness runs each candidate patch inside a Docker container to verify correctness in a reproducible environment. The leaderboard at swebench.com tracks the % Resolved score, the share of instances a submission resolves, across hundreds of model and agent combinations.
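The resolved-percentage metric can be illustrated with a few lines of Python. This is a minimal sketch, not the harness's own scoring code: the function name, the shape of the results mapping, and the instance IDs are all assumptions made for illustration.

```python
from typing import Mapping


def resolved_percentage(results: Mapping[str, bool], total_instances: int) -> float:
    """Return the share of benchmark instances resolved, as a percentage.

    `results` maps an instance ID to True when the candidate patch passed
    evaluation. Instances missing from `results` (for example, because no
    patch was submitted) count as unresolved.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if len(results) > total_instances:
        raise ValueError("more results than instances in the benchmark")
    resolved = sum(1 for passed in results.values() if passed)
    return 100.0 * resolved / total_instances


# Hypothetical outcomes for a 4-instance toy split (IDs are made up).
toy_results = {
    "example__repo-101": True,
    "example__repo-102": False,
    "example__repo-103": True,
}
print(f"{resolved_percentage(toy_results, total_instances=4):.1f}% resolved")
```

Dividing by the size of the full split rather than by the number of submitted patches means a skipped instance lowers the score, which is the stricter reading of "resolved" assumed here.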

Benchmark Variants

The SWE-bench family has grown to cover several evaluation scenarios:

SWE-bench Full: the original 2,294-instance test set of real GitHub issues
SWE-bench Lite: a curated 300-instance subset designed for less costly evaluation
SWE-bench Verified: 500 instances confirmed solvable by real software engineers, developed in collaboration with OpenAI Preparedness
SWE-bench Multimodal: 517 instances that include visual elements such as screenshots and diagrams; the accompanying paper was accepted at ICLR 2025
SWE-bench Multilingual: 300 tasks spanning 9 programming languages
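Since the variants differ mainly in size and scope, a small lookup can help pick one that fits an evaluation budget. The sketch below only restates the instance counts quoted in the list above; the dictionary and helper names are hypothetical and not part of the swebench package.

```python
# Instance counts as quoted in the list above.
VARIANT_SIZES = {
    "SWE-bench Full": 2294,
    "SWE-bench Lite": 300,
    "SWE-bench Verified": 500,
    "SWE-bench Multimodal": 517,
    "SWE-bench Multilingual": 300,
}


def variants_within_budget(max_instances: int) -> list[str]:
    """Return variant names whose instance count fits the budget, smallest first."""
    fitting = [(size, name) for name, size in VARIANT_SIZES.items() if size <= max_instances]
    # Sorting the (size, name) pairs orders by size, then alphabetically.
    return [name for _, name in sorted(fitting)]


print(variants_within_budget(500))
```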

Architecture and Evaluation Setup

Evaluation runs entirely inside Docker containers; the project switched to this containerized setup in June 2024 to make results reproducible. The recommended hardware is an x86_64 machine with at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores. Cloud-based evaluation is also supported via Modal or the companion sb-cli tool, which runs evaluations automatically on AWS. The Python package is installable with pip as swebench, and the datasets are hosted on Hugging Face under the princeton-nlp and SWE-bench organizations.
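Before starting a local run, it may be worth checking a machine against the recommended minimums quoted above. The following is a rough sketch using only the Python standard library; the function names are hypothetical, and the RAM check is left out because the standard library has no portable way to read total memory.

```python
import os
import shutil

# Recommended minimums as quoted in the text above.
MIN_FREE_DISK_GB = 120
MIN_CPU_CORES = 8


def preflight_problems(free_disk_gb: float, cpu_cores: int) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.0f} GB < {MIN_FREE_DISK_GB} GB")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"{cpu_cores} CPU cores < {MIN_CPU_CORES}")
    return problems


def check_this_machine(path: str = ".") -> list[str]:
    """Measure free disk space at `path` and the CPU count, then run the checks."""
    free_gb = shutil.disk_usage(path).free / 1e9
    cores = os.cpu_count() or 0  # cpu_count() can return None
    return preflight_problems(free_gb, cores)


if __name__ == "__main__":
    for line in check_this_machine() or ["all checks passed"]:
        print(line)
```

Keeping the thresholds in a pure function separate from the measurement makes the check easy to test with made-up numbers.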

Companion Models and Datasets

The repository ships pre-processed retrieval datasets (BM25 at 13K, 27K, 40K, and 50K token budgets) and fine-tuned SWE-Llama models (7B and 13B, with and without PEFT adapters) to support research into both inference and training. The related SWE-smith toolkit, announced in May 2025, provides a dedicated pipeline for generating synthetic software engineering training data. It was used to train SWE-agent-LM-32B, which the project page describes as the open-weight state of the art on SWE-bench Verified as of April 2025.
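To make "BM25 at a token budget" concrete, here is a toy sketch of the idea: score files against the issue text with Okapi BM25, then keep the best-scoring files until the budget is spent. It is not the project's retrieval pipeline. The regex tokenizer, the parameter defaults, the greedy packing rule, and the sample files are all assumptions for illustration, and a real setup would count tokens with the target model's tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase identifier-style tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return {}
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores[name] = score
    return scores


def pack_context(query: str, docs: dict[str, str], token_budget: int) -> list[str]:
    """Greedily keep the highest-scoring matching documents that fit the budget."""
    scores = bm25_scores(query, docs)
    ranked = sorted(docs, key=lambda name: scores[name], reverse=True)
    chosen, used = [], 0
    for name in ranked:
        if scores[name] <= 0:
            break  # remaining documents share no terms with the query
        cost = len(tokenize(docs[name]))
        if used + cost <= token_budget:
            chosen.append(name)
            used += cost
    return chosen


# Made-up repository files and issue text.
toy_docs = {
    "parser.py": "def parse_date(value): return datetime.strptime(value, FORMAT)",
    "utils.py": "def slugify(text): return text.lower().replace(' ', '-')",
    "README.md": "project overview and install instructions",
}
toy_query = "parse_date raises ValueError on ISO strings"
print(pack_context(toy_query, toy_docs, token_budget=14))
```

A larger budget lets more files into the prompt at the cost of longer inputs, which is the trade-off the different budget sizes let researchers explore.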

Update: Multimodal Integration and Leaderboard Activity (2025)

In January 2025, SWE-bench Multimodal was integrated into the main repository, with test-split evaluation kept private and submissions routed through sb-cli. The leaderboard is actively updated; as of early 2026, the top entries on SWE-bench Verified exceed 76% resolved, and entries from Anthropic, Google, OpenAI, DeepSeek, and open-weight models are all represented. The project acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.

Pricing

Open Source

Fully free and open-source under the MIT License. Use, modify, and distribute freely.

MIT License
Full benchmark datasets on Hugging Face
Docker-based evaluation harness
SWE-bench Lite, Verified, Multimodal, Multilingual variants
Pre-processed retrieval datasets

Capabilities

Key Features

Real-world GitHub issue benchmark
Docker-based reproducible evaluation harness
SWE-bench Verified (500 human-confirmed solvable instances)
SWE-bench Lite (300-instance subset for cost-efficient evaluation)
SWE-bench Multimodal (visual software engineering tasks)
SWE-bench Multilingual (9 programming languages)
Public leaderboard with % Resolved metric
Cloud evaluation via Modal and sb-cli (AWS)
Pre-processed BM25 retrieval datasets
Fine-tuned SWE-Llama 7B and 13B models
Hugging Face dataset integration
Custom data collection pipeline for new repositories
Inference support for local and API-based models

Integrations

Docker

HuggingFace Datasets

Modal

AWS

GitHub

OpenAI API

Anthropic API

BM25 retrieval

API Available


Topics

LLM Evaluations, Automated Testing, AI Coding Assistants

Alternatives

Artificial Analysis, Toolathlon, LLM Stats
