    SWE-bench

    LLM Evaluations

    A benchmark for evaluating large language models on real-world GitHub issues, tasking models to generate patches that resolve described software problems.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT License. Use, modify, and distribute freely.

    Available On

    macOS
    Linux
    API
    SDK
    CLI

    Topics

    LLM Evaluations, Automated Testing, AI Coding Assistants

    Alternatives

    mdarena, ProgramBench, Toolathlon

    Developer

    SWE-bench (Princeton, NJ; est. 2023)

    Listed May 2026

    About SWE-bench

    SWE-bench is an open-source benchmark created by researchers at Princeton and Stanford to measure how well large language models can resolve real-world software engineering issues collected from GitHub. Given a codebase and an issue description, a language model must generate a patch that fixes the problem, which makes it one of the most concrete and reproducible evaluations of AI coding capability available. The accompanying paper was accepted as an oral presentation at ICLR 2024, and the project has since expanded into a family of related benchmarks and tools.

    What It Is

    SWE-bench frames software engineering as a task: given a repository and a GitHub issue, can a model produce a working patch? The benchmark draws from real issues filed against popular Python projects, making it substantially harder than synthetic coding tasks. The evaluation harness runs candidate patches inside Docker containers to verify correctness in a reproducible environment. The leaderboard at swebench.com tracks resolved-percentage scores across hundreds of model and agent combinations.
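
    The resolved-percentage score mentioned above can be sketched as follows. This is a minimal illustration, not the project's harness code: the `InstanceResult` record, the instance IDs, and the choice to count unsubmitted instances as unresolved are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of evaluating one candidate patch (hypothetical record shape)."""
    instance_id: str
    resolved: bool  # True if the patched repository passed the instance's tests


def percent_resolved(results: list[InstanceResult], total_instances: int) -> float:
    """Return resolved instances as a percentage of the whole test set.

    Assumption: instances with no submitted patch count as unresolved, so the
    denominator is the size of the test set, not the number of results.
    """
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    resolved = sum(1 for r in results if r.resolved)
    return 100.0 * resolved / total_instances


# Hypothetical results for a four-instance test set; one instance has no patch.
example_results = [
    InstanceResult("example__repo-1", True),
    InstanceResult("example__repo-2", False),
    InstanceResult("example__repo-3", True),
]
example_score = percent_resolved(example_results, total_instances=4)
```

    Dividing by the full test-set size rather than by the number of submitted patches keeps scores comparable between systems that attempt different numbers of instances.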

    Benchmark Variants

    The SWE-bench family has grown to cover several evaluation scenarios:

    • SWE-bench Full: the original test set of 2,294 real GitHub issues
    • SWE-bench Lite: a curated 300-instance subset designed for cheaper evaluation
    • SWE-bench Verified: 500 instances that human software engineers confirmed to be solvable, developed in collaboration with OpenAI Preparedness
    • SWE-bench Multimodal: 517 instances that include visual elements such as screenshots and diagrams; the accompanying paper was accepted at ICLR 2025
    • SWE-bench Multilingual: 300 tasks spanning 9 programming languages

    Architecture and Evaluation Setup

    Evaluation runs entirely inside Docker containers, which the project switched to in June 2024 for reproducibility. The recommended hardware is an x86_64 machine with at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores. Cloud-based evaluation is also supported via Modal or the companion sb-cli tool that runs evaluations automatically on AWS. The Python package is installable via pip (swebench) and the datasets are hosted on Hugging Face under the princeton-nlp and SWE-bench organizations.
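
    Before starting a local run, the recommended hardware figures above can be checked with a short preflight script. This is a sketch using only the Python standard library, not part of the swebench package: the function names are hypothetical, gigabytes are taken as 10^9 bytes, "AMD64" is treated as equivalent to x86_64, and the RAM check is skipped on platforms where `os.sysconf` cannot report it.

```python
import os
import platform
import shutil

# Recommended minimums as stated in the listing above.
MIN_FREE_DISK_GB = 120
MIN_RAM_GB = 16
MIN_CPU_CORES = 8
ACCEPTED_ARCHS = {"x86_64", "amd64"}  # assumption: AMD64 is the same architecture


def find_shortfalls(arch, free_disk_gb, ram_gb, cpu_cores):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if arch.lower() not in ACCEPTED_ARCHS:
        problems.append(f"architecture is {arch}, recommended x86_64")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"free disk is {free_disk_gb:.0f} GB, recommended {MIN_FREE_DISK_GB} GB")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"RAM is {ram_gb:.0f} GB, recommended {MIN_RAM_GB} GB")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"{cpu_cores} CPU cores, recommended {MIN_CPU_CORES}")
    return problems


def measure_host(path="."):
    """Collect (arch, free_disk_gb, ram_gb, cpu_cores) for the current machine."""
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    try:
        ram_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # not measurable here; the RAM check is skipped
    return platform.machine(), free_disk_gb, ram_gb, os.cpu_count() or 0


if __name__ == "__main__":
    for problem in find_shortfalls(*measure_host()) or ["all recommended minimums met"]:
        print(problem)
```

    Keeping the comparison logic in `find_shortfalls` separate from the measurement in `measure_host` makes the thresholds easy to test without depending on the machine the script runs on.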

    Companion Models and Datasets

    The repository ships pre-processed retrieval datasets (BM25 retrieval at 13K, 27K, 40K, and 50K token budgets) and fine-tuned SWE-Llama models (7B and 13B, with and without PEFT adapters) to support research into both inference and training. The related SWE-smith toolkit, announced in May 2025, provides a dedicated pipeline for generating synthetic software engineering training data. It was used to train SWE-agent-LM-32B, which the project page describes as the open-weight state of the art on SWE-bench Verified as of April 2025.
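
    To illustrate what BM25 retrieval under a token budget means, here is a minimal pure-Python sketch. It is not the project's retrieval pipeline: the file names and contents are invented, whitespace splitting stands in for a real tokenizer, the `k1` and `b` values are common defaults rather than the project's settings, and the greedy packing rule is an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer; a stand-in for a real model tokenizer."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document (name -> text) against the query with Okapi BM25."""
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return {}
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs or 1.0
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return scores


def select_within_budget(query, documents, token_budget):
    """Greedily keep the highest-scoring matching documents that fit the budget."""
    scores = bm25_scores(query, documents)
    ranked = sorted(documents, key=lambda name: scores[name], reverse=True)
    selected, used = [], 0
    for name in ranked:
        cost = len(tokenize(documents[name]))
        if scores[name] <= 0 or used + cost > token_budget:
            continue  # skip non-matching files and files that would overflow
        selected.append(name)
        used += cost
    return selected


# Invented repository files and issue text, for illustration only.
files = {
    "parser.py": "parse the date string and raise error on invalid date",
    "utils.py": "helper functions for logging",
    "cli.py": "command line entry point parse arguments",
}
issue = "date parse error"
```

    With these toy inputs, `select_within_budget(issue, files, 10)` keeps only `parser.py`, while a budget of 16 also admits `cli.py`; a larger budget trades longer prompts for a better chance that the relevant file is included.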

    Update: Multimodal Integration and Leaderboard Activity (2025–2026)

    In January 2025, SWE-bench Multimodal was integrated into the main repository, with test-split evaluation kept private and submissions routed through sb-cli. The leaderboard is actively updated; as of early 2026 the top entries on SWE-bench Verified exceed 76% resolved, with models from Anthropic, Google, OpenAI, and DeepSeek, as well as open-weight models, all represented. The project acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.

    Pricing

    Open Source

    • MIT License
    • Full benchmark datasets on Hugging Face
    • Docker-based evaluation harness
    • SWE-bench Lite, Verified, Multimodal, and Multilingual variants
    • Pre-processed retrieval datasets

    Capabilities

    Key Features

    • Real-world GitHub issue benchmark
    • Docker-based reproducible evaluation harness
    • SWE-bench Verified (500 human-confirmed solvable instances)
    • SWE-bench Lite (300-instance subset for cost-efficient evaluation)
    • SWE-bench Multimodal (visual software engineering tasks)
    • SWE-bench Multilingual (9 programming languages)
    • Public leaderboard with % Resolved metric
    • Cloud evaluation via Modal and sb-cli (AWS)
    • Pre-processed BM25 retrieval datasets
    • Fine-tuned SWE-Llama 7B and 13B models
    • Hugging Face dataset integration
    • Custom data collection pipeline for new repositories
    • Inference support for local and API-based models

    Integrations

    Docker
    Hugging Face Datasets
    Modal
    AWS
    GitHub
    OpenAI API
    Anthropic API
    BM25 retrieval

    Developer

    SWE-bench Team

    SWE-bench builds open-source benchmarks and tooling for evaluating large language models on real-world software engineering tasks. The project originates from Princeton and Stanford researchers, led by Carlos E. Jimenez and John Yang. It produces benchmark datasets, evaluation harnesses, fine-tuned models, and companion tools like SWE-agent and SWE-smith to advance AI software engineering research.

    Founded 2023
    Princeton, NJ
    15 employees

    Used by

    OpenAI
    Anthropic
    Google DeepMind
    Meta AI
    +1 more

    Similar Tools

    mdarena

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.

    ProgramBench

    A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

    Toolathlon

    Toolathlon is an open-source benchmark for evaluating language agents on diverse, realistic, and long-horizon tool-use tasks across 32 software applications and 604 tools.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    82 tools

    Automated Testing

    AI-powered platforms that automate end-to-end testing processes with intelligent test case generation, execution, and reporting for faster, more reliable software delivery.

    91 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    465 tools