EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026
    DeepSWE

    LLM Evaluations

    A benchmark for measuring frontier coding agents on 113 original, long-horizon software engineering tasks drawn from active open-source repositories across 5 languages.

    At a Glance

    Pricing
    Open Source

    Free and open-source benchmark available on GitHub.

    Available On

    CLI
    API

    Topics

    LLM Evaluations, AI Coding Assistants, Agent Harness

    Alternatives

    ProgramBench, InferenceBench, mdarena
    Developer
    Datacurve AI
    Datacurve AI builds tools and benchmarks for evaluating and…

    Listed May 2026

    About DeepSWE

    DeepSWE is a benchmark created by Datacurve AI to evaluate frontier coding agents on original, long-horizon software engineering tasks. It covers 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, drawn from 91 active open-source repositories, with isolated environments and program-based verifiers. The project is hosted on GitHub and pairs with Pier, a sandboxed coding-agent evaluation framework.

    What It Is

    DeepSWE is a software engineering benchmark designed to differentiate the performance of today's top AI coding agents in scenarios where existing public benchmarks have begun to saturate. Rather than adapting tasks from existing commits or pull requests, DeepSWE tasks are written from scratch to avoid contamination from model pretraining data. Each task includes a structured format with metadata, an agent-facing instruction, a reproducible Docker environment, a hand-written verifier, and a held-out reference solution for human and AI reviewers.

    What Separates It From Other Benchmarks

    The DeepSWE project page describes four advances over existing public benchmarks:

    • Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs.
    • High diversity: Tasks span 91 repositories across 5 programming languages.
    • Real-world complexity: The project page states that prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and approximately 2x more output tokens.
    • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details, accepting any solution with correct observable behavior regardless of internal symbol names or structure.
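The behavior-over-implementation idea in the last bullet can be illustrated with a minimal Python sketch. Nothing here is taken from DeepSWE itself: the `slugify` task, the test cases, and both candidate solutions are hypothetical. The verifier only calls the public entry point and compares outputs, so two solutions with different internals both pass.

```python
import re
from typing import Callable

# Hypothetical input/output pairs describing the required observable behavior.
CASES = {
    "Hello World": "hello-world",
    "  Rust & Go  ": "rust-go",
    "already-slugged": "already-slugged",
}


def verify_behavior(slugify: Callable[[str], str]) -> bool:
    """Accept any callable whose outputs match, ignoring how it is written."""
    return all(slugify(raw) == expected for raw, expected in CASES.items())


def solution_regex(text: str) -> str:
    """Candidate A: collapse runs of non-alphanumerics with a regex."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def solution_loop(text: str) -> str:
    """Candidate B: same behavior, built from an explicit character loop."""
    tokens, current = [], ""
    for ch in text.lower():
        if ch in "abcdefghijklmnopqrstuvwxyz0123456789":
            current += ch
        elif current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return "-".join(tokens)


def solution_broken(text: str) -> str:
    """Candidate C: wrong behavior, so the verifier should reject it."""
    return text.lower()
```

A verifier that instead asserted on a helper name or a particular regex would reject one of the two correct candidates, which is the failure mode the bullet describes avoiding.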

    Leaderboard and Results

    DeepSWE publishes a public leaderboard with scores for 12 frontier models, all produced with Pier running mini-swe-agent on Modal. According to the leaderboard, GPT-5.5 leads at 70%±4%, followed by GPT-5.4 at 56%±5% and Claude Opus 4.7 at 54%±5%. Lower-ranked models include Gemini 3.5 Flash at 28%±4% and DeepSeek V4 Pro at 8%±2%.
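The listing does not say how the leaderboard's ± values are computed. As a rough aid for reading such numbers, the sketch below shows one common textbook approach, the binomial standard error of a pass rate. Treating the 113 tasks as independent pass/fail trials is an assumption made here for illustration, not a description of DeepSWE's method.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured on n_tasks independent trials.

    Assumption: each task is an independent pass/fail trial. The leaderboard
    may use a different procedure (for example, repeated runs or bootstrapping).
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


if __name__ == "__main__":
    # A 70% pass rate on 113 tasks.
    print(f"{binomial_standard_error(0.70, 113):.3f}")  # prints 0.043
```

With only 113 tasks, an uncertainty of a few percentage points is expected under this model, so small gaps between adjacent models should be read with care.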

    Task Format and Evaluation Harness

    Tasks use the Harbor task format, with each task directory containing a task.toml for metadata, an instruction.md for the agent prompt, an environment/ Dockerfile, a tests/ verifier, and a solution/ reference patch. The evaluation harness, Pier, is a Harbor-compatible framework that adds per-agent network allowlists for air-gapped tasks, more complete trajectory metadata, a trajectory viewer, and a pier critique run command for analyzing agent trajectories. Pier supports running agents including mini-swe-agent, claude-code, codex, gemini-cli, and opencode, and can parallelize runs on Modal.
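The directory layout described above can be checked with a short Python sketch. The entry names (`task.toml`, `instruction.md`, `environment/`, `tests/`, `solution/`) come from the description; the function name and the decision to check only for presence are assumptions for illustration. This is not part of Harbor or Pier, and the files expected inside each subdirectory are not verified here.

```python
import tempfile
from pathlib import Path

# Entry names taken from the task format description above.
REQUIRED_FILES = ("task.toml", "instruction.md")
REQUIRED_DIRS = ("environment", "tests", "solution")


def missing_entries(task_dir: Path) -> list[str]:
    """Return the required files and directories absent from task_dir."""
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Build a deliberately incomplete task directory and report the gaps.
    with tempfile.TemporaryDirectory() as tmp:
        task = Path(tmp)
        (task / "task.toml").write_text("# hypothetical metadata\n")
        (task / "tests").mkdir()
        print(missing_entries(task))
```

A check like this is useful before launching a long sandboxed run, since a missing verifier or instruction file would otherwise surface only after the environment has been built.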

    Current Status

    The repository was created in May 2026 and last updated in late May 2026, indicating it is a recently launched and actively maintained project. It has 71 stars and 2 forks on GitHub. No license is specified in the repository at this time.

    Pricing

    Open Source

    Free and open-source benchmark available on GitHub.

    • 113 original software engineering tasks
    • Tasks across TypeScript, Go, Python, JavaScript, and Rust
    • Isolated Docker environments
    • Hand-written verifiers
    • Reference solutions for reviewers

    Capabilities

    Key Features

    • 113 original long-horizon software engineering tasks
    • Tasks span TypeScript, Go, Python, JavaScript, and Rust
    • 91 open-source repositories covered
    • Contamination-free task design (written from scratch)
    • Hand-written, program-based verifiers that test behavior, not implementation
    • Isolated Docker environments per task
    • Held-out reference solutions for reviewers
    • Public leaderboard with confidence intervals
    • Pier evaluation harness with per-agent network allowlists
    • Support for mini-swe-agent, claude-code, codex, gemini-cli, opencode
    • Parallel sandbox execution on Modal
    • Trajectory metadata and viewer
    • pier critique run for agent trajectory analysis
    • Deterministic random subset sampling for partial runs
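The last feature, deterministic random subset sampling, can be sketched in Python as follows. The listing does not describe how Pier implements it or what options it exposes, so the function name, the `seed` parameter, and the task ID format are all hypothetical. The sketch only shows the general technique: a seeded generator applied to a sorted task list yields the same subset on every run.

```python
import random
from typing import Iterable


def sample_tasks(task_ids: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Pick k task IDs reproducibly.

    Sorting first removes any dependence on input order, and a private
    random.Random instance keeps the result independent of global state.
    """
    ordered = sorted(task_ids)
    if not 0 <= k <= len(ordered):
        raise ValueError("k must be between 0 and the number of tasks")
    return random.Random(seed).sample(ordered, k)


if __name__ == "__main__":
    # Hypothetical task IDs; 113 matches the benchmark's task count.
    all_tasks = [f"task-{i:03d}" for i in range(113)]
    subset = sample_tasks(all_tasks, k=20, seed=42)
    assert subset == sample_tasks(reversed(all_tasks), k=20, seed=42)
    print(len(subset))  # prints 20
```

Fixing the seed lets two partial runs of different agents be compared on an identical subset of tasks.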

    Integrations

    Anthropic Claude
    OpenAI GPT
    Google Gemini
    Moonshot Kimi
    DeepSeek
    mini-swe-agent
    claude-code
    codex
    gemini-cli
    opencode
    Modal
    Harbor framework
    Pier
    API Available

    Developer

    Datacurve AI

    Datacurve AI builds tools and benchmarks for evaluating and improving frontier coding agents. The team develops DeepSWE, a contamination-free long-horizon software engineering benchmark, and Pier, a sandboxed coding-agent evaluation framework. Their work focuses on rigorous, real-world measurement of AI coding capabilities across multiple programming languages and open-source repositories.


    Similar Tools

    ProgramBench

    A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

    InferenceBench

    An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.

    mdarena

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    86 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    475 tools

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    86 tools