EveryDev.ai

    DeepSWE

    LLM Evaluations

    A benchmark for measuring frontier coding agents on 113 original, long-horizon software engineering tasks drawn from active open-source repositories across 5 languages.


    At a Glance

    Pricing
    Open Source

    Free and open-source benchmark available on GitHub.

    Available On

    CLI
    API

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    LLM Evaluations, AI Coding Assistants, Agent Harness

    Alternatives

    BridgeBench, harness-kit, Agent Reading Test

    Developer

    Datacurve AI; San Francisco, CA; Est. 2024; $17.7M raised

    Listed May 2026

    About DeepSWE

    DeepSWE is a benchmark created by Datacurve AI to evaluate frontier coding agents on original, long-horizon software engineering tasks. It covers 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, drawn from 91 active open-source repositories, with isolated environments and program-based verifiers. The project is hosted on GitHub and pairs with Pier, a sandboxed coding-agent evaluation framework.

    What It Is

    DeepSWE is a software engineering benchmark designed to differentiate today's top AI coding agents in scenarios where existing public benchmarks have begun to saturate. Rather than adapting tasks from existing commits or pull requests, DeepSWE tasks are written from scratch to avoid contamination from model pretraining data. Each task follows a structured format: metadata, an agent-facing instruction, a reproducible Docker environment, a hand-written verifier, and a held-out reference solution for human and AI reviewers.

    What Separates It From Other Benchmarks

    The DeepSWE project page describes four advances over existing public benchmarks:

    • Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs.
    • High diversity: Tasks span 91 repositories across 5 programming languages.
    • Real-world complexity: The project page states that prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and approximately 2x more output tokens.
    • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details, accepting any solution with correct observable behavior regardless of internal symbol names or structure.
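
The "reliable verification" point can be illustrated with a minimal sketch. It is not taken from the DeepSWE verifiers; `verify_behavior`, the two `slugify_*` functions, and `CASES` are hypothetical names invented for this example. The check looks only at input and output, so two solutions with different internals both pass.

```python
from typing import Callable, Iterable, Tuple


def verify_behavior(
    candidate: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> bool:
    """Return True if the candidate gives the expected output for every case.

    Only observable behavior is checked; function names and internal
    structure of the candidate are never inspected.
    """
    return all(candidate(given) == expected for given, expected in cases)


# Hypothetical solution A: relies on str.split to collapse whitespace.
def slugify_with_split(title: str) -> str:
    return "-".join(title.lower().split())


# Hypothetical solution B: same behavior, different implementation.
def slugify_with_loop(title: str) -> str:
    words = []
    current = ""
    for ch in title.lower():
        if ch.isspace():
            if current:
                words.append(current)
                current = ""
        else:
            current += ch
    if current:
        words.append(current)
    return "-".join(words)


# Hypothetical test cases, invented for illustration.
CASES = [("Hello World", "hello-world"), ("  Deep   SWE ", "deep-swe")]

if __name__ == "__main__":
    print(verify_behavior(slugify_with_split, CASES))
    print(verify_behavior(slugify_with_loop, CASES))
```

A verifier that instead asserted the presence of a specific helper name would reject one of these two solutions even though both behave correctly.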

    Leaderboard and Results

    DeepSWE publishes a public leaderboard with scores for 12 frontier models. All scores were produced with Pier running mini-swe-agent on Modal. According to the leaderboard, GPT-5.5 leads at 70%±4%, followed by GPT-5.4 at 56%±5% and Claude Opus 4.7 at 54%±5%. Lower-ranked models include Gemini 3.5 Flash at 28%±4% and DeepSeek V4 Pro at 8%±2%.
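
The listing does not say how the ± values on the leaderboard are computed. As a rough aid for reading them, the sketch below computes the binomial standard error of a pass rate over 113 tasks. This is an assumption about one common way to attach uncertainty to a pass rate, not the leaderboard's documented method, and `binomial_standard_error` is a hypothetical helper name.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured over n independent tasks.

    Uses sqrt(p * (1 - p) / n). Treating tasks as independent Bernoulli
    trials is an assumption made for this sketch.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


if __name__ == "__main__":
    # Pass rates quoted in the listing; 113 is the benchmark's task count.
    for rate in (0.70, 0.56, 0.54, 0.28, 0.08):
        se = binomial_standard_error(rate, 113)
        print(f"{rate:.0%}: standard error of about {se:.1%}")
```

With only 113 tasks, differences of a few percentage points between adjacent models can fall within this kind of sampling noise, which is why the reported intervals matter when comparing ranks.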

    Task Format and Evaluation Harness

    Tasks use the Harbor task format, with each task directory containing a task.toml for metadata, an instruction.md for the agent prompt, an environment/ Dockerfile, a tests/ verifier, and a solution/ reference patch. The evaluation harness, Pier, is a Harbor-compatible framework that adds per-agent network allowlists for air-gapped tasks, more complete trajectory metadata, a trajectory viewer, and a pier critique run command for analyzing agent trajectories. Pier supports running agents including mini-swe-agent, claude-code, codex, gemini-cli, and opencode, and can parallelize runs on Modal.
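
The task layout described above can be checked with a short script. This is a minimal sketch based only on the entry names given in this listing (`task.toml`, `instruction.md`, `environment/`, `tests/`, `solution/`); it is not part of Harbor or Pier, and `missing_entries` is a hypothetical helper. The contents of each directory are not inspected.

```python
from pathlib import Path
from typing import Dict, List

# Top-level entries named in the listing. Whether Harbor requires anything
# beyond these is not stated here, so this mapping is an assumption.
REQUIRED_ENTRIES: Dict[str, str] = {
    "task.toml": "file",       # task metadata
    "instruction.md": "file",  # agent-facing prompt
    "environment": "dir",      # holds the Dockerfile
    "tests": "dir",            # verifier
    "solution": "dir",         # reference patch
}


def missing_entries(task_dir: Path) -> List[str]:
    """Return the required entries that are absent from task_dir."""
    missing = []
    for name, kind in REQUIRED_ENTRIES.items():
        path = task_dir / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    import sys

    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    gaps = missing_entries(target)
    print("layout looks complete" if not gaps else f"missing: {gaps}")
```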

    Current Status

    The repository was created in May 2026 and last updated in late May 2026, so it is a recently launched project. It has 71 stars and 2 forks on GitHub. The repository does not currently specify a license.

    Pricing

    Open Source

    Free and open-source benchmark available on GitHub.

    • 113 original software engineering tasks
    • Tasks across TypeScript, Go, Python, JavaScript, and Rust
    • Isolated Docker environments
    • Hand-written verifiers
    • Reference solutions for reviewers

    Capabilities

    Key Features

    • 113 original long-horizon software engineering tasks
    • Tasks span TypeScript, Go, Python, JavaScript, and Rust
    • 91 open-source repositories covered
    • Contamination-free task design (written from scratch)
    • Hand-written program-based verifiers that test behavior, not implementation
    • Isolated Docker environments per task
    • Held-out reference solutions for reviewers
    • Public leaderboard with confidence intervals
    • Pier evaluation harness with per-agent network allowlists
    • Support for mini-swe-agent, claude-code, codex, gemini-cli, opencode
    • Parallel sandbox execution on Modal
    • Trajectory metadata and viewer
    • pier critique run for agent trajectory analysis
    • Deterministic random subset sampling for partial runs
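
The "deterministic random subset sampling" feature can be pictured with a small sketch. The listing does not describe how Pier implements it, so the approach below (a seeded generator over a sorted ID list) is an assumption, and `sample_tasks` and the `task-NNN` IDs are hypothetical.

```python
import random
from typing import Iterable, List


def sample_tasks(task_ids: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick k task IDs so the same seed always yields the same subset.

    Sorting first makes the result independent of the input order, and a
    dedicated random.Random instance avoids touching global random state.
    """
    ordered = sorted(task_ids)
    if k > len(ordered):
        raise ValueError("k cannot exceed the number of tasks")
    return random.Random(seed).sample(ordered, k)


if __name__ == "__main__":
    # Hypothetical IDs standing in for the benchmark's 113 tasks.
    all_tasks = [f"task-{i:03d}" for i in range(113)]
    subset = sample_tasks(all_tasks, k=20, seed=7)
    print(subset[:5])
```

Fixing the seed means two partial runs cover the same tasks, so their scores can be compared directly.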

    Integrations

    Anthropic Claude
    OpenAI GPT
    Google Gemini
    Moonshot Kimi
    DeepSeek
    mini-swe-agent
    claude-code
    codex
    gemini-cli
    opencode
    Modal
    Harbor framework
    Pier
    Developer

    Datacurve AI

    Datacurve AI builds tools and benchmarks for evaluating and improving frontier coding agents. The team develops DeepSWE, a contamination-free long-horizon software engineering benchmark, and Pier, a sandboxed coding-agent evaluation framework. Their work focuses on rigorous, real-world measurement of AI coding capabilities across multiple programming languages and open-source repositories.

    Founded 2024
    San Francisco, CA
    $17.7M raised
    47 employees

    Used by

    Together AI (Partnership)
    Major foundation-model labs (Confidenti…

    Similar Tools

    BridgeBench

    BridgeBench ranks AI coding models across UI generation, security, refactoring, hallucination, debugging, and speed benchmarks.

    harness-kit

    A Python toolkit for building and evaluating AI agent harnesses, enabling structured testing and benchmarking of LLM-based agents.

    Agent Reading Test

    A benchmark that tests how well AI coding agents can read web content, surfacing silent failure modes like truncation, CSS burial, SPA shells, and broken markdown parsing.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    87 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    514 tools

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    94 tools