    mdarena

    LLM Evaluations

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.


At a Glance

Pricing: Open Source (MIT)

Available On: API, CLI

Resources: Website, Docs, GitHub, llms.txt

Topics: LLM Evaluations, AI Coding Assistants, Automated Testing

Alternatives: Giskard, DeepEval, Patronus AI

Developer: HudsonGri

Listed: Apr 2026

    About mdarena

    mdarena is a CLI tool that lets you empirically benchmark CLAUDE.md (and AGENTS.md) files against tasks derived from your own repository's merged pull requests. Instead of writing agent context files blindly, mdarena mines historical PRs, runs Claude Code under different conditions, and grades patches against the real gold diff — the same way SWE-bench does it. It supports statistical significance testing, monorepo structures, and SWE-bench compatibility.

    • mdarena mine: Fetches merged PRs from a GitHub repo and builds a reproducible task set, with auto-detection of test commands from CI/CD configs and package files.
    • mdarena run: Checks out the repo at the pre-PR commit, strips or injects CLAUDE.md files per condition, runs Claude Code, and captures the resulting git diff and test results.
    • mdarena report: Compares agent-generated patches against the gold PR diff using test pass/fail, file/hunk overlap, token cost, and paired t-test statistical significance.
    • Baseline comparison: Automatically runs a stripped baseline (no CLAUDE.md) alongside your test conditions so you can see the true delta.
    • Monorepo support: Pass a directory of CLAUDE.md files mirroring your repo structure to benchmark per-directory instruction trees (see the first sketch after this list).
    • SWE-bench compatibility: Import SWE-bench Lite tasks or export your own task set as SWE-bench JSONL for cross-benchmark comparisons.
    • Benchmark integrity: Uses git archive to create history-free checkouts, preventing the agent from walking future commits via git tags and closing the exploit seen in Claude 4 Sonnet on SWE-bench (see the second sketch after this list).
    • Security isolation: Each task runs in an isolated temp directory under /tmp; test commands and Claude Code are sandboxed per task.
    • Open source (MIT): Fully open source, installable via pip install mdarena, and extensible for custom grading or CI integration.
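    The per-condition setup described in the bullets above is straightforward to picture in code. A minimal sketch, assuming hypothetical helper names and a conditions directory that mirrors the repo layout (an illustration of the idea, not mdarena's API): the baseline condition deletes every CLAUDE.md in the checkout, while a monorepo condition copies a mirrored tree of CLAUDE.md files into the matching repo directories.

```python
import shutil
from pathlib import Path

def strip_claude_md(checkout: Path) -> None:
    """Baseline condition: delete every CLAUDE.md anywhere in the checkout."""
    for f in checkout.rglob("CLAUDE.md"):
        f.unlink()

def inject_claude_md(checkout: Path, condition_dir: Path) -> None:
    """Copy a mirrored tree of CLAUDE.md files into the matching repo directories.

    For example, condition_dir/packages/api/CLAUDE.md would land at
    checkout/packages/api/CLAUDE.md (hypothetical layout).
    """
    for src in condition_dir.rglob("CLAUDE.md"):
        dest = checkout / src.relative_to(condition_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
```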
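    The history-free checkout mentioned in the benchmark-integrity bullet relies on standard git behavior: git archive exports only the tree of a single commit, so the unpacked task directory carries no .git history, tags, or future commits the agent could mine. A minimal sketch of that idea (the helper name and temp-directory prefix are assumptions, not mdarena's implementation):

```python
import subprocess
import tempfile
from pathlib import Path

def history_free_checkout(repo_path: str, base_commit: str) -> Path:
    """Export one commit's tree with no git history (hypothetical helper)."""
    workdir = Path(tempfile.mkdtemp(prefix="mdarena-task-"))
    # git archive writes a tar of the tree at base_commit; the output contains
    # no .git directory, so later commits and tags are unreachable.
    archive = subprocess.run(
        ["git", "-C", repo_path, "archive", "--format=tar", base_commit],
        check=True, capture_output=True,
    )
    # Unpack the tar into a fresh temp directory to get the working checkout.
    subprocess.run(["tar", "-x", "-C", str(workdir)], input=archive.stdout, check=True)
    return workdir
```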


    Pricing


    Open Source (MIT)

    Fully free and open source under the MIT License. Install via pip and use without restrictions.

    • Mine merged PRs into benchmark task sets
    • Benchmark multiple CLAUDE.md files
    • Test pass/fail and diff overlap grading
    • SWE-bench import/export
    • Monorepo support

    Capabilities

    Key Features

    • Mine merged PRs into a reproducible benchmark task set
    • Benchmark multiple CLAUDE.md files head-to-head
    • Auto-detect test commands from CI/CD and package files
    • Grade patches via test pass/fail and diff overlap scoring
    • Statistical significance via paired t-test (see the sketch after this list)
    • Monorepo support with directory-based CLAUDE.md trees
    • SWE-bench import and export compatibility
    • History-free git checkouts to prevent benchmark exploitation
    • Baseline condition strips all CLAUDE.md files automatically
    • Token cost and usage tracking per condition
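    To make the grading and significance features above concrete, here is a conceptual sketch rather than mdarena's code: the overlap score is a simple Jaccard similarity over the files each patch touches (the helper names are hypothetical), and the paired t-test compares per-task scores for a CLAUDE.md condition against the stripped baseline on the same tasks, so task-to-task difficulty cancels out.

```python
import re
from scipy.stats import ttest_rel

def files_touched(unified_diff: str) -> set[str]:
    """Collect paths from '+++ b/<path>' headers in a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", unified_diff, flags=re.MULTILINE))

def file_overlap(agent_diff: str, gold_diff: str) -> float:
    """Jaccard overlap between files the agent touched and files the gold PR touched."""
    agent, gold = files_touched(agent_diff), files_touched(gold_diff)
    return len(agent & gold) / len(agent | gold) if agent | gold else 0.0

# Per-task scores for the same mined tasks under two conditions
# (illustrative numbers only, not real results).
baseline_scores  = [0.40, 0.55, 0.30, 0.62, 0.48]   # CLAUDE.md stripped
condition_scores = [0.52, 0.58, 0.41, 0.60, 0.57]   # CLAUDE.md injected

# Paired t-test: each task is its own pair.
t_stat, p_value = ttest_rel(condition_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```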

    Integrations

    Claude Code (claude CLI)
    GitHub CLI (gh)
    SWE-bench
    GitHub Actions / CI workflows
    pyproject.toml
    package.json
    Cargo.toml
    go.mod
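    Test-command auto-detection from the manifest files listed above can be imagined as a small lookup keyed on which file exists in the repo root. This is a hedged sketch of the general idea; the mapping and the helper name are assumptions, and the tool's real detection also reads CI/CD workflows and package files per the feature list.

```python
from pathlib import Path

# Hypothetical manifest-to-test-command mapping; real detection would also
# inspect CI workflow files and package.json "scripts" entries.
MANIFEST_TO_TEST_CMD = {
    "pyproject.toml": "pytest",
    "package.json": "npm test",
    "Cargo.toml": "cargo test",
    "go.mod": "go test ./...",
}

def detect_test_command(repo_root: str) -> str | None:
    """Return a best-guess test command based on which manifest is present."""
    root = Path(repo_root)
    for manifest, command in MANIFEST_TO_TEST_CMD.items():
        if (root / manifest).exists():
            return command
    return None
```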


    Developer

    HudsonGri

    HudsonGri builds open-source developer tooling focused on AI agent evaluation. The mdarena project provides empirical benchmarking for CLAUDE.md context files using real repository history. The project is MIT-licensed and hosted on GitHub.

    Website, GitHub
    1 tool in directory

    Similar Tools


    Giskard

    Automated testing platform for LLM agents that detects hallucinations, security vulnerabilities, and quality issues through continuous red teaming.


    DeepEval

    DeepEval is an open-source LLM evaluation framework that enables developers to build reliable evaluation pipelines and test any AI system with 50+ research-backed metrics.


    Patronus AI

    Automated evaluation and monitoring platform that scores, detects failures, and optimizes LLMs and AI agents using evaluation models, experiments, traces, and an API/SDK ecosystem.


    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    53 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    356 tools

    Automated Testing

    AI-powered platforms that automate end-to-end testing processes with intelligent test case generation, execution, and reporting for faster, more reliable software delivery.

    79 tools