mdarena
Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.
At a Glance
Fully free and open source under the MIT License. Install via pip and use without restrictions.
Listed Apr 2026
About mdarena
mdarena is a CLI tool that lets you empirically benchmark CLAUDE.md (and AGENTS.md) files against tasks derived from your own repository's merged pull requests. Instead of writing agent context files blindly, mdarena mines historical PRs, runs Claude Code under different conditions, and grades patches against the real gold diff — the same way SWE-bench does it. It supports statistical significance testing, monorepo structures, and SWE-bench compatibility.
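The SWE-bench compatibility mentioned above implies tasks serialized as JSONL, one record per line. The sketch below is illustrative only, not mdarena's code; the field names follow SWE-bench's public schema (instance_id, repo, base_commit, patch), and the exact set mdarena emits is an assumption here.

```python
import json

# Hypothetical task record using core SWE-bench-style fields.
# The values are made up for illustration.
task = {
    "instance_id": "myorg__myrepo-123",
    "repo": "myorg/myrepo",
    "base_commit": "abc123",  # pre-PR commit the agent starts from
    "problem_statement": "Fix the off-by-one error in pagination.",
    "patch": "diff --git a/pager.py b/pager.py\n...",  # gold PR diff
}

def to_jsonl(tasks):
    """Serialize task dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(t, sort_keys=True) for t in tasks)
```

Because JSONL escapes embedded newlines inside string values, each record stays on a single line and the file remains streamable line by line.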
- `mdarena mine`: Fetches merged PRs from a GitHub repo and builds a reproducible task set, with auto-detection of test commands from CI/CD configs and package files.
- `mdarena run`: Checks out the repo at the pre-PR commit, strips or injects CLAUDE.md files per condition, runs Claude Code, and captures the resulting git diff and test results.
- `mdarena report`: Compares agent-generated patches against the gold PR diff using test pass/fail, file/hunk overlap, token cost, and paired t-test statistical significance.
- Baseline comparison: Automatically runs a stripped baseline (no CLAUDE.md) alongside your test conditions so you can see the true delta.
- Monorepo support: Pass a directory of CLAUDE.md files mirroring your repo structure to benchmark per-directory instruction trees.
- SWE-bench compatibility: Import SWE-bench Lite tasks or export your own task set as SWE-bench JSONL for cross-benchmark comparisons.
- Benchmark integrity: Uses `git archive` to create history-free checkouts, preventing the agent from walking future commits via git tags, closing the exploit seen in Claude 4 Sonnet on SWE-bench.
- Security isolation: Each task runs in an isolated temp directory under `/tmp`; test commands and Claude Code are sandboxed per task.
- Open source (MIT): Fully open source, installable via `pip install mdarena`, and extensible for custom grading or CI integration.
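The file-overlap part of the grading described above can be illustrated with a small sketch. This is not mdarena's actual scorer; it is a simple Jaccard similarity over the file paths touched by each unified diff, with the function names chosen here for illustration.

```python
import re

def changed_files(diff_text):
    """Extract the set of file paths touched by a unified diff."""
    return set(re.findall(r"^diff --git a/(\S+) b/", diff_text, flags=re.M))

def file_overlap(agent_diff, gold_diff):
    """Jaccard similarity between the file sets of two diffs."""
    a, g = changed_files(agent_diff), changed_files(gold_diff)
    if not a and not g:
        return 1.0  # two empty diffs agree trivially
    return len(a & g) / len(a | g)
```

A patch that touches one of the gold PR's two files and nothing else would score 0.5 under this metric; hunk-level overlap would refine this by comparing changed line ranges within each shared file.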
Pricing
Open Source (MIT)
- Mine merged PRs into benchmark task sets
- Benchmark multiple CLAUDE.md files
- Test pass/fail and diff overlap grading
- SWE-bench import/export
- Monorepo support
Capabilities
Key Features
- Mine merged PRs into a reproducible benchmark task set
- Benchmark multiple CLAUDE.md files head-to-head
- Auto-detect test commands from CI/CD and package files
- Grade patches via test pass/fail and diff overlap scoring
- Statistical significance via paired t-test
- Monorepo support with directory-based CLAUDE.md trees
- SWE-bench import and export compatibility
- History-free git checkouts to prevent benchmark exploitation
- Baseline condition strips all CLAUDE.md files automatically
- Token cost and usage tracking per condition
