    mdarena

    LLM Evaluations

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.


At a Glance

Pricing: Open Source (MIT)

Available On: API, CLI

Resources: Website, Docs, GitHub, llms.txt

Topics: LLM Evaluations, AI Coding Assistants, Automated Testing

Alternatives: Giskard, DeepEval, Patronus AI

Developer: HudsonGri

Listed: Apr 2026

    About mdarena

    mdarena is a CLI tool that lets you empirically benchmark CLAUDE.md (and AGENTS.md) files against tasks derived from your own repository's merged pull requests. Instead of writing agent context files blindly, mdarena mines historical PRs, runs Claude Code under different conditions, and grades patches against the real gold diff — the same way SWE-bench does it. It supports statistical significance testing, monorepo structures, and SWE-bench compatibility.

    • mdarena mine: Fetches merged PRs from a GitHub repo and builds a reproducible task set, with auto-detection of test commands from CI/CD configs and package files.
    • mdarena run: Checks out the repo at the pre-PR commit, strips or injects CLAUDE.md files per condition, runs Claude Code, and captures the resulting git diff and test results.
    • mdarena report: Compares agent-generated patches against the gold PR diff using test pass/fail, file/hunk overlap, token cost, and paired t-test statistical significance.
    • Baseline comparison: Automatically runs a stripped baseline (no CLAUDE.md) alongside your test conditions so you can see the true delta.
    • Monorepo support: Pass a directory of CLAUDE.md files mirroring your repo structure to benchmark per-directory instruction trees (see the first sketch after this list).
    • SWE-bench compatibility: Import SWE-bench Lite tasks or export your own task set as SWE-bench JSONL for cross-benchmark comparisons.
    • Benchmark integrity: Uses git archive to create history-free checkouts, preventing the agent from walking future commits via git tags and closing the exploit seen in Claude 4 Sonnet on SWE-bench (see the second sketch after this list).
    • Security isolation: Each task runs in an isolated temp directory under /tmp; test commands and Claude Code are sandboxed per task.
    • Open source (MIT): Fully open source, installable via pip install mdarena, and extensible for custom grading or CI integration.
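    The per-condition setup described in the bullets above is straightforward to picture in code. A minimal sketch, assuming hypothetical helper names and a conditions directory that mirrors the repo layout (an illustration of the idea, not mdarena's API): the baseline condition deletes every CLAUDE.md in the checkout, while a monorepo condition copies a mirrored tree of CLAUDE.md files into the matching repo directories.

```python
import shutil
from pathlib import Path

def strip_claude_md(checkout: Path) -> None:
    """Baseline condition: delete every CLAUDE.md anywhere in the checkout."""
    for f in checkout.rglob("CLAUDE.md"):
        f.unlink()

def inject_claude_md(checkout: Path, condition_dir: Path) -> None:
    """Copy a mirrored tree of CLAUDE.md files into the matching repo directories.

    For example, condition_dir/packages/api/CLAUDE.md would land at
    checkout/packages/api/CLAUDE.md (hypothetical layout).
    """
    for src in condition_dir.rglob("CLAUDE.md"):
        dest = checkout / src.relative_to(condition_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
```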
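    The history-free checkout mentioned in the benchmark-integrity bullet relies on standard git behavior: git archive exports only the tree of a single commit, so the unpacked task directory carries no .git history, tags, or future commits the agent could mine. A minimal sketch of that idea (the helper name and temp-directory prefix are assumptions, not mdarena's implementation):

```python
import subprocess
import tempfile
from pathlib import Path

def history_free_checkout(repo_path: str, base_commit: str) -> Path:
    """Export one commit's tree with no git history (hypothetical helper)."""
    workdir = Path(tempfile.mkdtemp(prefix="mdarena-task-"))
    # git archive writes a tar of the tree at base_commit; the output contains
    # no .git directory, so later commits and tags are unreachable.
    archive = subprocess.run(
        ["git", "-C", repo_path, "archive", "--format=tar", base_commit],
        check=True, capture_output=True,
    )
    # Unpack the tar into a fresh temp directory to get the working checkout.
    subprocess.run(["tar", "-x", "-C", str(workdir)], input=archive.stdout, check=True)
    return workdir
```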


    Pricing


    Open Source (MIT)

    Fully free and open source under the MIT License. Install via pip and use without restrictions.

    • Mine merged PRs into benchmark task sets
    • Benchmark multiple CLAUDE.md files
    • Test pass/fail and diff overlap grading
    • SWE-bench import/export
    • Monorepo support

    Capabilities

    Key Features

    • Mine merged PRs into a reproducible benchmark task set
    • Benchmark multiple CLAUDE.md files head-to-head
    • Auto-detect test commands from CI/CD and package files
    • Grade patches via test pass/fail and diff overlap scoring
    • Statistical significance via paired t-test (see the sketch after this list)
    • Monorepo support with directory-based CLAUDE.md trees
    • SWE-bench import and export compatibility
    • History-free git checkouts to prevent benchmark exploitation
    • Baseline condition strips all CLAUDE.md files automatically
    • Token cost and usage tracking per condition
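    To make the grading and significance features above concrete, here is a conceptual sketch rather than mdarena's code: the overlap score is a simple Jaccard similarity over the files each patch touches (the helper names are hypothetical), and the paired t-test compares per-task scores for a CLAUDE.md condition against the stripped baseline on the same tasks, so task-to-task difficulty cancels out.

```python
import re
from scipy.stats import ttest_rel

def files_touched(unified_diff: str) -> set[str]:
    """Collect paths from '+++ b/<path>' headers in a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", unified_diff, flags=re.MULTILINE))

def file_overlap(agent_diff: str, gold_diff: str) -> float:
    """Jaccard overlap between files the agent touched and files the gold PR touched."""
    agent, gold = files_touched(agent_diff), files_touched(gold_diff)
    return len(agent & gold) / len(agent | gold) if agent | gold else 0.0

# Per-task scores for the same mined tasks under two conditions
# (illustrative numbers only, not real results).
baseline_scores  = [0.40, 0.55, 0.30, 0.62, 0.48]   # CLAUDE.md stripped
condition_scores = [0.52, 0.58, 0.41, 0.60, 0.57]   # CLAUDE.md injected

# Paired t-test: each task is its own pair.
t_stat, p_value = ttest_rel(condition_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```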

    Integrations

    Claude Code (claude CLI)
    GitHub CLI (gh)
    SWE-bench
    GitHub Actions / CI workflows
    pyproject.toml
    package.json
    Cargo.toml
    go.mod
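    Test-command auto-detection from the manifest files listed above can be imagined as a small lookup keyed on which file exists in the repo root. This is a hedged sketch of the general idea; the mapping and the helper name are assumptions, and the tool's real detection also reads CI/CD workflows and package files per the feature list.

```python
from pathlib import Path

# Hypothetical manifest-to-test-command mapping; real detection would also
# inspect CI workflow files and package.json "scripts" entries.
MANIFEST_TO_TEST_CMD = {
    "pyproject.toml": "pytest",
    "package.json": "npm test",
    "Cargo.toml": "cargo test",
    "go.mod": "go test ./...",
}

def detect_test_command(repo_root: str) -> str | None:
    """Return a best-guess test command based on which manifest is present."""
    root = Path(repo_root)
    for manifest, command in MANIFEST_TO_TEST_CMD.items():
        if (root / manifest).exists():
            return command
    return None
```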


    Developer

    HudsonGri

    HudsonGri builds open-source developer tooling focused on AI agent evaluation. The mdarena project provides empirical benchmarking for CLAUDE.md context files using real repository history. The project is MIT-licensed and hosted on GitHub.

    Website, GitHub
    1 tool in directory

    Similar Tools


    Giskard

    Automated testing platform for LLM agents that detects hallucinations, security vulnerabilities, and quality issues through continuous red teaming.


    DeepEval

    DeepEval is an open-source LLM evaluation framework that enables developers to build reliable evaluation pipelines and test any AI system with 50+ research-backed metrics.


    Patronus AI

    Automated evaluation and monitoring platform that scores, detects failures, and optimizes LLMs and AI agents using evaluation models, experiments, traces, and an API/SDK ecosystem.


    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    53 tools

    AI Coding Assistants

    AI tools that help write, edit, and understand code with intelligent suggestions.

    356 tools

    Automated Testing

    AI-powered platforms that automate end-to-end testing processes with intelligent test case generation, execution, and reporting for faster, more reliable software delivery.

    79 tools