
    agent-skills-eval

    LLM Evaluations

    A test runner for Agent Skills that evaluates whether your SKILL.md actually improves model performance by running evals with and without the skill loaded.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT license. Install via npm or run with npx.

    Available On

    Web
    API
    SDK
    CLI

    Resources

    Website · Docs · GitHub · llms.txt

    Topics

    LLM Evaluations · Agent Skill Registries · Automated Testing

    Alternatives

    Ashr · Giskard · mdarena
    Developer
    darkrishabh
    darkrishabh builds open-source developer tooling for the AI agent ecosystem.

    Listed May 2026

    About agent-skills-eval

    agent-skills-eval is an open-source TypeScript CLI and SDK that brings empirical testing to the Agent Skills ecosystem. It runs every eval twice — once with the skill loaded into context and once without — then uses a judge model to grade both outputs side by side, giving developers concrete evidence of whether a skill improves model behavior. The project is MIT-licensed and published on npm under the agent-skills-eval package name.

    What It Is

    agent-skills-eval is a test framework purpose-built for the agentskills.io specification, which defines a standard for giving AI agents domain knowledge via SKILL.md files. The tool fills the gap between writing a skill and knowing whether it works: it automates the with_skill vs without_skill comparison, judge-grades the outputs against declared assertions, and produces portable JSON artifacts plus a static HTML report. It is framework-agnostic and works with any OpenAI-compatible API endpoint.

    How the Evaluation Loop Works

    The core mental model is a controlled A/B test per eval:

    • The same prompt is sent to the target model twice — once with the SKILL.md injected into context, once without (baseline)
    • A configurable judge model scores both outputs against the eval's expected_output and assertions
    • Results are written to a structured iteration-N/ workspace with benchmark.json, per-eval grading.json, and timing data
    • A static HTML report is generated showing pass rates, assertion-level judge reasoning, full outputs side by side, and token/latency metrics

    The --baseline flag enables the comparison run; omitting it produces only the with_skill result.
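
    As a rough illustration of that loop, here is a minimal TypeScript sketch. It is not the tool's internal API: the chat and judge functions stand in for any OpenAI-compatible calls, and the field names mirror the artifacts described above.

        // Illustrative only; agent-skills-eval's real runner also records timing and token usage.
        type Judgement = { passed: boolean; reasoning: string };

        async function runEval(
          chat: (skill: string | null, prompt: string) => Promise<string>,
          judge: (output: string, expected: string, assertions: string[]) => Promise<Judgement[]>,
          skillMd: string,
          evalCase: { prompt: string; expected_output: string; assertions: string[] },
        ) {
          // Same prompt, two runs: skill injected into context vs. plain baseline.
          const withSkill = await chat(skillMd, evalCase.prompt);
          const withoutSkill = await chat(null, evalCase.prompt);

          // A judge model grades each output against the declared expectations.
          return {
            with_skill: await judge(withSkill, evalCase.expected_output, evalCase.assertions),
            without_skill: await judge(withoutSkill, evalCase.expected_output, evalCase.assertions),
          };
        }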

    CLI and SDK Surface

    The tool ships both a one-liner CLI and a full TypeScript SDK for programmatic use:

    • CLI: npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict
    • YAML config: supports root, workspace, concurrency, include/exclude globs, logging format (pretty, jsonl, silent), and report output path
    • TypeScript SDK: evaluateSkills() accepts typed config, streams events via onEvent, and supports consoleReporter() and jsonlReporter() out of the box (see the sketch after this list)
    • Custom providers: implement a five-field Provider interface to connect local model servers (Ollama, vLLM, llama.cpp), proprietary APIs, or mock providers for unit tests
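
    For programmatic use, here is a hedged sketch of what an SDK call can look like, based only on the names mentioned above (evaluateSkills, onEvent, consoleReporter). The exact option names and config shape are assumptions to verify against the project's docs:

        import { evaluateSkills, consoleReporter } from "agent-skills-eval";

        // Option names below mirror the CLI flags and YAML keys listed above;
        // treat this as a sketch, not the definitive config schema.
        const results = await evaluateSkills({
          root: "./skills",            // directory of skills to evaluate
          target: "gpt-4o-mini",       // model under test
          judge: "gpt-4o-mini",        // model that grades the outputs
          baseline: true,              // also run each eval without the skill
          concurrency: 4,              // evals run in parallel
          onEvent: consoleReporter(),  // stream progress events to the terminal
        });

        console.log(results);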

    agentskills.io Spec Compliance

    The library implements the full agentskills.io specification end to end, including strict SKILL.md YAML frontmatter validation (required name and description, lowercase-hyphenated name, parent-directory name match), the evals/evals.json schema, and the official iteration-N/<eval>/<mode>/ artifact layout. Beyond the spec, the SDK adds per-eval defaults, model params, tool definitions, deterministic tool_assertions, and a workspaceLayout: "flat" option for multi-skill dashboards.
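
    To make the validation rules concrete, here are hypothetical TypeScript shapes for the inputs the spec describes; any field not named on this page is an assumption:

        // SKILL.md YAML frontmatter fields the spec requires.
        interface SkillFrontmatter {
          name: string;        // required; lowercase-hyphenated; must match the parent directory name
          description: string; // required; what the skill teaches the agent
        }

        // One entry in evals/evals.json, as graded by the judge model.
        interface EvalCase {
          prompt: string;           // sent to the target model with and without the skill
          expected_output: string;  // the reference the judge grades against
          assertions: string[];     // individual checks the judge must evaluate and cite
        }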

    Platform and Compatibility

    agent-skills-eval is OpenAI-compatible by default and works with OpenAI, Together, Groq, Anthropic via OpenAI-compat layers, and local Llama servers — anything that speaks the OpenAI chat API. It requires Node.js (version specified in package.json) and is distributed via npm. Artifacts are plain JSON and JSONL, making them portable and easy to diff across runs or plug into custom dashboards.
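
    Because the artifacts are plain JSON, they can be compared with a few lines of code. The sketch below diffs the top-level fields of two benchmark.json files without assuming their exact schema; the paths are illustrative:

        import { readFileSync } from "node:fs";

        const before = JSON.parse(readFileSync("runs/old/iteration-1/benchmark.json", "utf8"));
        const after = JSON.parse(readFileSync("runs/new/iteration-1/benchmark.json", "utf8"));

        // Report top-level fields whose values changed between the two runs.
        for (const key of new Set([...Object.keys(before), ...Object.keys(after)])) {
          const a = JSON.stringify(before[key]);
          const b = JSON.stringify(after[key]);
          if (a !== b) console.log(`${key}: ${a ?? "absent"} -> ${b ?? "absent"}`);
        }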

    Current Status

    The repository was created in May 2026 and last updated on May 11, 2026. The project had accumulated 406 stars and 16 forks shortly after launch, with CI passing on the main branch. It is actively maintained under the MIT license with full documentation hosted on GitHub Pages at darkrishabh.github.io/agent-skills-eval.

    Pricing

    Open Source

    Fully free and open-source under the MIT license. Install via npm or run with npx.

    • Full CLI and TypeScript SDK
    • with_skill vs without_skill baseline comparison
    • Judge-graded outputs
    • Static HTML reports
    • Portable JSON/JSONL artifacts

    Capabilities

    Key Features

    • with_skill vs without_skill baseline comparison
    • Judge-graded outputs with cited assertions
    • TypeScript SDK and CLI
    • OpenAI-compatible provider support
    • Tool-call assertions for agent evals
    • Portable JSON and JSONL artifacts
    • Static HTML reports
    • YAML configuration file support
    • Custom provider interface
    • Concurrency control
    • agentskills.io spec compliance
    • SKILL.md frontmatter validation
    • Iteration-N artifact layout
    • JSONL event streaming
    • Per-eval grading.json and benchmark.json output

    Integrations

    OpenAI
    Anthropic (via OpenAI-compat)
    Together AI
    Groq
    Ollama
    vLLM
    llama.cpp
    agentskills.io
    API Available

    Developer

    darkrishabh

    darkrishabh builds open-source developer tooling for the AI agent ecosystem. The agent-skills-eval project provides a test runner for the agentskills.io specification, enabling empirical evaluation of Agent Skills via a TypeScript CLI and SDK. The project is MIT-licensed and published on npm.

    Website · GitHub
    1 tool in directory

    Similar Tools

    Ashr

    Ashr is an AI agent evaluation platform that mimics production environments and user behavior to catch agent failures before they reach real users.

    Giskard

    Automated testing platform for LLM agents that detects hallucinations, security vulnerabilities, and quality issues through continuous red teaming.

    mdarena

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    67 tools

    Agent Skill Registries

    Registries and directories that catalog agent capabilities, skills, and competencies, enabling discovery and composition of agent abilities across platforms.

    48 tools

    Automated Testing

    AI-powered platforms that automate end-to-end testing processes with intelligent test case generation, execution, and reporting for faster, more reliable software delivery.

    87 tools