# agent-skills-eval

> A test runner for Agent Skills that evaluates whether your SKILL.md actually improves model performance by running evals with and without the skill loaded.

agent-skills-eval is an open-source TypeScript CLI and SDK that brings empirical testing to the Agent Skills ecosystem. It runs every eval twice — once with the skill loaded into context and once without — then uses a judge model to grade both outputs side by side, giving developers concrete evidence of whether a skill improves model behavior. The project is MIT-licensed and published on npm under the `agent-skills-eval` package name.

## What It Is

agent-skills-eval is a test framework purpose-built for the [agentskills.io](https://agentskills.io) specification, which defines a standard for giving AI agents domain knowledge via `SKILL.md` files. The tool fills the gap between writing a skill and knowing whether it works: it automates the `with_skill` vs `without_skill` comparison, judge-grades the outputs against declared assertions, and produces portable JSON artifacts plus a static HTML report. It is framework-agnostic and works with any OpenAI-compatible API endpoint.
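The frontmatter rules (detailed under Spec Compliance below) make a minimal skill easy to sketch. The following `SKILL.md` is purely hypothetical, assuming it lives at `skills/sql-query-review/SKILL.md` so that `name` matches its parent directory; only `name` and `description` are required, and `name` must be lowercase-hyphenated:

```markdown
---
name: sql-query-review
description: Guidance for reviewing SQL queries for correctness and injection risks.
---

# SQL Query Review

When reviewing a query, check for unparameterized user input, missing
indexes on filter columns, and implicit type coercion in joins.
```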

## How the Evaluation Loop Works

The core mental model is a controlled A/B test per eval:

- The same prompt is sent to the target model twice — once with the `SKILL.md` injected into context, once without (baseline)
- A configurable judge model scores both outputs against the eval's `expected_output` and `assertions`
- Results are written to a structured `iteration-N/` workspace with `benchmark.json`, per-eval `grading.json`, and timing data
- A static HTML report is generated showing pass rates, assertion-level judge reasoning, full outputs side by side, and token/latency metrics

The `--baseline` flag enables the comparison run; omitting it produces only the `with_skill` result.
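The prompts and grading criteria that drive this loop are declared in the spec's `evals/evals.json` file (see Spec Compliance below). The entry here is an illustrative sketch rather than the authoritative schema: `expected_output` and `assertions` are the keys named above, while the `name` and `prompt` keys and the top-level array shape are assumptions:

```json
[
  {
    "name": "flags-sql-injection",
    "prompt": "Review this query for safety: SELECT * FROM users WHERE id = ${userId}",
    "expected_output": "Identifies the interpolated userId as an injection risk",
    "assertions": [
      "Mentions SQL injection explicitly",
      "Recommends parameterized queries"
    ]
  }
]
```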

## CLI and SDK Surface

The tool ships both a one-liner CLI and a full TypeScript SDK for programmatic use (illustrative config and SDK sketches follow the list):

- **CLI**: `npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict`
- **YAML config**: supports `root`, `workspace`, `concurrency`, `include`/`exclude` globs, logging format (`pretty`, `jsonl`, `silent`), and report output path
- **TypeScript SDK**: `evaluateSkills()` accepts typed config, streams events via `onEvent`, and supports `consoleReporter()` and `jsonlReporter()` out of the box
- **Custom providers**: implement a five-field `Provider` interface to connect local model servers (Ollama, vLLM, llama.cpp), proprietary APIs, or mock providers for unit tests
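Taken together, a minimal setup might look like the two sketches below. First, the config file; the documented keys are `root`, `workspace`, `concurrency`, `include`/`exclude`, the logging format, and the report output path, but the file name and the `logging`/`report` key spellings here are assumptions:

```yaml
# eval.config.yaml (assumed file name)
root: ./skills          # directory containing skill folders
workspace: ./eval-out   # where iteration-N artifacts are written
concurrency: 4
include:
  - "sql-*"
exclude:
  - "wip-*"
logging: pretty         # pretty | jsonl | silent (key spelling assumed)
report: ./eval-out/report.html   # key spelling assumed
```

Second, programmatic use via the SDK. `evaluateSkills()`, `onEvent`, and `consoleReporter()` are the documented entry points; how a custom provider and the baseline toggle are wired into the config is assumed here, and since the five `Provider` fields are not named in this document, the field names below are placeholders:

```typescript
import { evaluateSkills, consoleReporter } from "agent-skills-eval";

// Hypothetical mock provider for unit tests. The document confirms a
// five-field Provider interface but does not name the fields, so the
// field names below are assumptions, not the library's actual API.
const mockProvider = {
  id: "mock",
  name: "Mock Provider",
  baseUrl: "http://localhost:11434/v1", // e.g. a local Ollama server
  apiKey: "unused",
  complete: async (prompt: string) => `canned answer for: ${prompt}`,
};

const results = await evaluateSkills({
  root: "./skills",            // documented config key
  workspace: "./eval-out",     // documented config key
  baseline: true,              // assumed SDK analog of the --baseline flag
  provider: mockProvider,      // assumed key for plugging in a Provider
  onEvent: consoleReporter(),  // documented event stream + reporter helper
});

console.log(results);
```

A real provider would forward `complete` calls to a local server such as Ollama or vLLM instead of returning canned text.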

## agentskills.io Spec Compliance

The library implements the full agentskills.io specification end to end, including strict `SKILL.md` YAML frontmatter validation (required `name` and `description`, lowercase-hyphenated format, parent-directory name match), the `evals/evals.json` schema, and the official `iteration-N/<eval>/<mode>/` artifact layout. Beyond the spec, the SDK adds per-eval `defaults`, model `params`, tool definitions, deterministic `tool_assertions`, and a `workspaceLayout: "flat"` option that flattens output for multi-skill dashboards.
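Under that layout, a run with `--baseline` over the hypothetical eval sketched earlier would produce roughly the following tree; the exact placement of `benchmark.json` and `grading.json` within it is inferred from the artifact names above rather than confirmed:

```text
eval-out/
└── iteration-1/
    ├── benchmark.json              # run-level summary and timing data
    └── flags-sql-injection/        # one directory per eval
        ├── grading.json            # judge verdicts per assertion
        ├── with_skill/             # output with SKILL.md in context
        └── without_skill/          # baseline output (--baseline runs only)
```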

## Platform and Compatibility

agent-skills-eval is OpenAI-compatible by default: anything that speaks the OpenAI chat API works, including OpenAI, Together, Groq, Anthropic (via OpenAI-compat layers), and local model servers such as Ollama, vLLM, and llama.cpp. It requires Node.js (version specified in `package.json`) and is distributed via npm. Artifacts are plain JSON and JSONL, making them portable and easy to diff across runs or plug into custom dashboards.

## Current Status

The repository was created in May 2026 and last updated on May 11, 2026. The project had accumulated 406 stars and 16 forks shortly after launch, with CI passing on the main branch. It is actively maintained under the MIT license with full documentation hosted on GitHub Pages at `darkrishabh.github.io/agent-skills-eval`.

## Features
- `with_skill` vs `without_skill` baseline comparison
- Judge-graded outputs with cited assertions
- TypeScript SDK and CLI
- OpenAI-compatible provider support
- Tool-call assertions for agent evals
- Portable JSON and JSONL artifacts
- Static HTML reports
- YAML configuration file support
- Custom provider interface
- Concurrency control
- agentskills.io spec compliance
- SKILL.md frontmatter validation
- Iteration-N artifact layout
- JSONL event streaming
- Per-eval grading.json and benchmark.json output

## Integrations
OpenAI, Anthropic (via OpenAI-compat), Together AI, Groq, Ollama, vLLM, llama.cpp, agentskills.io

## Platforms
Web, API, Developer SDK, CLI

## Pricing
Open Source

## Links
- Website: https://darkrishabh.github.io/agent-skills-eval/
- Documentation: https://darkrishabh.github.io/agent-skills-eval/
- Repository: https://github.com/darkrishabh/agent-skills-eval
- EveryDev.ai: https://www.everydev.ai/tools/agent-skills-eval
