
    agent-skills-eval

    LLM Evaluations

    A test runner for Agent Skills that evaluates whether your SKILL.md actually improves model performance by running evals with and without the skill loaded.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT license. Install via npm or run with npx.

    Available On

    Web
    API
    SDK
    CLI

    Resources

    Website · Docs · GitHub · llms.txt

    Topics

    LLM Evaluations · Agent Skill Registries · Automated Testing

    Alternatives

    Ashr · Giskard · mdarena
    Developer
    darkrishabh
    darkrishabh builds open-source developer tooling for the AI agent ecosystem.

    Listed May 2026

    About agent-skills-eval

    agent-skills-eval is an open-source TypeScript CLI and SDK that brings empirical testing to the Agent Skills ecosystem. It runs every eval twice — once with the skill loaded into context and once without — then uses a judge model to grade both outputs side by side, giving developers concrete evidence of whether a skill improves model behavior. The project is MIT-licensed and published on npm under the agent-skills-eval package name.

    What It Is

    agent-skills-eval is a test framework purpose-built for the agentskills.io specification, which defines a standard for giving AI agents domain knowledge via SKILL.md files. The tool fills the gap between writing a skill and knowing whether it works: it automates the with_skill vs without_skill comparison, judge-grades the outputs against declared assertions, and produces portable JSON artifacts plus a static HTML report. It is framework-agnostic and works with any OpenAI-compatible API endpoint.

    How the Evaluation Loop Works

    The core mental model is a controlled A/B test per eval:

    • The same prompt is sent to the target model twice — once with the SKILL.md injected into context, once without (baseline)
    • A configurable judge model scores both outputs against the eval's expected_output and assertions
    • Results are written to a structured iteration-N/ workspace with benchmark.json, per-eval grading.json, and timing data
    • A static HTML report is generated showing pass rates, assertion-level judge reasoning, full outputs side by side, and token/latency metrics

    The --baseline flag enables the comparison run; omitting it produces only the with_skill result.
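
    As a rough illustration of that loop, here is a minimal TypeScript sketch. It is not the tool's internal API: the chat and judge functions stand in for any OpenAI-compatible calls, and the field names mirror the artifacts described above.

        // Illustrative only; agent-skills-eval's real runner also records timing and token usage.
        type Judgement = { passed: boolean; reasoning: string };

        async function runEval(
          chat: (skill: string | null, prompt: string) => Promise<string>,
          judge: (output: string, expected: string, assertions: string[]) => Promise<Judgement[]>,
          skillMd: string,
          evalCase: { prompt: string; expected_output: string; assertions: string[] },
        ) {
          // Same prompt, two runs: skill injected into context vs. plain baseline.
          const withSkill = await chat(skillMd, evalCase.prompt);
          const withoutSkill = await chat(null, evalCase.prompt);

          // A judge model grades each output against the declared expectations.
          return {
            with_skill: await judge(withSkill, evalCase.expected_output, evalCase.assertions),
            without_skill: await judge(withoutSkill, evalCase.expected_output, evalCase.assertions),
          };
        }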

    CLI and SDK Surface

    The tool ships both a one-liner CLI and a full TypeScript SDK for programmatic use:

    • CLI: npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict
    • YAML config: supports root, workspace, concurrency, include/exclude globs, logging format (pretty, jsonl, silent), and report output path
    • TypeScript SDK: evaluateSkills() accepts typed config, streams events via onEvent, and supports consoleReporter() and jsonlReporter() out of the box (see the sketch after this list)
    • Custom providers: implement a five-field Provider interface to connect local model servers (Ollama, vLLM, llama.cpp), proprietary APIs, or mock providers for unit tests
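
    For programmatic use, here is a hedged sketch of what an SDK call can look like, based only on the names mentioned above (evaluateSkills, onEvent, consoleReporter). The exact option names and config shape are assumptions to verify against the project's docs:

        import { evaluateSkills, consoleReporter } from "agent-skills-eval";

        // Option names below mirror the CLI flags and YAML keys listed above;
        // treat this as a sketch, not the definitive config schema.
        const results = await evaluateSkills({
          root: "./skills",            // directory of skills to evaluate
          target: "gpt-4o-mini",       // model under test
          judge: "gpt-4o-mini",        // model that grades the outputs
          baseline: true,              // also run each eval without the skill
          concurrency: 4,              // evals run in parallel
          onEvent: consoleReporter(),  // stream progress events to the terminal
        });

        console.log(results);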

    agentskills.io Spec Compliance

    The library implements the full agentskills.io specification end to end, including strict SKILL.md YAML frontmatter validation (required name and description, lowercase-hyphenated name, parent-directory name match), the evals/evals.json schema, and the official iteration-N/<eval>/<mode>/ artifact layout. Beyond the spec, the SDK adds per-eval defaults, model params, tool definitions, deterministic tool_assertions, and a workspaceLayout: "flat" option for multi-skill dashboards.
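
    To make the validation rules concrete, here are hypothetical TypeScript shapes for the inputs the spec describes; any field not named on this page is an assumption:

        // SKILL.md YAML frontmatter fields the spec requires.
        interface SkillFrontmatter {
          name: string;        // required; lowercase-hyphenated; must match the parent directory name
          description: string; // required; what the skill teaches the agent
        }

        // One entry in evals/evals.json, as graded by the judge model.
        interface EvalCase {
          prompt: string;           // sent to the target model with and without the skill
          expected_output: string;  // the reference the judge grades against
          assertions: string[];     // individual checks the judge must evaluate and cite
        }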

    Platform and Compatibility

    agent-skills-eval is OpenAI-compatible by default and works with OpenAI, Together, Groq, Anthropic via OpenAI-compat layers, and local Llama servers — anything that speaks the OpenAI chat API. It requires Node.js (version specified in package.json) and is distributed via npm. Artifacts are plain JSON and JSONL, making them portable and easy to diff across runs or plug into custom dashboards.
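
    Because the artifacts are plain JSON, they can be compared with a few lines of code. The sketch below diffs the top-level fields of two benchmark.json files without assuming their exact schema; the paths are illustrative:

        import { readFileSync } from "node:fs";

        const before = JSON.parse(readFileSync("runs/old/iteration-1/benchmark.json", "utf8"));
        const after = JSON.parse(readFileSync("runs/new/iteration-1/benchmark.json", "utf8"));

        // Report top-level fields whose values changed between the two runs.
        for (const key of new Set([...Object.keys(before), ...Object.keys(after)])) {
          const a = JSON.stringify(before[key]);
          const b = JSON.stringify(after[key]);
          if (a !== b) console.log(`${key}: ${a ?? "absent"} -> ${b ?? "absent"}`);
        }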

    Current Status

    The repository was created in May 2026 and last updated on May 11, 2026. The project had accumulated 406 stars and 16 forks shortly after launch, with CI passing on the main branch. It is actively maintained under the MIT license with full documentation hosted on GitHub Pages at darkrishabh.github.io/agent-skills-eval.

    Pricing

    Open Source

    Fully free and open-source under the MIT license. Install via npm or run with npx.

    • Full CLI and TypeScript SDK
    • with_skill vs without_skill baseline comparison
    • Judge-graded outputs
    • Static HTML reports
    • Portable JSON/JSONL artifacts

    Capabilities

    Key Features

    • with_skill vs without_skill baseline comparison
    • Judge-graded outputs with cited assertions
    • TypeScript SDK and CLI
    • OpenAI-compatible provider support
    • Tool-call assertions for agent evals
    • Portable JSON and JSONL artifacts
    • Static HTML reports
    • YAML configuration file support
    • Custom provider interface
    • Concurrency control
    • agentskills.io spec compliance
    • SKILL.md frontmatter validation
    • Iteration-N artifact layout
    • JSONL event streaming
    • Per-eval grading.json and benchmark.json output

    Integrations

    OpenAI
    Anthropic (via OpenAI-compat)
    Together AI
    Groq
    Ollama
    vLLM
    llama.cpp
    agentskills.io
    API Available

    Developer

    darkrishabh

    darkrishabh builds open-source developer tooling for the AI agent ecosystem. The agent-skills-eval project provides a test runner for the agentskills.io specification, enabling empirical evaluation of Agent Skills via a TypeScript CLI and SDK. The project is MIT-licensed and published on npm.

    Website · GitHub
    1 tool in directory

    Similar Tools

    Ashr

    Ashr is an AI agent evaluation platform that mimics production environments and user behavior to catch agent failures before they reach real users.

    Giskard

    Automated testing platform for LLM agents that detects hallucinations, security vulnerabilities, and quality issues through continuous red teaming.

    mdarena

    Benchmark your CLAUDE.md files against real merged PRs to measure whether your AI agent context files help or hurt performance and token costs.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    67 tools

    Agent Skill Registries

    Registries and directories that catalog agent capabilities, skills, and competencies, enabling discovery and composition of agent abilities across platforms.

    48 tools

    Automated Testing

    AI-powered platforms that automate end-to-end testing processes with intelligent test case generation, execution, and reporting for faster, more reliable software delivery.

    87 tools