    SkillsBench

    LLM Evaluations

    An open-source evaluation framework that benchmarks how well AI agent skills work across diverse, expert-curated tasks in high-GDP-value domains.


    At a Glance

    Pricing
    Open Source

    Free and open source under the MIT License

    Available On

    Web
    API

    Resources

    Website
    Docs
    GitHub
    llms.txt

    Topics

    LLM Evaluations
    AI Infrastructure
    Academic Research

    Alternatives

    MLCommons
    ZeroEval
    llmfit

    Developer

    BenchFlow AI

    Listed Feb 2026

    About SkillsBench

    SkillsBench is the first evaluation framework designed to measure how AI agent skills perform across diverse, expert-curated tasks spanning high-GDP-value domains. It provides a structured approach to benchmarking AI agents by evaluating them across three abstraction layers that mirror traditional computing systems: Skills, Agent Harness, and Models.

    The framework enables researchers and developers to understand how domain-specific capabilities and workflows extend agent functionality, similar to how applications work on an operating system. SkillsBench includes a comprehensive task registry with 84 tasks across multiple domains including engineering, research, security, data visualization, and more.
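
    A minimal sketch of how these three layers might be represented in code is shown below; the class and field names are illustrative assumptions, not the actual SkillsBench API.

    from dataclasses import dataclass, field

    # Illustrative only: these names and fields are assumptions,
    # not the actual SkillsBench API.

    @dataclass
    class Skill:
        """Skills layer: a domain-specific capability exposed to the agent."""
        name: str                                   # e.g. "bgp-routing", "seismology"
        instructions: str                           # documentation the agent can read
        resources: list[str] = field(default_factory=list)  # bundled scripts or data

    @dataclass
    class Harness:
        """Agent Harness layer: the execution environment that drives the agent."""
        agent: str                                  # e.g. "claude-code", "gemini-cli", "codex"
        timeout_seconds: int = 1800

    @dataclass
    class Model:
        """Models layer: the foundation model the harness calls."""
        model_id: str                               # e.g. a GPT or Gemini model identifier

    @dataclass
    class EvaluationConfig:
        """One leaderboard configuration: a (skills, harness, model) combination."""
        skills: list[Skill]
        harness: Harness
        model: Model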

    • Three-Layer Evaluation Architecture provides a systematic approach to benchmarking AI agents across Skills (domain-specific capabilities), Agent Harness (execution environment), and Models (foundational AI models) layers.

    • Comprehensive Task Registry includes 84 expert-curated tasks spanning diverse domains such as 3D geometry, control systems, BGP routing, citation verification, game mechanics, legal document processing, materials science, and seismology.

    • Agent Performance Leaderboard tracks pass rates across multiple agent-model configurations with detailed metrics including confidence intervals and normalized gain calculations.

    • Skills Impact Measurement quantifies the improvement in agent performance when running with domain-specific skills versus without them, showing gains of up to +23.3% in pass rates (see the sketch after this list).

    • Open Source Framework released under the MIT License, allowing the community to contribute tasks, evaluate agents, and extend the benchmark.

    • Multiple Agent Support evaluates various agent-model combinations including Gemini CLI, Claude Code, and Codex with different underlying models.
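
    The normalized gain mentioned above can be read as the share of the remaining headroom that skills close over the baseline. The sketch below assumes the standard normalized-gain definition and uses illustrative numbers; the exact formula and figures SkillsBench reports may differ.

    def normalized_gain(baseline_pass_rate: float, skilled_pass_rate: float) -> float:
        """Fraction of the headroom above the baseline that skills recover.

        Assumes the standard normalized-gain definition; the exact formula
        SkillsBench uses may differ.
        """
        headroom = 1.0 - baseline_pass_rate
        if headroom <= 0.0:
            return 0.0
        return (skilled_pass_rate - baseline_pass_rate) / headroom

    # Illustrative numbers: a raw gain of +23.3 percentage points over a 40% baseline
    print(normalized_gain(0.40, 0.633))  # ~0.39, i.e. about 39% of the headroom closed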

    To get started with SkillsBench, visit the documentation to learn how to run evaluations on your coding agent's ability to use domain-specific skills. The framework supports community contributions, allowing developers to add new tasks to expand the benchmark's coverage across additional domains and use cases.
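
    As an illustration of what a community-contributed task registry entry might declare, the sketch below uses fields implied by the feature list (domain tags, difficulty level, targeted skills); the field names are assumptions, not the actual SkillsBench task format.

    # Hypothetical registry entry for a contributed task; field names are
    # assumptions based on the feature list, not the actual SkillsBench format.
    new_task = {
        "id": "citation-verification-001",       # illustrative identifier
        "domain": "research",                    # domain tag
        "difficulty": "medium",                  # difficulty level
        "skills": ["citation-verification"],     # skills the task is meant to exercise
        "description": "Verify that each citation in a draft resolves to a real source.",
    }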

    Pricing

    Open Source

    Free and open source under the MIT License

    • Full access to evaluation framework
    • 84 expert-curated tasks
    • Agent performance leaderboard
    • Community contribution support
    • MIT License

    Capabilities

    Key Features

    • Three-layer evaluation architecture (Skills, Agent Harness, Models)
    • 84 expert-curated tasks across diverse domains
    • Agent performance leaderboard with confidence intervals
    • Skills impact measurement and normalized gain calculation
    • Task registry with difficulty levels and domain tags
    • Sample trajectory visualization
    • Community contribution support
    • Open source under the MIT License

    Integrations

    Gemini CLI
    Claude Code
    Codex
    GPT models
    Gemini models
    API Available

    Developer

    BenchFlow AI

    BenchFlow AI develops SkillsBench, an open-source evaluation framework for benchmarking AI agent skills across diverse, expert-curated tasks. The team focuses on creating systematic approaches to measure how domain-specific capabilities improve agent performance in high-GDP-value domains. The project is community-driven and released under the MIT License.

    1 tool in directory

    Similar Tools

    MLCommons

    An open AI engineering consortium that builds industry-standard benchmarks and datasets to measure and improve AI accuracy, safety, speed, and efficiency.

    ZeroEval

    Open-source evaluation framework for testing large language models with zero-shot prompting on reasoning and coding tasks.

    llmfit

    LLMFit is an open-source CLI tool for benchmarking and evaluating the performance of large language models across various tasks.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    51 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    174 tools

    Academic Research

    AI tools designed specifically for academic and scientific research.

    28 tools