# SkillOpt

> SkillOpt trains reusable natural-language skill documents for frozen LLM agents using trajectory-driven edits, validation-gated updates, and a deep-learning-inspired optimization loop.

SkillOpt is a Microsoft Research project that applies deep-learning optimization discipline to natural-language agent skills — without touching model weights. Published alongside an arXiv preprint (2605.23904), it is available as an open-source Python library under the MIT License on GitHub, where it has accumulated over 3,600 stars since its May 2026 release.

## What It Is

SkillOpt treats a compact Markdown skill document as the trainable state of a frozen language agent. Instead of fine-tuning model weights or hand-crafting prompts, it runs the frozen target model on scored task batches, asks a separate optimizer model to propose structured add/delete/replace edits, and accepts a candidate skill only when it strictly improves a held-out validation score. The deployed artifact is a single `best_skill.md` file — typically 300–2,000 tokens — that runs against the unchanged target model at zero additional inference-time cost.

## How the Optimization Loop Works

The loop deliberately mirrors a neural-network training algorithm:

- **Rollout**: The target model executes tasks with the current skill and records scored trajectories.
- **Reflect**: The optimizer model analyzes success and failure minibatches separately to find reusable procedures.
- **Edit**: Candidate add, delete, and replace operations are merged and ranked under a textual learning-rate budget, preventing destructive rewrites.
- **Gate**: The candidate skill is kept only if it improves held-out selection performance.
- **Slow update / meta skill**: Epoch-boundary updates and an optimizer-side memory buffer provide longer-horizon feedback without bloating deployment.

Rejected edits are buffered as negative feedback so the optimizer avoids repeating harmful directions. The paper reports ablations showing that each of these controls — bounded edits, gated validation, rejected-edit buffer, and slow update — contributes measurably to final performance.

## Benchmark Results and Transfer

According to the paper and project page, SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells across six benchmarks (SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA), seven target models, and three execution harnesses (direct chat, Codex CLI, Claude Code CLI). The paper reports that on GPT-5.5, SkillOpt lifts average no-skill accuracy by +23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code. The paper also demonstrates that optimized skill artifacts transfer across model scales, between Codex and Claude Code harnesses, and to nearby benchmarks without further optimization.

## Architecture and Deployment Model

The system separates the optimizer model (which proposes edits) from the target model (which executes tasks). This means a stronger optimizer can improve a weaker target, and even a matched target-as-optimizer setting can discover useful edits when updates are constrained, buffered, and validated. At deployment, the target model consumes only the final `best_skill.md` — no optimizer memory, no extra inference calls. The repository ships pretrained GPT-5.5 skill artifacts in `ckpt/` for direct evaluation without re-running training.

## Setup Path

SkillOpt requires Python 3.10+ and is installed via `pip install -e .`. It supports Azure OpenAI (recommended), OpenAI-compatible endpoints, Anthropic Claude, Qwen via local vLLM, and MiniMax. Configuration is YAML-based with a single `configs/_base_/default.yaml` as the source of truth; benchmark configs inherit from it. An optional Gradio-based WebUI monitoring dashboard is available via `pip install -e ".[webui]"`. The repository also includes an extensibility guide for adding new backends and benchmarks.

## Update: Initial Release (May 2026)

The repository was created on 2026-05-08 and last updated 2026-06-01. The paper (arXiv 2605.23904) was submitted in 2026. The project page notes that the first batch of pretrained skill artifacts has been uploaded to `ckpt/`, with remaining optimized skills and benchmark split manifests being cleaned and verified for future upload. A community-contributed optional soft-gate config (PR #25) has already been merged, indicating active early development.

## Features
- Text-space optimization for frozen LLM agents
- Trajectory-driven skill document editing (add/delete/replace)
- Held-out validation gating for candidate skill acceptance
- Textual learning-rate budget to prevent destructive rewrites
- Rejected-edit buffer for negative feedback memory
- Epoch-wise slow update and meta-skill for longer-horizon feedback
- Deployable best_skill.md artifact with zero inference-time overhead
- Pretrained GPT-5.5 skill artifacts for direct evaluation
- Support for Azure OpenAI, OpenAI-compatible, Anthropic Claude, Qwen, and MiniMax backends
- Three execution harnesses: direct chat, Codex CLI, Claude Code CLI
- Six supported benchmarks: SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA
- Optional Gradio-based WebUI monitoring dashboard
- Extensible backend and benchmark plugin architecture
- Auto-resume from last completed training step
- Cross-model and cross-harness skill transfer

## Integrations
Azure OpenAI, OpenAI API, Anthropic Claude, Qwen (vLLM), MiniMax, Codex CLI, Claude Code CLI, ALFWorld, SearchQA, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA, Gradio

## Platforms
WEB, API, DEVELOPER_SDK, CLI

## Pricing
Open Source

## Version
main

## Links
- Website: https://aka.ms/skillopt
- Documentation: https://github.com/microsoft/SkillOpt/tree/main/docs
- Repository: https://github.com/microsoft/SkillOpt
- EveryDev.ai: https://www.everydev.ai/tools/skillopt