# Inspect AI

> An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.

Inspect is an open-source Python framework for large language model (LLM) evaluations, developed by the UK AI Security Institute (AISI) and Meridian Labs. It is available on GitHub under the MIT License and installable from PyPI (`pip install inspect-ai`). The framework targets a broad range of evaluation types (coding, agentic tasks, reasoning, knowledge, behavior, and multimodal understanding) and ships with over 200 pre-built evaluations ready to run against any supported model.

## What It Is

Inspect is a structured evaluation framework that organizes LLM assessments around three composable primitives: **Datasets** (labelled input/target samples), **Solvers** (chained prompt engineering and agent logic), and **Scorers** (output evaluation via text comparison, model grading, or custom schemes). This architecture lets researchers and engineers define reusable evaluation components and combine them into reproducible tasks. The `@task` decorator and `inspect eval` CLI command make it straightforward to run evaluations against any supported model provider from the command line or directly from Python.
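
The way the three primitives compose can be sketched in plain Python. This is a conceptual illustration only: it does not use Inspect's own classes, and every name in it (`Sample`, `State`, `add_instruction`, `generate`, `includes`, `run_task`, `stub_model`) is hypothetical and chosen for this sketch. It assumes a solver is a function that transforms a running state, and a scorer is a function that compares the final output with the sample's target.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: hypothetical names for illustration; these are not Inspect's classes.


@dataclass
class Sample:
    input: str   # prompt shown to the model
    target: str  # labelled expected answer


@dataclass
class State:
    prompt: str
    output: str = ""


Solver = Callable[[State], State]
Scorer = Callable[[State, str], bool]


def add_instruction(text: str) -> Solver:
    """Solver step that prepends an instruction to the prompt."""
    def solve(state: State) -> State:
        return State(prompt=f"{text}\n{state.prompt}", output=state.output)
    return solve


def generate(model: Callable[[str], str]) -> Solver:
    """Solver step that calls the model and records its output."""
    def solve(state: State) -> State:
        return State(prompt=state.prompt, output=model(state.prompt))
    return solve


def includes(state: State, target: str) -> bool:
    """Scorer: passes when the target text appears in the output."""
    return target in state.output


def run_task(dataset: List[Sample], solvers: List[Solver], scorer: Scorer) -> float:
    """Run every sample through the solver chain and return the accuracy."""
    correct = 0
    for sample in dataset:
        state = State(prompt=sample.input)
        for solver in solvers:  # solvers are applied in order
            state = solver(state)
        correct += scorer(state, sample.target)
    return correct / len(dataset)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: answers one question, gives up otherwise."""
    return "The answer is 4." if "2 + 2" in prompt else "I don't know."


dataset = [
    Sample("What is 2 + 2?", "4"),
    Sample("What is the capital of France?", "Paris"),
]
solvers = [add_instruction("Answer briefly."), generate(stub_model)]
accuracy = run_task(dataset, solvers, includes)
print(accuracy)
```

Because each piece is an ordinary value, the same dataset can be rerun with a different solver chain or scorer without touching the other two, which is the reuse the architecture is aiming for.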

## Model Provider Coverage

Inspect supports a wide range of model providers out of the box:
- **Cloud APIs**: OpenAI, Anthropic, Google (Gemini), Grok, Mistral, AWS Bedrock, Azure AI, TogetherAI, Groq, Cloudflare, Goodfire
- **Local inference**: vLLM, Ollama, llama-cpp-python, TransformerLens, nnterp, Hugging Face Transformers

Each provider is configured by installing the relevant Python package and, for hosted APIs, setting the appropriate API key environment variable, which keeps the setup path consistent across providers.
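
The configuration step above can be sanity-checked before a run. The sketch below assumes models are named in a `provider/model` form and that each hosted provider reads its key from a variable such as `OPENAI_API_KEY`. The mapping and the `missing_api_key` helper are hypothetical, not part of Inspect, and the variable names should be confirmed against each provider's documentation.

```python
import os
from typing import Mapping, Optional

# Assumed mapping for illustration only; confirm names in the provider docs.
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def missing_api_key(
    model: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of the unset key variable for `model`, or None."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]  # text before the first slash
    var = API_KEY_VARS.get(provider)
    if var is None:
        return None  # local or unlisted provider: nothing to check here
    return None if env.get(var) else var


# "example-model" is a placeholder, not a real model name.
print(missing_api_key("anthropic/example-model", env={}))
```

Passing `env` explicitly keeps the check easy to test; in normal use the helper falls back to `os.environ`.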

## Agentic and Tool Evaluation Capabilities

Inspect includes flexible support for evaluating agents and tool-using models:
- Built-in tools for bash execution, Python execution, text editing, web search, web browsing, and computer use
- Custom tool definitions and MCP (Model Context Protocol) tool integration
- Multi-agent primitives and support for running external agents such as Claude Code, Codex CLI, and Gemini CLI
- A sandboxing system for isolating untrusted model-generated code, with backends for Docker, Kubernetes, Modal, Proxmox, and a custom extension API
- Tool approval policies for fine-grained control over which tool calls models are permitted to make
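
The idea behind a tool approval policy can be illustrated with a small rule table. This is a sketch under stated assumptions, not Inspect's policy format: the `POLICY` structure, the decision labels, and the `decide` function are hypothetical, and the tool names are used only as examples. Rules are checked in order and the first matching glob pattern wins.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical policy: ordered (glob pattern, decision) rules, first match wins.
POLICY: List[Tuple[str, str]] = [
    ("web_*", "approve"),   # allow web tools without review
    ("bash", "escalate"),   # e.g. hand the call to a human reviewer
    ("*", "reject"),        # default: deny anything not listed above
]


def decide(tool_name: str, policy: List[Tuple[str, str]] = POLICY) -> str:
    """Return the decision of the first rule whose pattern matches the tool."""
    for pattern, decision in policy:
        if fnmatchcase(tool_name, pattern):
            return decision
    return "reject"  # fail closed when no rule matches


for name in ("web_search", "bash", "python"):
    print(name, decide(name))
```

Ending the table with a catch-all `reject` rule, and rejecting again when nothing matches, keeps the policy fail-closed: a tool added later is denied until a rule explicitly allows it.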

## Tooling and Developer Experience

Beyond the core evaluation engine, Inspect ships with a web-based **Inspect View** log viewer for monitoring and visualizing evaluation runs, and a **VS Code Extension** for authoring, debugging, and browsing logs directly in the editor. Evaluation logs are written locally by default and can be explored via `inspect view` in the browser. The framework also exposes a Python API (`eval()`) for programmatic use alongside the CLI, and supports structured output, reasoning model options, batch processing, adaptive concurrency, and early stopping.

## Open-Source Lineage and Current Status

The repository was created in November 2023 and, according to the GitHub project page, was last updated in May 2026. It has accumulated over 2,100 stars and 517 forks. The project is maintained under the `UKGovernmentBEIS` GitHub organization and is released under the MIT License, so it is free to use, modify, and distribute. The documentation site at `inspect.aisi.org.uk` is maintained alongside the codebase, and the project supports a `uv` workflow for reproducible development environments.

## Features
- 200+ pre-built LLM evaluations
- Composable Datasets, Solvers, and Scorers
- Built-in prompt engineering solvers (chain-of-thought, self-critique)
- Model-graded scoring
- Multi-turn dialog support
- Tool calling (bash, Python, text editing, web search, web browsing, computer use)
- MCP (Model Context Protocol) tool integration
- Custom tool definitions
- Multi-agent evaluation primitives
- Support for external agents (Claude Code, Codex CLI, Gemini CLI)
- Sandboxing via Docker, Kubernetes, Modal, Proxmox
- Tool approval policies
- Web-based Inspect View log viewer
- VS Code Extension for authoring and debugging
- CLI and Python API
- Structured output support
- Reasoning model support
- Batch processing mode
- Adaptive concurrency and rate-limit handling
- Multimodal evaluation (images, audio, video)
- Eval Sets for large-scale evaluation runs
- Early stopping API
- Caching of model outputs
- Extensions API for custom model providers, sandboxes, and storage

## Integrations
OpenAI, Anthropic, Google Gemini, Grok, Mistral, Hugging Face Transformers, AWS Bedrock, Azure AI, TogetherAI, Groq, Cloudflare, Goodfire, vLLM, Ollama, llama-cpp-python, TransformerLens, nnterp, Docker, Kubernetes, Modal, Proxmox, Model Context Protocol (MCP), Claude Code, Codex CLI, Gemini CLI, OpenAI Agents SDK, LangChain, Pydantic AI, VS Code

## Platforms
Linux, API, VS Code extension, developer SDK, CLI

## Pricing
Open Source

## Links
- Website: https://inspect.aisi.org.uk/
- Documentation: https://inspect.aisi.org.uk/
- Repository: https://github.com/UKGovernmentBEIS/inspect_ai
- EveryDev.ai: https://www.everydev.ai/tools/inspect-ai
