Inspect AI
An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.
About Inspect AI
Inspect is an open-source Python framework for large language model (LLM) evaluations, developed by the UK AI Security Institute (AISI) and Meridian Labs. It is available on GitHub under the MIT License and installable via PyPI. The framework targets a broad range of evaluation types—coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding—and ships with over 200 pre-built evaluations ready to run against any supported model.
What It Is
Inspect is a structured evaluation framework that organizes LLM assessments around three composable primitives: Datasets (labelled input/target samples), Solvers (chained prompt engineering and agent logic), and Scorers (output evaluation via text comparison, model grading, or custom schemes). This architecture lets researchers and engineers define reusable evaluation components and combine them into reproducible tasks. The @task decorator and inspect eval CLI command make it straightforward to run evaluations against any supported model provider from the command line or directly from Python.
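The way the three primitives fit together can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Inspect's API: ToySample, add_instruction, fake_model, exact_match, and run_task are all hypothetical names, and the "model" is a stand-in that adds two integers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToySample:
    """One labelled dataset row (hypothetical, not Inspect's Sample class)."""
    input: str   # prompt shown to the model
    target: str  # expected answer used by the scorer


# A solver step transforms the working text; steps are chained in order.
SolverStep = Callable[[str], str]
# A scorer compares the final output with the target.
Scorer = Callable[[str, str], bool]


def add_instruction(prompt: str) -> str:
    """Prompt-engineering step: prepend an instruction line."""
    return "Answer with a number only.\n" + prompt


def fake_model(prompt: str) -> str:
    """Stand-in for a model call: evaluates 'a + b' on the last prompt line."""
    question = prompt.splitlines()[-1]
    left, right = question.split("+")
    return str(int(left) + int(right))


def exact_match(output: str, target: str) -> bool:
    """Text-comparison scorer: exact match after trimming whitespace."""
    return output.strip() == target.strip()


def run_task(dataset: List[ToySample], solvers: List[SolverStep], scorer: Scorer) -> float:
    """Run every sample through the solver chain and return accuracy."""
    correct = 0
    for sample in dataset:
        state = sample.input
        for step in solvers:
            state = step(state)
        correct += scorer(state, sample.target)
    return correct / len(dataset)


dataset = [
    ToySample("2 + 2", "4"),
    ToySample("10 + 5", "15"),
    ToySample("1 + 1", "3"),  # deliberately wrong target, so this one fails
]
accuracy = run_task(dataset, [add_instruction, fake_model], exact_match)
print(f"accuracy: {accuracy:.2f}")
```

Because each piece is an independent callable, swapping the scorer or adding another solver step leaves the rest of the task untouched, which is the property the Datasets, Solvers, and Scorers split is meant to provide.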
Model Provider Coverage
Inspect supports a wide range of model providers out of the box:
- Cloud APIs: OpenAI, Anthropic, Google (Gemini), Grok, Mistral, AWS Bedrock, Azure AI, TogetherAI, Groq, Cloudflare, Goodfire
- Local inference: vLLM, Ollama, llama-cpp-python, TransformerLens, nnterp, Hugging Face Transformers
Each provider is configured by installing the relevant Python package and, for hosted APIs, setting the appropriate API key environment variable, which keeps the setup path consistent across providers.
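A quick pre-flight check for the API key step might look like the sketch below. It assumes models are named in a provider/model form (for example openai/gpt-4o) and that the environment variable names in the table are the ones each provider expects; both are assumptions to confirm against the provider documentation. The missing_key helper is hypothetical and not part of Inspect.

```python
import os
from typing import Mapping, Optional

# Assumed provider -> environment variable mapping (verify before relying on it).
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def missing_key(model: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset API key variable for a model, or None.

    Providers absent from the table (for example local backends) are
    treated as needing no key.
    """
    provider = model.split("/", 1)[0]
    var = API_KEY_VARS.get(provider)
    if var is None or env.get(var):
        return None
    return var


if __name__ == "__main__":
    needed = missing_key("openai/gpt-4o")
    if needed:
        print(f"Set {needed} before running the evaluation.")
```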
Agentic and Tool Evaluation Capabilities
Inspect includes flexible support for evaluating agents and tool-using models:
- Built-in tools for bash execution, Python execution, text editing, web search, web browsing, and computer use
- Custom tool definitions and MCP (Model Context Protocol) tool integration
- Multi-agent primitives and support for running external agents such as Claude Code, Codex CLI, and Gemini CLI
- A sandboxing system for isolating untrusted model-generated code, with backends for Docker, Kubernetes, Modal, Proxmox, and a custom extension API
- Tool approval policies for fine-grained control over which tool calls models are permitted to make
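The idea behind a tool approval policy can be sketched as an ordered list of rules matched against the tool name. The example below is a conceptual illustration in plain Python, not Inspect's policy format or API: the POLICY table, the decision labels, and the decide function are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Ordered (tool-name glob, decision) rules; the first matching rule wins.
# Decision labels are illustrative only.
POLICY: List[Tuple[str, str]] = [
    ("web_*", "approve"),   # read-only web tools run without review
    ("bash", "escalate"),   # shell commands go to a human reviewer
    ("*", "reject"),        # anything else is refused
]


def decide(tool_name: str, policy: List[Tuple[str, str]] = POLICY) -> str:
    """Return the decision of the first rule whose glob matches the tool name."""
    for pattern, decision in policy:
        if fnmatchcase(tool_name, pattern):
            return decision
    # No rule matched: fail closed.
    return "reject"


for name in ("web_search", "bash", "python"):
    print(name, "->", decide(name))
```

Ordering rules from most specific to least specific, and rejecting by default, keeps the policy fail-closed when a model calls a tool nobody anticipated.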
Tooling and Developer Experience
Beyond the core evaluation engine, Inspect ships with a web-based Inspect View log viewer for monitoring and visualizing evaluation runs, and a VS Code Extension for authoring, debugging, and browsing logs directly in the editor. Evaluation logs are written locally by default and can be explored via inspect view in the browser. The framework also exposes a Python API (eval()) for programmatic use alongside the CLI, and supports structured output, reasoning model options, batch processing, adaptive concurrency, and early stopping.
Open-Source Lineage and Current Status
The repository was created in November 2023 and, according to the GitHub project page, was last updated in May 2026. It has accumulated over 2,100 stars and 517 forks. The project is maintained under the UKGovernmentBEIS GitHub organization and is released under the MIT License, making it freely usable, modifiable, and distributable. The documentation site at inspect.aisi.org.uk is maintained alongside the codebase, and the project supports a uv-based workflow for reproducible development environments.
Pricing
Open Source
Fully free and open-source under the MIT License. Install via pip and use with any supported model provider.
- Full framework access
- 200+ pre-built evaluations
- All built-in solvers, scorers, and tools
- VS Code Extension
- Web-based log viewer
Capabilities
Key Features
- 200+ pre-built LLM evaluations
- Composable Datasets, Solvers, and Scorers
- Built-in prompt engineering solvers (chain-of-thought, self-critique)
- Model-graded scoring
- Multi-turn dialog support
- Tool calling (bash, Python, text editing, web search, web browsing, computer use)
- MCP (Model Context Protocol) tool integration
- Custom tool definitions
- Multi-agent evaluation primitives
- Support for external agents (Claude Code, Codex CLI, Gemini CLI)
- Sandboxing via Docker, Kubernetes, Modal, Proxmox
- Tool approval policies
- Web-based Inspect View log viewer
- VS Code Extension for authoring and debugging
- CLI and Python API
- Structured output support
- Reasoning model support
- Batch processing mode
- Adaptive concurrency and rate-limit handling
- Multimodal evaluation (images, audio, video)
- Eval Sets for large-scale evaluation runs
- Early stopping API
- Caching of model outputs
- Extensions API for custom model providers, sandboxes, and storage
