# simple-evals

> A lightweight, open-source Python library by OpenAI for evaluating language models across standard benchmarks like MMLU, MATH, GPQA, and SimpleQA.

simple-evals is a lightweight Python library published by OpenAI under the MIT License for running standardized evaluations against large language models. It was open-sourced to provide transparency around the accuracy numbers OpenAI publishes alongside its latest models. The repository has accumulated over 4,500 GitHub stars since its creation in April 2024.

## What It Is

simple-evals is a benchmark harness: a collection of eval implementations and sampling interfaces that let researchers and developers run reproducible accuracy tests against LLM APIs. It is not a comprehensive eval suite (that role belongs to the separate `openai/evals` repo); instead it focuses on a curated set of well-known academic benchmarks, run mostly in a zero-shot, chain-of-thought setting (DROP, listed below as 3-shot, is the exception). The library is written in Python and designed to be run from the command line against the OpenAI or Anthropic APIs.
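The harness pattern described here (a sampler that talks to a model, plus an eval that scores its replies) can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: `Sampler`, `run_eval`, `echo_sampler`, and `exact_match` are hypothetical names, and the stand-in sampler returns a fixed string instead of calling an API.

```python
from typing import Callable, Protocol


class Sampler(Protocol):
    """Hypothetical interface: takes a prompt string, returns the model's reply."""

    def __call__(self, prompt: str) -> str: ...


def run_eval(
    sampler: Sampler,
    examples: list[dict[str, str]],
    grade: Callable[[str, str], bool],
) -> float:
    """Return the fraction of examples whose response is graded correct."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for example in examples:
        response = sampler(example["prompt"])
        if grade(response, example["answer"]):
            correct += 1
    return correct / len(examples)


def echo_sampler(prompt: str) -> str:
    """Stand-in for an API-backed sampler; always answers '4'."""
    return "4"


def exact_match(response: str, answer: str) -> bool:
    """Grade by exact string equality after trimming whitespace."""
    return response.strip() == answer.strip()


examples = [
    {"prompt": "What is 2 + 2?", "answer": "4"},
    {"prompt": "What is 3 + 3?", "answer": "6"},
]
accuracy = run_eval(echo_sampler, examples, exact_match)
print(accuracy)  # 0.5: one of the two answers matches
```

Keeping the sampler behind a small callable interface is what lets the same eval run against different providers: only the sampler changes, while the dataset and grading logic stay fixed.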

## Benchmarks Included

The repository hosts reference implementations for the following evaluations:

- **MMLU** — Massive Multitask Language Understanding (57 academic subjects)
- **MATH / MATH-500** — Mathematical problem solving; newer models are evaluated on the MATH-500 IID split
- **GPQA** — Graduate-Level Google-Proof Q&A Benchmark
- **DROP** — Reading comprehension requiring discrete reasoning (F1, 3-shot)
- **MGSM** — Multilingual Grade School Math Benchmark
- **HumanEval** — Code generation evaluation
- **SimpleQA** — Short-form factuality benchmark developed by OpenAI
- **BrowseComp** — Benchmark for browsing agents
- **HealthBench** — Evaluating LLMs toward improved human health outcomes
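Most of the benchmarks above are scored by accuracy, but DROP is listed with an F1 metric. As a rough illustration of what a token-overlap F1 measures, here is a simplified sketch. It is an assumption-laden stand-in, not the official DROP scorer or the repository's implementation: `normalize` and `token_f1` are hypothetical helpers, and real scorers typically apply more normalization (for example, handling articles and numbers).

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision is 2/3 and recall is 1, so F1 is roughly 0.8.
print(token_f1("the Denver Broncos", "Denver Broncos"))
```

Unlike exact match, this gives partial credit when a free-text answer contains the reference span plus extra words.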

## Design Philosophy

The library deliberately emphasizes the **zero-shot, chain-of-thought** prompting setting with minimal instructions (e.g., "Solve the following multiple choice problem"). The README explains this choice: few-shot and role-playing prompts are carryovers from evaluating base models and older models that followed instructions less reliably. For modern instruction-tuned chat models, the README argues that zero-shot prompting better reflects realistic usage. The library provides sampling interfaces for the OpenAI API and the Anthropic Claude API, with API keys set via environment variables.
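A zero-shot, chain-of-thought multiple-choice prompt of the kind described above might look like the following sketch. The template wording beyond the quoted instruction, the `Answer: X` convention, and the helper names `format_prompt` and `extract_answer` are all assumptions made for illustration, not the repository's actual prompt or parser. `OPENAI_API_KEY` is the variable name conventionally read by the OpenAI Python SDK.

```python
import os
import re
from typing import Optional

# Hypothetical template: no few-shot examples and no role-play, just a short
# instruction, the question, and the answer options.
TEMPLATE = """Solve the following multiple choice problem. Think step by step, then finish with a line of the form "Answer: X" where X is one of A, B, C, or D.

{question}

A) {A}
B) {B}
C) {C}
D) {D}"""

ANSWER_RE = re.compile(r"Answer\s*:\s*([ABCD])\b")


def format_prompt(question: str, choices: dict[str, str]) -> str:
    """Fill the template with a question and its four lettered choices."""
    return TEMPLATE.format(question=question, **choices)


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


# Keys come from the environment rather than being hard-coded; None if unset.
api_key = os.environ.get("OPENAI_API_KEY")

prompt = format_prompt("What is 2 + 2?", {"A": "3", "B": "4", "C": "5", "D": "6"})
print(prompt)
print(extract_answer("2 + 2 equals 4, which is option B.\nAnswer: B"))  # B
```

Taking the last match rather than the first is a deliberate choice in this sketch: a chain-of-thought response may mention candidate answers while reasoning, and only the final line is meant as the committed answer.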

## Benchmark Results Published

OpenAI uses this repo to publish benchmark scores for its model families. The results table in the README covers o3, o4-mini, o3-mini, o1, GPT-4.1, GPT-4o, GPT-4.5-preview, GPT-4 Turbo, and GPT-4, as well as third-party reported results for Claude 3.5 Sonnet, Llama 3.1, Grok 2, and Gemini model families. Results are presented per model variant and reasoning level (e.g., o3-high, o3, o3-low).

## Update: Deprecation Notice (July 2025)

Since July 2025, the repository README has carried a deprecation notice: simple-evals **will no longer be updated for new models or benchmark results**. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA, but active development has ended. The last push to the repository was in April 2026, and the project remains available under the MIT License for anyone who wants to fork or build on it. The maintainers note that they are not actively monitoring PRs or Issues and are not accepting new evals.

## Features
- Zero-shot chain-of-thought evaluation framework
- MMLU benchmark implementation
- MATH and MATH-500 benchmark implementation
- GPQA benchmark implementation
- DROP benchmark implementation
- MGSM multilingual math benchmark
- HumanEval code generation benchmark
- SimpleQA factuality benchmark
- BrowseComp browsing agent benchmark
- HealthBench health-focused LLM evaluation
- OpenAI API sampling interface
- Anthropic Claude API sampling interface
- Command-line runner with model selection
- Benchmark results table for major model families

## Integrations
OpenAI API, Anthropic Claude API

## Platforms
CLI, API

## Pricing
Open Source

## Links
- Website: https://github.com/openai/simple-evals
- Documentation: https://github.com/openai/simple-evals
- Repository: https://github.com/openai/simple-evals
- EveryDev.ai: https://www.everydev.ai/tools/simple-evals