simple-evals
A lightweight, open-source Python library by OpenAI for evaluating language models across standard benchmarks like MMLU, MATH, GPQA, and SimpleQA.
About simple-evals
simple-evals is a lightweight Python library published by OpenAI under the MIT License for running standardized evaluations against large language models. It was open-sourced to provide transparency around the accuracy numbers OpenAI publishes alongside its latest models. The repository has accumulated over 4,500 GitHub stars since its creation in April 2024.
What It Is
simple-evals is a benchmark harness — a collection of eval implementations and sampling interfaces that let researchers and developers run reproducible accuracy tests against LLM APIs. Rather than being a comprehensive eval suite (that role belongs to the separate openai/evals repo), simple-evals focuses on a curated set of well-known academic benchmarks run in a zero-shot, chain-of-thought setting. The library is written in Python and designed to be run from the command line against the OpenAI or Anthropic APIs.
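The harness idea described above, a sampling interface plus an eval that scores its responses, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's actual code: the names `Sampler`, `Example`, `run_eval`, `exact_match`, and `stub_sampler` are hypothetical, and the stub stands in for a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Sampler(Protocol):
    """Hypothetical sampling interface: prompt text in, completion text out."""

    def __call__(self, prompt: str) -> str: ...


@dataclass
class Example:
    question: str
    answer: str


def exact_match(response: str, target: str) -> bool:
    """Grade by case-insensitive exact match after trimming whitespace."""
    return response.strip().lower() == target.strip().lower()


def run_eval(
    sampler: Sampler,
    examples: list[Example],
    grade: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of examples the sampler answers correctly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(grade(sampler(ex.question), ex.answer) for ex in examples)
    return correct / len(examples)


def stub_sampler(prompt: str) -> str:
    # Assumption: stands in for a real model API call; always answers "4".
    return "4"


examples = [Example("What is 2 + 2?", "4"), Example("What is 3 + 3?", "6")]
accuracy = run_eval(stub_sampler, examples)
print(accuracy)  # 0.5: the stub gets the first question right, the second wrong
```

Keeping the sampler behind a single callable is what would let the same eval run unchanged against different model providers.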
Benchmarks Included
The repository hosts reference implementations for the following evaluations:
- MMLU — Massive Multitask Language Understanding (57 academic subjects)
- MATH / MATH-500 — Mathematical problem solving; newer models are evaluated on the MATH-500 IID split
- GPQA — Graduate-Level Google-Proof Q&A Benchmark
- DROP — Reading comprehension requiring discrete reasoning (F1, 3-shot)
- MGSM — Multilingual Grade School Math Benchmark
- HumanEval — Code generation evaluation
- SimpleQA — Short-form factuality benchmark developed by OpenAI
- BrowseComp — Benchmark for browsing agents
- HealthBench — Evaluating LLMs toward improved human health outcomes
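DROP is listed above as scored with F1 rather than exact match. As a rough illustration of what a token-level F1 looks like, here is a simplified sketch; it is not the official DROP scorer (which, among other things, treats numbers and multi-span answers specially), and the helper names `normalize` and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the Denver Broncos", "Denver Broncos"))  # full credit
print(token_f1("Broncos", "Denver Broncos"))             # partial credit
print(token_f1("Seahawks", "Denver Broncos"))            # no credit
```

Unlike exact match, this gives partial credit when a response contains only part of the reference answer.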
Design Philosophy
The library deliberately emphasizes the zero-shot, chain-of-thought prompting setting with minimal instructions (e.g., "Solve the following multiple choice problem"). The README explains this choice: few-shot and role-playing prompts are carryovers from evaluating base models and older, less instruction-following models. For modern instruction-tuned chat models, zero-shot prompting is argued to be a better reflection of realistic usage. The library supports sampling interfaces for the OpenAI API and the Anthropic Claude API, with API keys set via environment variables.
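To make the zero-shot, chain-of-thought setting concrete, the sketch below builds a minimal multiple-choice prompt with no few-shot examples and no role-play, then pulls the final letter out of a free-form response. Only the opening instruction is taken from the text above; the rest of the template wording, the "Answer: X" convention, and the names `TEMPLATE`, `build_prompt`, and `extract_answer` are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: everything after the first sentence is illustrative wording.
TEMPLATE = """Solve the following multiple choice problem. Think step by step, then finish with a line of the form "Answer: X", where X is one of A, B, C, D.

{question}

A) {A}
B) {B}
C) {C}
D) {D}"""

ANSWER_RE = re.compile(r"Answer\s*:\s*([ABCD])\b", re.IGNORECASE)


def build_prompt(question: str, choices: list[str]) -> str:
    """Fill the template with one question and exactly four options."""
    if len(choices) != 4:
        raise ValueError("expected exactly four choices")
    return TEMPLATE.format(question=question, **dict(zip("ABCD", choices)))


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
response = "2 + 2 equals 4, which is option B.\nAnswer: B"
print(extract_answer(response))  # B
```

Taking the last match rather than the first lets the model mention candidate letters while reasoning without confusing the grader.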
Benchmark Results Published
OpenAI has used this repo to publish benchmark scores for its model families. The results table in the README covers o3, o4-mini, o3-mini, o1, GPT-4.1, GPT-4o, GPT-4.5-preview, GPT-4 Turbo, and GPT-4, as well as third-party reported results for the Claude 3.5 Sonnet, Llama 3.1, Grok 2, and Gemini model families. Results are broken out by model variant and reasoning level (e.g., o3-high, o3, o3-low).
Update: Deprecation Notice (July 2025)
As of July 2025, the repository README carries a deprecation notice: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA, but active development has ended. The last push to the repository was in April 2026, and the project remains available under MIT for anyone who wants to fork or build on it. The maintainers note they are not actively monitoring PRs or Issues and are not accepting new evals going forward.
Pricing
Open Source
Freely available under the MIT License. Use, modify, and distribute the code at no cost, provided the copyright and license notice is retained in copies.
- Full source code access under MIT License
- All benchmark implementations (MMLU, MATH, GPQA, DROP, MGSM, HumanEval, SimpleQA, BrowseComp, HealthBench)
- OpenAI and Anthropic API sampling interfaces
- Command-line runner
Capabilities
Key Features
- Zero-shot chain-of-thought evaluation framework
- MMLU benchmark implementation
- MATH and MATH-500 benchmark implementation
- GPQA benchmark implementation
- DROP benchmark implementation
- MGSM multilingual math benchmark
- HumanEval code generation benchmark
- SimpleQA factuality benchmark
- BrowseComp browsing agent benchmark
- HealthBench health-focused LLM evaluation
- OpenAI API sampling interface
- Anthropic Claude API sampling interface
- Command-line runner with model selection
- Benchmark results table for major model families
