simple-evals

Name: simple-evals
Availability: OnlineOnly
Author: OpenAI, Inc.

A lightweight, open-source Python library by OpenAI for evaluating language models across standard benchmarks like MMLU, MATH, GPQA, and SimpleQA.

At a Glance

Pricing

Open Source

Available at no cost under the MIT License. Use, modify, and distribute freely, provided the copyright and license notice is kept with all copies.

Available On

CLI

API

OpenAI, Inc. | San Francisco, CA | Est. 2015 | $190B+ raised

Listed Jun 2026

About simple-evals

simple-evals is a lightweight Python library published by OpenAI under the MIT License for running standardized evaluations against large language models. It was open-sourced to provide transparency around the accuracy numbers OpenAI publishes alongside its latest models. The repository has accumulated over 4,500 GitHub stars since its creation in April 2024.

What It Is

simple-evals is a benchmark harness: a collection of eval implementations and sampling interfaces that let researchers and developers run reproducible accuracy tests against LLM APIs. It is not a comprehensive eval suite (that role belongs to the separate openai/evals repo). Instead, it focuses on a curated set of well-known academic benchmarks, run mostly in a zero-shot, chain-of-thought setting (DROP, reported as 3-shot F1, is the exception). The library is written in Python and designed to be run from the command line against the OpenAI or Anthropic APIs.
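The harness pattern described here (a sampling interface plus an eval that scores its output) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names Sampler, Example, CannedSampler, and run_eval are hypothetical, and the canned sampler stands in for a real API call.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class Sampler(Protocol):
    """Hypothetical sampling interface: prompt text in, completion text out."""

    def __call__(self, prompt: str) -> str: ...


@dataclass
class Example:
    question: str
    answer: str


class CannedSampler:
    """Stand-in for an API-backed sampler; returns fixed replies for the demo."""

    def __init__(self, replies: Dict[str, str]) -> None:
        self.replies = replies

    def __call__(self, prompt: str) -> str:
        # Unknown prompts yield an empty completion, which scores as wrong.
        return self.replies.get(prompt, "")


def run_eval(sampler: Sampler, examples: List[Example]) -> float:
    """Return the exact-match accuracy of the sampler over the examples."""
    if not examples:
        return 0.0
    correct = sum(sampler(ex.question).strip() == ex.answer for ex in examples)
    return correct / len(examples)


examples = [Example("2 + 2 = ?", "4"), Example("Capital of France?", "Paris")]
sampler = CannedSampler({"2 + 2 = ?": "4", "Capital of France?": "Rome"})
accuracy = run_eval(sampler, examples)  # one of two answers matches
```

Keeping the sampler behind a small callable interface is what would let the same eval run against different providers without changing the scoring code.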

Benchmarks Included

The repository hosts reference implementations for the following evaluations:

MMLU — Massive Multitask Language Understanding (57 academic subjects)
MATH / MATH-500 — Mathematical problem solving; newer models are evaluated on the MATH-500 IID split
GPQA — Graduate-Level Google-Proof Q&A Benchmark
DROP — Reading comprehension requiring discrete reasoning (F1, 3-shot)
MGSM — Multilingual Grade School Math Benchmark
HumanEval — Code generation evaluation
SimpleQA — Short-form factuality benchmark developed by OpenAI
BrowseComp — Benchmark for browsing agents
HealthBench — Evaluating LLMs toward improved human health outcomes
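The DROP entry above is scored with F1 rather than exact match. As a rough illustration of what a token-overlap F1 looks like, here is a simplified sketch. It is not the official DROP scorer, which also handles multi-span answers and number normalization; the helper names normalize and token_f1 are hypothetical.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and drop English articles."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this gives partial credit: a prediction of "Broncos" against the reference "Denver Broncos" has full precision but only half recall.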

Design Philosophy

The library deliberately emphasizes the zero-shot, chain-of-thought prompting setting with minimal instructions (e.g., "Solve the following multiple choice problem"). The README explains this choice: few-shot and role-playing prompts are carryovers from evaluating base models and older models that followed instructions less reliably. For modern instruction-tuned chat models, the README argues that zero-shot prompting better reflects realistic usage. The library provides sampling interfaces for the OpenAI API and the Anthropic Claude API, with API keys set via environment variables.
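A zero-shot, chain-of-thought prompt of this kind might look like the sketch below. Only the opening instruction is quoted from the README; the rest of the template, the "Answer: X" convention, and the names build_prompt and extract_answer are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical template: no worked examples (zero-shot), one short instruction,
# and a request to reason before giving a final, machine-readable answer line.
TEMPLATE = (
    "Solve the following multiple choice problem. Think step by step, then "
    "finish with a line of the form 'Answer: X' where X is one of ABCD.\n\n"
    "{question}\n\n"
    "A) {A}\nB) {B}\nC) {C}\nD) {D}"
)

ANSWER_RE = re.compile(r"answer\s*:\s*([ABCD])\b", re.IGNORECASE)


def build_prompt(question: str, choices: List[str]) -> str:
    """Fill the template with a question and exactly four answer choices."""
    if len(choices) != 4:
        raise ValueError("expected exactly four choices")
    return TEMPLATE.format(question=question, **dict(zip("ABCD", choices)))


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
```

Taking the last match rather than the first lets the model mention candidate answers while reasoning without confusing the grader.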

Benchmark Results Published

OpenAI uses this repo to publish benchmark scores for its model families. The results table in the README covers o3, o4-mini, o3-mini, o1, GPT-4.1, GPT-4o, GPT-4.5-preview, GPT-4 Turbo, and GPT-4, as well as third-party reported results for Claude 3.5 Sonnet, Llama 3.1, Grok 2, and Gemini model families. Results are presented per model variant and reasoning level (e.g., o3-high, o3, o3-low).

Update: Deprecation Notice (July 2025)

As of July 2025, the repository README carries a deprecation notice: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA, but active development has ended. The last push to the repository was in April 2026, and the project remains available under the MIT License for anyone who wants to fork or build on it. The maintainers note that they are not actively monitoring PRs or Issues and are not accepting new evals.

Pricing

Full source code access under MIT License
All benchmark implementations (MMLU, MATH, GPQA, DROP, MGSM, HumanEval, SimpleQA, BrowseComp, HealthBench)
OpenAI and Anthropic API sampling interfaces
Command-line runner

Capabilities

Key Features

Zero-shot chain-of-thought evaluation framework
MMLU benchmark implementation
MATH and MATH-500 benchmark implementation
GPQA benchmark implementation
DROP benchmark implementation
MGSM multilingual math benchmark
HumanEval code generation benchmark
SimpleQA factuality benchmark
BrowseComp browsing agent benchmark
HealthBench health-focused LLM evaluation
OpenAI API sampling interface
Anthropic Claude API sampling interface
Command-line runner with model selection
Benchmark results table for major model families

Integrations

OpenAI API

Anthropic Claude API
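Since both integrations take their credentials from environment variables, a pre-flight check can catch a missing key before an eval run starts. The sketch below assumes the conventional variable names read by the official OpenAI and Anthropic Python SDKs (OPENAI_API_KEY and ANTHROPIC_API_KEY); the helper missing_keys and the REQUIRED_KEYS mapping are hypothetical.

```python
import os
from typing import List, Mapping

# Assumed variable names, following the official SDK conventions.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(
    providers: List[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the variable names that are unset or empty for the providers."""
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.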

API Available

Topics

LLM Evaluations, AI Development Libraries, Academic Research

Alternatives

ZeroEval, EnterpriseRAG-Bench, Inspect AI
