EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026

    simple-evals

    LLM Evaluations
    Featured

    A lightweight, open-source Python library by OpenAI for evaluating language models across standard benchmarks like MMLU, MATH, GPQA, and SimpleQA.

    At a Glance

    Pricing
    Open Source

    Freely available under the MIT License. Use, modify, and distribute without restriction.

    Available On

    CLI
    API

    Topics

    LLM Evaluations, AI Development Libraries, Academic Research

    Alternatives

    ZeroEval, EnterpriseRAG-Bench, Inspect AI
    Developer
    OpenAI, Inc. · San Francisco, CA · Est. 2015 · $190B+ raised

    Listed Jun 2026

    About simple-evals

    simple-evals is a lightweight Python library published by OpenAI under the MIT License for running standardized evaluations against large language models. It was open-sourced to provide transparency around the accuracy numbers OpenAI publishes alongside its latest models. The repository has accumulated over 4,500 GitHub stars since its creation in April 2024.

    What It Is

    simple-evals is a benchmark harness: a collection of eval implementations and sampling interfaces that let researchers and developers run reproducible accuracy tests against LLM APIs. Rather than being a comprehensive eval suite (that role belongs to the separate openai/evals repo), simple-evals focuses on a curated set of well-known academic benchmarks, run mainly in a zero-shot, chain-of-thought setting. The library is written in Python and designed to be run from the command line against the OpenAI or Anthropic APIs.
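
As a rough illustration of what "eval implementations plus sampling interfaces" can look like, here is a minimal sketch. All names in it (`Sampler`, `CannedSampler`, `Example`, `run_eval`) are hypothetical and are not taken from the simple-evals codebase; the stand-in sampler returns canned text instead of calling any API.

```python
from dataclasses import dataclass
from typing import Protocol


class Sampler(Protocol):
    """Hypothetical sampling interface: takes a prompt, returns model text."""

    def __call__(self, prompt: str) -> str: ...


@dataclass
class Example:
    question: str
    answer: str


class CannedSampler:
    """Offline stand-in; a real sampler would call an LLM API here."""

    def __init__(self, canned: dict[str, str]) -> None:
        self.canned = canned

    def __call__(self, prompt: str) -> str:
        return self.canned.get(prompt, "")


def run_eval(sampler: Sampler, examples: list[Example]) -> float:
    """Return exact-match accuracy of the sampler over the examples."""
    if not examples:
        return 0.0
    correct = sum(sampler(ex.question).strip() == ex.answer for ex in examples)
    return correct / len(examples)


examples = [Example("2 + 2 = ?", "4"), Example("Capital of France?", "Paris")]
sampler = CannedSampler({"2 + 2 = ?": "4", "Capital of France?": "Lyon"})
accuracy = run_eval(sampler, examples)
print(accuracy)  # one of two answers matches, so this prints 0.5
```

Keeping the sampler behind a small callable interface is what lets the same eval run against different providers: only the sampler changes, not the scoring loop.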

    Benchmarks Included

    The repository hosts reference implementations for the following evaluations:

    • MMLU — Massive Multitask Language Understanding (57 academic subjects)
    • MATH / MATH-500 — Mathematical problem solving; newer models are evaluated on the MATH-500 IID split
    • GPQA — Graduate-Level Google-Proof Q&A Benchmark
    • DROP — Reading comprehension requiring discrete reasoning (F1, 3-shot)
    • MGSM — Multilingual Grade School Math Benchmark
    • HumanEval — Code generation evaluation
    • SimpleQA — Short-form factuality benchmark developed by OpenAI
    • BrowseComp — Benchmark for browsing agents
    • HealthBench — Evaluating LLMs toward improved human health outcomes
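
The "F1" noted for DROP refers to a token-overlap score between the predicted and reference answers. The following is a simplified sketch of such a score, assuming a basic normalization (lowercasing, dropping punctuation and English articles); it is not the official DROP scorer, which treats numbers and multi-span answers with more care. The function names are illustrative.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, drop English articles."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


full = token_f1("The Denver Broncos", "Denver Broncos")   # all tokens match
partial = token_f1("Broncos", "Denver Broncos")           # precision 1, recall 0.5
```

Unlike exact match, this gives partial credit when a prediction contains only part of the reference answer.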

    Design Philosophy

    The library deliberately emphasizes the zero-shot, chain-of-thought prompting setting with minimal instructions (e.g., "Solve the following multiple choice problem"). The README explains this choice: few-shot and role-playing prompts are carryovers from evaluating base models and older, less instruction-following models. For modern instruction-tuned chat models, zero-shot prompting is argued to be a better reflection of realistic usage. The library supports sampling interfaces for the OpenAI API and the Anthropic Claude API, with API keys set via environment variables.
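
To make the zero-shot, chain-of-thought setting concrete, here is a minimal sketch of building a multiple-choice prompt and pulling the final answer letter out of a free-form response. Only the opening instruction is quoted from the text above; the rest of the template, the `Answer: X` convention, and the helper names are assumptions for illustration, not the library's actual prompt. The environment variable names are the ones the official OpenAI and Anthropic Python SDKs read by default.

```python
import os
import re
from typing import Optional

# Opening sentence is quoted above; the answer-line format is an assumption.
TEMPLATE = (
    "Solve the following multiple choice problem. Think step by step, then "
    "finish with a line of the form 'Answer: X' where X is one of ABCD.\n\n"
    "{question}\n\n{options}"
)


def build_prompt(question: str, choices: list[str]) -> str:
    """Format a question with no few-shot examples and no role-play preamble."""
    options = "\n".join(f"{letter}) {text}" for letter, text in zip("ABCD", choices))
    return TEMPLATE.format(question=question, options=options)


def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = re.findall(r"Answer:\s*([ABCD])\b", response)
    return matches[-1] if matches else None


def missing_api_keys(names=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")) -> list[str]:
    """List which of the expected API key variables are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
letter = extract_answer("2 + 2 equals 4, which is option B.\nAnswer: B")
```

Taking the last match rather than the first tolerates responses where the model mentions a candidate answer mid-reasoning before committing to a final one.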

    Benchmark Results Published

    OpenAI uses this repo to publish benchmark scores for its model families. The results table in the README covers o3, o4-mini, o3-mini, o1, GPT-4.1, GPT-4o, GPT-4.5-preview, GPT-4 Turbo, and GPT-4, as well as third-party reported results for Claude 3.5 Sonnet, Llama 3.1, Grok 2, and Gemini model families. Results are presented per model variant and reasoning level (e.g., o3-high, o3, o3-low).

    Update: Deprecation Notice (July 2025)

    As of July 2025, the repository README carries a deprecation notice: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA, but active development has ended. The last push to the repository was in April 2026, and the project remains available under MIT for anyone who wants to fork or build on it. The maintainers note they are not actively monitoring PRs or Issues and are not accepting new evals going forward.

    Pricing

    Open Source

    • Full source code access under MIT License
    • All benchmark implementations (MMLU, MATH, GPQA, DROP, MGSM, HumanEval, SimpleQA, BrowseComp, HealthBench)
    • OpenAI and Anthropic API sampling interfaces
    • Command-line runner

    Capabilities

    Key Features

    • Zero-shot chain-of-thought evaluation framework
    • MMLU benchmark implementation
    • MATH and MATH-500 benchmark implementation
    • GPQA benchmark implementation
    • DROP benchmark implementation
    • MGSM multilingual math benchmark
    • HumanEval code generation benchmark
    • SimpleQA factuality benchmark
    • BrowseComp browsing agent benchmark
    • HealthBench health-focused LLM evaluation
    • OpenAI API sampling interface
    • Anthropic Claude API sampling interface
    • Command-line runner with model selection
    • Benchmark results table for major model families

    Integrations

    OpenAI API
    Anthropic Claude API
    API Available

    Developer

    OpenAI, Inc.

    OpenAI is an AI research and deployment company dedicated to ensuring that artificial general intelligence benefits all of humanity. It develops AI systems such as GPT and DALL-E and provides access to them through its commercial API services.

    Founded 2015
    San Francisco, CA
    $190B+ raised
    4,500 employees

    Used by

    Amgen
    Cisco
    Morgan Stanley
    Target

    Similar Tools

    ZeroEval

    Open-source evaluation framework for testing large language models with zero-shot prompting on reasoning and coding tasks.

    EnterpriseRAG-Bench

    An open-source benchmark dataset of 500,000+ enterprise documents and 500 questions for evaluating RAG systems on realistic company internal data.

    Inspect AI

    An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    96 tools

    AI Development Libraries

    Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

    228 tools

    Academic Research

    AI tools designed specifically for academic and scientific research.

    51 tools