
ZeroEval

LLM Evaluations

Open-source evaluation framework for testing large language models with zero-shot prompting on reasoning and coding tasks.


At a Glance

Pricing

Open Source

Free open-source evaluation framework

Available On

Web
API

Resources

Website
Docs
GitHub
llms.txt

Topics

LLM Evaluations
AI Development Libraries
AI Infrastructure

About ZeroEval

ZeroEval is an open-source evaluation framework designed to benchmark large language models (LLMs) using zero-shot prompting techniques. The project focuses on assessing model capabilities across reasoning, mathematics, and coding tasks without requiring few-shot examples, providing a standardized way to compare different AI models' performance.

The framework evaluates models on multiple benchmark datasets including MMLU-Redux for general knowledge, MATH-500 for mathematical reasoning, CRUX for code understanding, and ZebraLogic for logical reasoning puzzles. ZeroEval maintains public leaderboards that track performance across various model families including OpenAI, Anthropic, Google, Meta, and open-source alternatives.
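
To make the zero-shot setup concrete, the sketch below shows what such an evaluation loop looks like in Python. The names here (`build_zero_shot_prompt`, `model_fn`, the toy tasks) are assumptions made for illustration and are not ZeroEval's actual API; any provider call that maps a prompt string to an answer string could stand in for `model_fn`.

```python
# Minimal, self-contained sketch of a zero-shot evaluation loop.
# None of these names come from ZeroEval's codebase; `model_fn` stands in
# for any provider call (OpenAI, Anthropic, a local model, ...) that maps
# a prompt string to an answer string.
from typing import Callable

def build_zero_shot_prompt(question: str) -> str:
    # Zero-shot: the prompt carries only the instruction and the question,
    # with no worked example solutions prepended.
    return f"Answer the following question. Reply with the final answer only.\n\n{question}"

def evaluate(tasks: list[dict], model_fn: Callable[[str], str]) -> float:
    # Score each model answer against the reference and report accuracy.
    correct = 0
    for task in tasks:
        prediction = model_fn(build_zero_shot_prompt(task["question"]))
        if prediction.strip().lower() == task["answer"].strip().lower():
            correct += 1
    return correct / len(tasks)

if __name__ == "__main__":
    # Toy tasks and a dummy "model" so the sketch runs without an API key.
    toy_tasks = [
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What is the capital of France?", "answer": "Paris"},
    ]
    dummy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    print(f"Zero-shot accuracy: {evaluate(toy_tasks, dummy_model):.2f}")
```

A real run would replace the toy tasks with a benchmark dataset such as MMLU-Redux or MATH-500 and swap the dummy model for an actual provider client.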

Key Features:

  • Zero-Shot Evaluation - Tests models without providing example solutions, measuring true generalization capabilities and reasoning abilities across diverse problem types.

  • Multiple Benchmark Support - Includes MMLU-Redux (knowledge), MATH-500 (mathematics), CRUX (code reasoning), and ZebraLogic (logic puzzles) for comprehensive model assessment.

  • Public Leaderboards - Maintains transparent rankings of model performance with detailed breakdowns by task category and difficulty level.

  • Open Source Framework - Fully open-source codebase available on GitHub, allowing researchers and developers to run evaluations locally and contribute improvements.

  • Reproducible Results - Provides standardized evaluation protocols ensuring consistent and comparable results across different model evaluations.

  • Multi-Model Support - Compatible with various LLM providers and architectures, enabling fair comparisons between proprietary and open-source models (a minimal adapter sketch follows this list).
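
As a rough illustration of how the multi-model and reproducibility points above fit together, the adapter below hides provider differences behind a single prompt-to-answer signature while pinning decoding settings such as temperature. The class and function names are hypothetical, not ZeroEval's actual interfaces.

```python
# Illustrative adapter for provider-agnostic, repeatable generation.
# The names here are assumptions for this sketch, not ZeroEval's code.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GenerationConfig:
    temperature: float = 0.0  # greedy decoding keeps runs comparable
    max_tokens: int = 512

def make_adapter(provider_call: Callable[[str, GenerationConfig], str],
                 config: GenerationConfig = GenerationConfig()) -> Callable[[str], str]:
    # Wrap any provider-specific call in a single-argument function, so the
    # same evaluation loop can query proprietary and open-source models alike.
    return lambda prompt: provider_call(prompt, config)
```

An adapter built this way has the same shape as `model_fn` in the earlier sketch, so comparing providers reduces to swapping `provider_call` while the evaluation protocol stays fixed.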

To get started with ZeroEval, clone the GitHub repository and follow the installation instructions in the documentation. The framework supports running evaluations through command-line interfaces, making it accessible for researchers conducting model comparisons. Results can be submitted to the public leaderboard for community visibility and benchmarking purposes.

Pricing

Open Source

Free open-source evaluation framework

  • Full evaluation framework
  • All benchmark datasets
  • Public leaderboard access
  • Community support

Capabilities

Key Features

  • Zero-shot LLM evaluation
  • MMLU-Redux benchmark
  • MATH-500 mathematical reasoning
  • CRUX code understanding
  • ZebraLogic logical reasoning
  • Public leaderboards
  • Multi-model support
  • Reproducible evaluation protocols
  • Open-source framework

Integrations

OpenAI models
Anthropic Claude
Google Gemini
Meta Llama
Mistral
Qwen
DeepSeek
API Available


Developer

ZeroEval Team

The ZeroEval team operates LLM Stats and publishes verifiable, high-quality benchmarks and leaderboards for AI models. The team builds evaluation infrastructure, benchmark suites, and public leaderboards to increase transparency into model capabilities, and maintains supporting tools such as a model comparison view, a playground, and API documentation so researchers and practitioners can access benchmark data.

Founded 2025
New York, NY
$500 raised
3 employees
Website
X / Twitter
1 tool in directory

Similar Tools


SkillsBench

An open-source evaluation framework that benchmarks how well AI agent skills work across diverse, expert-curated tasks in high-GDP-value domains.


TruLens

Open-source library for evaluating and tracking LLM applications with feedback functions and observability tools.


Artificial Analysis

Independent AI model benchmarking platform providing comprehensive performance analysis across intelligence, speed, cost, and quality metrics.


Related Topics

LLM Evaluations

Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

31 tools

AI Development Libraries

Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

91 tools

AI Infrastructure

Infrastructure designed for deploying and running AI models.

121 tools