    ZeroEval

    LLM Evaluations

    Open-source evaluation framework for testing large language models with zero-shot prompting on reasoning and coding tasks.

    At a Glance

    Pricing
    Open Source

    Free open-source evaluation framework

    Available On

    Web
    API

    Resources

    Website · Docs · GitHub · llms.txt

    Topics

    LLM Evaluations · AI Development Libraries · AI Infrastructure

    Alternatives

    llmfit · SkillsBench · TruLens
    Developer
    ZeroEval · New York, NY · Est. 2025 · $500,000 raised

    Listed Feb 2026

    About ZeroEval

    ZeroEval is an open-source evaluation framework designed to benchmark large language models (LLMs) using zero-shot prompting techniques. The project focuses on assessing model capabilities across reasoning, mathematics, and coding tasks without requiring few-shot examples, providing a standardized way to compare different AI models' performance.
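
    To make "zero-shot" concrete, the sketch below contrasts a zero-shot prompt with a few-shot one. This is illustrative only, not ZeroEval's actual code; the prompt wording is an assumption.

```python
# Illustrative sketch of zero-shot vs. few-shot prompting; not ZeroEval's
# actual prompt templates, which live in the project's repository.

def zero_shot_prompt(question: str) -> str:
    # Zero-shot: the model sees only the task instruction and the question,
    # with no worked examples to pattern-match against.
    return (
        "Answer the following question. Reply with the answer only.\n\n"
        f"Question: {question}\nAnswer:"
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot baseline, for contrast: demonstrations precede the question.
    demos = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{demos}\nQuestion: {question}\nAnswer:"

print(zero_shot_prompt("What is the capital of France?"))
```

    Because no demonstrations are provided, a correct answer reflects the model's own generalization rather than imitation of in-context examples.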

    The framework evaluates models on multiple benchmark datasets including MMLU-Redux for general knowledge, MATH-500 for mathematical reasoning, CRUX for code understanding, and ZebraLogic for logical reasoning puzzles. ZeroEval maintains public leaderboards that track performance across various model families including OpenAI, Anthropic, Google, Meta, and open-source alternatives.
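
    A leaderboard entry ultimately reduces to per-benchmark accuracy. The toy loop below shows the shape of that computation with invented data and an exact-match scorer; real benchmarks use task-specific answer extraction and far larger datasets.

```python
# Toy scoring loop: exact-match accuracy per benchmark, collected into a
# leaderboard-style row. The items below are invented stand-ins, not real
# MMLU-Redux or MATH-500 entries.

benchmarks = {
    "MMLU-Redux": [("What is the capital of France?", "Paris")],
    "MATH-500": [("What is 7 * 8?", "56")],
}

def model_answer(question: str) -> str:
    # Placeholder for a real model call.
    canned = {"What is the capital of France?": "Paris", "What is 7 * 8?": "54"}
    return canned[question]

row = {}
for name, items in benchmarks.items():
    correct = sum(model_answer(q).strip() == gold for q, gold in items)
    row[name] = correct / len(items)

print(row)  # e.g. {'MMLU-Redux': 1.0, 'MATH-500': 0.0}
```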

    Key Features:

    • Zero-Shot Evaluation - Tests models without providing example solutions, measuring true generalization capabilities and reasoning abilities across diverse problem types.

    • Multiple Benchmark Support - Includes MMLU-Redux (knowledge), MATH-500 (mathematics), CRUX (code reasoning), and ZebraLogic (logic puzzles) for comprehensive model assessment.

    • Public Leaderboards - Maintains transparent rankings of model performance with detailed breakdowns by task category and difficulty level.

    • Open Source Framework - Fully open-source codebase available on GitHub, allowing researchers and developers to run evaluations locally and contribute improvements.

    • Reproducible Results - Provides standardized evaluation protocols so that results are consistent and directly comparable across runs and models (see the sketch after this list).

    • Multi-Model Support - Compatible with various LLM providers and architectures, enabling fair comparisons between proprietary and open-source models.
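
    Reproducibility in practice means pinning everything that can change a score. The dataclass below is a hypothetical illustration of such a protocol, not ZeroEval's configuration format:

```python
# Hypothetical evaluation protocol: every field that could move a score is
# pinned, so two runs with the same config are directly comparable. This is
# an illustration, not ZeroEval's actual config schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalProtocol:
    model: str                  # exact model identifier (e.g. a dated snapshot)
    benchmark: str              # dataset name and version
    temperature: float = 0.0    # greedy decoding removes sampling noise
    max_tokens: int = 1024      # fixed generation budget
    seed: int = 42              # fixed seed where the provider supports one
    prompt_version: str = "v1"  # prompt templates are versioned too

protocol = EvalProtocol(model="example-model-2025-01", benchmark="MATH-500")
print(protocol)
```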

    To get started with ZeroEval, clone the GitHub repository and follow the installation instructions in the documentation. The framework supports running evaluations through command-line interfaces, making it accessible for researchers conducting model comparisons. Results can be submitted to the public leaderboard for community visibility and benchmarking purposes.
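
    The real commands and flags are documented in the repository. Purely as a hedged sketch, in which the repository URL, entry point, and flags are all placeholders rather than ZeroEval's confirmed interface, a scripted local run might look like:

```python
# Hedged quick-start sketch. The URL, entry point, and flags below are
# PLACEHOLDERS; consult ZeroEval's documentation for the real ones.
import subprocess

subprocess.run(
    ["git", "clone", "https://github.com/example/ZeroEval.git"],
    check=True,
)
subprocess.run(
    ["python", "run_eval.py",        # hypothetical entry point
     "--model", "example-model",     # hypothetical flag
     "--benchmark", "MATH-500"],     # hypothetical flag
    cwd="ZeroEval",
    check=True,
)
```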

    Pricing

    Open Source

    Free open-source evaluation framework

    • Full evaluation framework
    • All benchmark datasets
    • Public leaderboard access
    • Community support

    Capabilities

    Key Features

    • Zero-shot LLM evaluation
    • MMLU-Redux benchmark
    • MATH-500 mathematical reasoning
    • CRUX code understanding
    • ZebraLogic logical reasoning
    • Public leaderboards
    • Multi-model support
    • Reproducible evaluation protocols
    • Open-source framework

    Integrations

    OpenAI models
    Anthropic Claude
    Google Gemini
    Meta Llama
    Mistral
    Qwen
    DeepSeek
    API Available

    Developer

    ZeroEval Team

    ZeroEval operates LLM Stats and publishes verifiable, high-quality benchmarks and leaderboards for AI models. The team builds evaluation infrastructure, benchmark suites, and public leaderboards to increase transparency around model capabilities, and maintains tools such as a model-comparison view, a playground, and API documentation so that researchers and practitioners can access benchmark data.

    Founded 2025
    New York, NY
    $500,000 raised
    3 employees

    Similar Tools

    llmfit

    LLMFit is an open-source CLI tool for benchmarking and evaluating the performance of large language models across various tasks.

    SkillsBench

    An open-source evaluation framework that benchmarks how well AI agent skills work across diverse, expert-curated tasks in high-GDP-value domains.

    TruLens

    Open-source library for evaluating and tracking LLM applications with feedback functions and observability tools.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics. (A minimal LLM-as-a-judge sketch follows below.)

    56 tools
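
    As one concrete pattern from this category, an LLM-as-a-judge metric asks a grading model to score another model's output. A minimal, provider-agnostic sketch follows; the `judge` callable is a placeholder, not any specific library's API:

```python
# Minimal LLM-as-a-judge pattern: a grading model scores another model's
# answer. `judge` stands in for any chat-completion call.
from typing import Callable

def judge_relevance(judge: Callable[[str], str], question: str, answer: str) -> bool:
    verdict = judge(
        "You are grading an answer for relevance to the question.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

# Stubbed judge for demonstration; swap in a real model call in practice.
print(judge_relevance(lambda prompt: "PASS", "What is 2 + 2?", "4"))  # True
```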

    AI Development Libraries

    Programming libraries and frameworks that provide machine learning capabilities, model integration, and AI functionality for developers.

    132 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    183 tools