EveryDev.ai
With AI, Everyone is a Dev. EveryDev.ai © 2026
    EnterpriseRAG-Bench

    Retrieval-Augmented Generation

    An open-source benchmark dataset of 500,000+ enterprise documents and 500 questions for evaluating RAG systems on realistic company internal data.

    At a Glance

    Pricing
    Open Source

    Fully open-source benchmark dataset and evaluation code available on GitHub and HuggingFace at no cost.

    Available On

    Web
    API
    CLI

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    Retrieval-Augmented Generation, LLM Evaluations, Academic Research

    Alternatives

    LOFT, Ragas, Firecrawl

    Developer

    Onyx | San Francisco, CA | Est. 2023 | $10.13M raised

    Listed May 2026

    About EnterpriseRAG-Bench

    EnterpriseRAG-Bench is an open-source benchmark released by Onyx (onyx.app). It provides a large-scale dataset of simulated company-internal documents and curated questions for evaluating Retrieval-Augmented Generation (RAG) systems. The code is available on GitHub under the MIT license, the dataset is hosted on HuggingFace, and a public leaderboard accompanies both. The project also includes an arXiv paper (2605.05253) authored by researchers from the Onyx team.

    What It Is

    EnterpriseRAG-Bench fills a gap in the RAG and information retrieval evaluation landscape: while most existing datasets focus on publicly accessible content (web search results, Stack Overflow, etc.), this benchmark focuses entirely on company-internal data. The dataset simulates a fictional AI inference company called "Redwood Inference" and covers the full breadth of enterprise knowledge sources — from Slack messages and emails to CRM records and engineering tickets.

    Dataset Composition

    The corpus contains slightly over 500,000 documents drawn from nine simulated source types:

    • Slack (~275,000): Internal channels and team discussions
    • Gmail (~120,000): Email threads from management, sales, and ICs
    • Linear (~35,000): Engineering, product, and design tickets
    • Google Drive (~25,000): Shared files and collaborative documents
    • HubSpot (~15,000): CRM records for sales
    • Fireflies (~10,000): Meeting transcripts
    • GitHub (~8,000): Pull requests and comments
    • Jira (~6,000): Support tickets
    • Confluence (~5,000): Wikis, runbooks, and structured documentation
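
    To make the volume distribution concrete, the short Python sketch below tallies the approximate per-source counts from the list above and prints each source's share of the total. The figures are the rounded numbers quoted in this listing, not values read from the dataset, and the names `SOURCE_COUNTS` and `source_shares` are made up for illustration. The rounded figures sum to about 499,000, a little under the stated "slightly over 500,000", which is presumably an effect of rounding.

```python
# Approximate per-source document counts, copied from the list above.
# These are rounded figures from the listing, not counts read from the dataset.
SOURCE_COUNTS = {
    "Slack": 275_000,
    "Gmail": 120_000,
    "Linear": 35_000,
    "Google Drive": 25_000,
    "HubSpot": 15_000,
    "Fireflies": 10_000,
    "GitHub": 8_000,
    "Jira": 6_000,
    "Confluence": 5_000,
}


def source_shares(counts):
    """Return each source's fraction of the listed total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(SOURCE_COUNTS.values())
shares = source_shares(SOURCE_COUNTS)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13}{SOURCE_COUNTS[name]:>9,}  {share:6.1%}")
print(f"{'Total':<13}{total:>9,}")
```

    On these figures, Slack and Gmail together account for roughly four fifths of the corpus, so a retriever that handles short conversational text poorly will be tested on most of the collection.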

    Question Categories

    The benchmark includes 500 questions across 10 categories designed to stress-test different RAG capabilities: Basic (175), Semantic (125), Intra-Document Reasoning (40), Project Related (40), Constrained (30), Conflicting Info (20), Completeness (20), Miscellaneous (20), High Level (10), and Info Not Found (20). An additional 100 metadata-dependent questions are available separately for teams interested in metadata-aware RAG, though these are excluded from the leaderboard due to differing evaluation criteria.
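
    Because the categories differ widely in size, a per-category breakdown can be more informative than a single score. The sketch below is one way to aggregate graded answers by category, assuming each answer has already been judged correct or incorrect. It is not the benchmark's own evaluation script: the function names, the `(category, is_correct)` input shape, and the demo grades are all hypothetical.

```python
from collections import defaultdict

# Question counts per category, copied from the paragraph above.
CATEGORY_COUNTS = {
    "Basic": 175,
    "Semantic": 125,
    "Intra-Document Reasoning": 40,
    "Project Related": 40,
    "Constrained": 30,
    "Conflicting Info": 20,
    "Completeness": 20,
    "Miscellaneous": 20,
    "High Level": 10,
    "Info Not Found": 20,
}


def accuracy_by_category(results):
    """Map (category, is_correct) pairs to {category: fraction correct}."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for category, is_correct in results:
        seen[category] += 1
        correct[category] += int(bool(is_correct))
    return {c: correct[c] / seen[c] for c in seen}


def overall_accuracy(per_category, counts=CATEGORY_COUNTS):
    """Question-weighted mean over the categories that were scored."""
    weights = {c: counts[c] for c in per_category}
    weighted = sum(per_category[c] * n for c, n in weights.items())
    return weighted / sum(weights.values())


# The ten listed categories account for all 500 questions.
assert sum(CATEGORY_COUNTS.values()) == 500

# Made-up grades for illustration only; these are not benchmark results.
demo = [("Basic", True), ("Basic", True), ("Basic", False), ("High Level", False)]
per_cat = accuracy_by_category(demo)
print(per_cat)
print(overall_accuracy(per_cat))
```

    Weighting by question count means the large Basic and Semantic categories dominate the overall figure, which is why small categories such as High Level are worth reporting separately.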

    Design Principles

    Five principles guide the dataset's construction, as described in the project's methodology documentation:

    1. Cross-document coherence — generation starts with human-in-the-loop scaffolding so documents share a common foundation
    2. Realistic volume distribution — document ratios across source types reflect real-world patterns
    3. Realistic noise — misfiled documents, near-duplicates, and conflicting facts are deliberately introduced
    4. Internal terminology — project codenames, acronyms, and organizational jargon are embedded throughout
    5. Generality — the generation framework supports diverse industries, company stages, and organizational structures

    Leaderboard and Submission

    A public leaderboard is hosted on HuggingFace Spaces. Onyx notes that it excludes itself from the leaderboard to avoid conflict of interest, given that it offers a commercial RAG product. Submissions require reproducibility: open-source systems must provide a reproduction guide, while closed-source systems must provide a sandbox or endpoint for verification. Submissions are made by contacting the Onyx team directly.

    Current Status

    The repository is actively maintained under the MIT license. The accompanying arXiv paper (2605.05253) is titled "EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge" and lists a 2026 publication year. The dataset is downloadable from GitHub releases or HuggingFace, and the leaderboard is live.

    Pricing

    Open Source

    • 500,000+ enterprise documents
    • 500 benchmark questions
    • Answer evaluation scripts
    • Dataset generation framework
    • MIT license

    Capabilities

    Key Features

    • 500,000+ simulated enterprise documents
    • 500 benchmark questions across 10 categories
    • 9 simulated data source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, Confluence)
    • Public leaderboard on HuggingFace Spaces
    • Answer evaluation scripts included
    • Dataset generation framework for custom industries and scales
    • 100 additional metadata-dependent questions
    • MIT-licensed open-source code
    • HuggingFace dataset hosting
    • arXiv paper with methodology documentation

    Integrations

    HuggingFace
    GitHub
    arXiv

    Developer

    Onyx

    Onyx builds the open source application layer for LLMs, making Generative AI accessible to every company in the world. Founded by Chris Weaver and Yuhong Sun, the team combines deep ML and product expertise to deliver a platform that goes far beyond a simple chat UI. Onyx integrates internal knowledge connectors, live web search, MCP actions, and code execution into a single deployable platform. Backed by Khosla Ventures, First Round Capital, and Y Combinator, Onyx serves teams from emerging startups to global enterprises.

    Founded 2023
    San Francisco, CA
    $10.13M raised
    29 employees

    Used by

    Ramp
    Netflix
    Thales Group
    Bitwarden

    Similar Tools

    LOFT

    LOFT (Long-context Frontiers) is a Google DeepMind benchmark for evaluating large language models on long-context retrieval and reasoning tasks across diverse modalities.

    Ragas

    Ragas is an open-source framework for evaluating and testing LLM applications, helping teams measure retrieval-augmented generation (RAG) pipeline quality with automated metrics.

    Firecrawl

    An open-source API to search, scrape, crawl, and interact with the web, converting any website into clean, LLM-ready markdown or structured JSON for AI agents and applications.


    Related Topics

    Retrieval-Augmented Generation

    RAG systems enhance LLM outputs by retrieving relevant information from external knowledge bases, combining generative AI with information retrieval to produce more accurate and contextual responses.

    68 tools

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    75 tools

    Academic Research

    AI tools designed specifically for academic and scientific research.

    37 tools