EnterpriseRAG-Bench

Name: EnterpriseRAG-Bench
Availability: OnlineOnly
Author: Onyx

An open-source benchmark dataset of 500,000+ simulated enterprise documents and 500 questions for evaluating RAG systems on realistic company-internal data.

At a Glance

Pricing

Open Source

Fully open-source benchmark dataset and evaluation code available on GitHub and HuggingFace at no cost.

Available On

Web

API

CLI

Onyx | San Francisco, CA | Est. 2023 | $10.13M raised

Listed May 2026

About EnterpriseRAG-Bench

EnterpriseRAG-Bench is an open-source benchmark released by Onyx (onyx.app) that provides a large-scale dataset of simulated company-internal documents and curated questions for evaluating Retrieval-Augmented Generation (RAG) systems. It is available on GitHub under the MIT license and hosted on HuggingFace, with an accompanying public leaderboard. A paper by researchers from the Onyx team is posted on arXiv (2605.05253).

What It Is

EnterpriseRAG-Bench fills a gap in the RAG and information retrieval evaluation landscape: while most existing datasets focus on publicly accessible content (web search results, Stack Overflow, etc.), this benchmark focuses entirely on company-internal data. The dataset simulates a fictional AI inference company called "Redwood Inference" and covers the full breadth of enterprise knowledge sources — from Slack messages and emails to CRM records and engineering tickets.

Dataset Composition

The corpus contains slightly over 500,000 documents drawn from nine simulated source types:

Slack (~275,000): Internal channels and team discussions
Gmail (~120,000): Email threads from management, sales, and ICs
Linear (~35,000): Engineering, product, and design tickets
Google Drive (~25,000): Shared files and collaborative documents
HubSpot (~15,000): CRM records for sales
Fireflies (~10,000): Meeting transcripts
GitHub (~8,000): Pull requests and comments
Jira (~6,000): Support tickets
Confluence (~5,000): Wikis, runbooks, and structured documentation
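
To make the volume distribution concrete, the following Python sketch computes each source's share of the corpus from the approximate counts listed above. These are the rounded figures from this listing rather than exact values read from the dataset, so they sum to roughly 499,000 instead of the stated 500,000+.

```python
# Approximate per-source document counts, copied from the list above.
# These are rounded figures, not exact values from the dataset itself.
SOURCE_COUNTS = {
    "Slack": 275_000,
    "Gmail": 120_000,
    "Linear": 35_000,
    "Google Drive": 25_000,
    "HubSpot": 15_000,
    "Fireflies": 10_000,
    "GitHub": 8_000,
    "Jira": 6_000,
    "Confluence": 5_000,
}


def source_shares(counts):
    """Return each source's fraction of the total document count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    shares = source_shares(SOURCE_COUNTS)
    # Print sources from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<13}{SOURCE_COUNTS[name]:>8,}  {share:6.1%}")
```

On these rounded figures, Slack and Gmail together account for about 79% of the documents, so a retriever's behavior on short conversational text is likely to dominate any corpus-wide statistic.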

Question Categories

The benchmark includes 500 questions across 10 categories designed to stress-test different RAG capabilities:

Basic (175)
Semantic (125)
Intra-Document Reasoning (40)
Project Related (40)
Constrained (30)
Conflicting Info (20)
Completeness (20)
Miscellaneous (20)
High Level (10)
Info Not Found (20)

An additional 100 metadata-dependent questions are available separately for teams interested in metadata-aware RAG, though these are excluded from the leaderboard due to differing evaluation criteria.
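
Because the categories differ widely in size, a single aggregate score depends on how the categories are weighted. The Python sketch below uses the category counts given above to contrast a question-weighted (micro) average with an unweighted per-category (macro) average. The scores in it are invented for illustration, and this listing does not say how the leaderboard aggregates results, so treat the functions as an assumption rather than the benchmark's scoring method.

```python
# Question counts per category, copied from the listing above.
CATEGORY_COUNTS = {
    "Basic": 175,
    "Semantic": 125,
    "Intra-Document Reasoning": 40,
    "Project Related": 40,
    "Constrained": 30,
    "Conflicting Info": 20,
    "Completeness": 20,
    "Miscellaneous": 20,
    "High Level": 10,
    "Info Not Found": 20,
}


def micro_accuracy(per_category, counts=CATEGORY_COUNTS):
    """Accuracy weighted by the number of questions in each category."""
    total = sum(counts.values())
    return sum(per_category[name] * n for name, n in counts.items()) / total


def macro_accuracy(per_category, counts=CATEGORY_COUNTS):
    """Unweighted mean of the per-category accuracies."""
    return sum(per_category[name] for name in counts) / len(counts)


if __name__ == "__main__":
    # Hypothetical system: strong on the two largest categories, weaker elsewhere.
    scores = {name: 0.5 for name in CATEGORY_COUNTS}
    scores["Basic"] = 0.9
    scores["Semantic"] = 0.9
    print(f"micro: {micro_accuracy(scores):.2f}")
    print(f"macro: {macro_accuracy(scores):.2f}")
```

With these made-up scores the micro average is 0.74 and the macro average is 0.58, since Basic and Semantic supply 300 of the 500 questions. Reporting per-category results alongside any aggregate avoids hiding weakness on the smaller, harder categories.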

Design Principles

Five principles guide the dataset's construction, as described in the project's methodology documentation:

Cross-document coherence — generation starts with human-in-the-loop scaffolding so documents share a common foundation
Realistic volume distribution — document ratios across source types reflect real-world patterns
Realistic noise — misfiled documents, near-duplicates, and conflicting facts are deliberately introduced
Internal terminology — project codenames, acronyms, and organizational jargon are embedded throughout
Generality — the generation framework supports diverse industries, company stages, and organizational structures

Leaderboard and Submission

A public leaderboard is hosted on HuggingFace Spaces. Onyx notes that it excludes itself from the leaderboard to avoid a conflict of interest, given that it offers a commercial RAG product. Submissions must be reproducible: open-source systems must provide a reproduction guide, while closed-source systems must provide a sandbox or endpoint for verification. Submissions are made by contacting the Onyx team directly.

Current Status

The repository is actively maintained under the MIT license. The accompanying arXiv paper (2605.05253) is titled "EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge" and lists a 2026 publication year. The dataset is downloadable from GitHub releases or HuggingFace, and the leaderboard is live.

Pricing

Open Source

Fully open-source benchmark dataset and evaluation code available on GitHub and HuggingFace at no cost.

500,000+ enterprise documents
500 benchmark questions
Answer evaluation scripts
Dataset generation framework
MIT license

Capabilities

Key Features

500,000+ simulated enterprise documents
500 benchmark questions across 10 categories
9 simulated data source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, Confluence)
Public leaderboard on HuggingFace Spaces
Answer evaluation scripts included
Dataset generation framework for custom industries and scales
100 additional metadata-dependent questions
MIT-licensed open-source code
HuggingFace dataset hosting
arXiv paper with methodology documentation

Integrations

HuggingFace

GitHub

arXiv

API Available

Topics

Retrieval-Augmented Generation, LLM Evaluations, Academic Research

Alternatives

RAGFlow, Haystack, RAG Techniques
