EnterpriseRAG-Bench
An open-source benchmark dataset of 500,000+ enterprise documents and 500 questions for evaluating RAG systems on realistic company internal data.
At a Glance
Fully open-source benchmark dataset and evaluation code available on GitHub and HuggingFace at no cost.
Listed May 2026
About EnterpriseRAG-Bench
EnterpriseRAG-Bench is an open-source benchmark released by Onyx (onyx.app) that provides a large-scale dataset of simulated company-internal documents and curated questions for evaluating Retrieval-Augmented Generation (RAG) systems. It is available on GitHub under the MIT license and hosted on HuggingFace, with an accompanying public leaderboard. The project also includes an arXiv paper (2605.05253) authored by researchers from the Onyx team.
What It Is
EnterpriseRAG-Bench fills a gap in the RAG and information retrieval evaluation landscape: most existing datasets focus on publicly accessible content (web search results, Stack Overflow, etc.), whereas this benchmark focuses entirely on company-internal data. The dataset simulates a fictional AI inference company called "Redwood Inference" and spans nine kinds of enterprise knowledge source, from Slack messages and emails to CRM records and engineering tickets.
Dataset Composition
The corpus contains slightly over 500,000 documents drawn from nine simulated source types (per-source counts are approximate):
- Slack (~275,000): Internal channels and team discussions
- Gmail (~120,000): Email threads from management, sales, and ICs
- Linear (~35,000): Engineering, product, and design tickets
- Google Drive (~25,000): Shared files and collaborative documents
- HubSpot (~15,000): CRM records for sales
- Fireflies (~10,000): Meeting transcripts
- GitHub (~8,000): Pull requests and comments
- Jira (~6,000): Support tickets
- Confluence (~5,000): Wikis, runbooks, and structured documentation
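To make the volume distribution concrete, the short Python sketch below converts the approximate per-source counts listed above into shares of the corpus. The counts are the rounded figures from this list, so they total roughly 499,000 rather than the exact corpus size. The names SOURCE_COUNTS and source_shares are illustrative and are not part of the benchmark's code.

```python
# Approximate per-source document counts, copied from the list above.
# These are rounded figures, not exact corpus statistics.
SOURCE_COUNTS = {
    "Slack": 275_000,
    "Gmail": 120_000,
    "Linear": 35_000,
    "Google Drive": 25_000,
    "HubSpot": 15_000,
    "Fireflies": 10_000,
    "GitHub": 8_000,
    "Jira": 6_000,
    "Confluence": 5_000,
}


def source_shares(counts):
    """Return each source's fraction of the total document count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    total = sum(SOURCE_COUNTS.values())
    print(f"approximate total: {total:,}")
    for name, share in source_shares(SOURCE_COUNTS).items():
        print(f"{name:<13}{share:6.1%}")
```

With these rounded figures, Slack and Gmail together make up roughly 79% of the corpus, so a retriever tuned only on long-form documents such as Confluence pages would see a small minority of the data.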
Question Categories
The benchmark includes 500 questions across 10 categories designed to stress-test different RAG capabilities: Basic (175), Semantic (125), Intra-Document Reasoning (40), Project Related (40), Constrained (30), Conflicting Info (20), Completeness (20), Miscellaneous (20), High Level (10), and Info Not Found (20). An additional 100 metadata-dependent questions are available separately for teams interested in metadata-aware RAG, though these are excluded from the leaderboard due to differing evaluation criteria.
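Because the category sizes are uneven (Basic and Semantic together account for 300 of the 500 questions), an overall accuracy figure can hide weak performance on the smaller categories. The Python sketch below shows one way to report both a per-question (micro) average and a per-category (macro) average. The input format and the function name micro_and_macro are hypothetical, and this is not a description of how the leaderboard or the included evaluation scripts compute scores.

```python
from statistics import mean

# Question counts per category, copied from the paragraph above.
CATEGORY_COUNTS = {
    "Basic": 175,
    "Semantic": 125,
    "Intra-Document Reasoning": 40,
    "Project Related": 40,
    "Constrained": 30,
    "Conflicting Info": 20,
    "Completeness": 20,
    "Miscellaneous": 20,
    "High Level": 10,
    "Info Not Found": 20,
}


def micro_and_macro(correct_by_category):
    """Return (micro, macro) accuracy.

    correct_by_category is a hypothetical input mapping a category name
    to the number of questions answered correctly in that category.
    Missing categories are treated as zero correct answers.
    """
    total_questions = sum(CATEGORY_COUNTS.values())
    total_correct = sum(correct_by_category.get(c, 0) for c in CATEGORY_COUNTS)
    micro = total_correct / total_questions
    macro = mean(
        correct_by_category.get(c, 0) / n for c, n in CATEGORY_COUNTS.items()
    )
    return micro, macro


if __name__ == "__main__":
    # Hypothetical system that answers only Basic and Semantic questions.
    micro, macro = micro_and_macro({"Basic": 175, "Semantic": 125})
    print(f"micro={micro:.2f} macro={macro:.2f}")
```

In the example at the bottom, the hypothetical system gets 60% of all questions right but only 20% when each category is weighted equally, which is why reporting per-category results alongside an overall number can be informative.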
Design Principles
Five principles guide the dataset's construction, as described in the project's methodology documentation:
- Cross-document coherence: Generation starts with human-in-the-loop scaffolding so documents share a common foundation
- Realistic volume distribution: Document ratios across source types reflect real-world patterns
- Realistic noise: Misfiled documents, near-duplicates, and conflicting facts are deliberately introduced
- Internal terminology: Project codenames, acronyms, and organizational jargon are embedded throughout
- Generality: The generation framework supports diverse industries, company stages, and organizational structures
Leaderboard and Submission
A public leaderboard is hosted on HuggingFace Spaces. Onyx notes that it excludes itself from the leaderboard to avoid a conflict of interest, given that it offers a commercial RAG product. Submissions must be reproducible: open-source systems must provide a reproduction guide, while closed-source systems must provide a sandbox or endpoint for verification. Submissions are made by contacting the Onyx team directly.
Current Status
The repository is actively maintained under the MIT license. The accompanying arXiv paper (2605.05253) is titled "EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge" and lists a 2026 publication year. The dataset is downloadable from GitHub releases or HuggingFace, and the leaderboard is live.
Pricing
Open Source
- 500,000+ enterprise documents
- 500 benchmark questions
- Answer evaluation scripts
- Dataset generation framework
- MIT license
Capabilities
Key Features
- 500,000+ simulated enterprise documents
- 500 benchmark questions across 10 categories
- 9 simulated data source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, Confluence)
- Public leaderboard on HuggingFace Spaces
- Answer evaluation scripts included
- Dataset generation framework for custom industries and scales
- 100 additional metadata-dependent questions
- MIT-licensed open-source code
- HuggingFace dataset hosting
- arXiv paper with methodology documentation
