# PageIndex

> Vectorless, reasoning-based RAG system that builds hierarchical tree indexes from long documents and uses LLM reasoning for context-aware retrieval — no vector DB or chunking required.

PageIndex is an open-source RAG framework developed by Vectify AI that replaces vector similarity search with LLM-driven tree search over structured document indexes. It is available as a self-hosted Python package, a cloud chat platform, and an MCP/API service for developers and enterprises. The project is authored by a team of AI researchers from UCL and Oxford with backgrounds at Anthropic and UiPath.

## What It Is

PageIndex is a vectorless, reasoning-based retrieval-augmented generation (RAG) system. Instead of embedding documents into vector space and retrieving by cosine similarity, it builds a hierarchical "table of contents" tree index from a PDF or Markdown document, then uses an LLM to reason over that tree — simulating how a human expert would navigate a complex document. The core insight is that **similarity ≠ relevance**: professional documents require multi-step reasoning to find the right section, not approximate nearest-neighbor lookup.

Retrieval works in two steps:
- **Index generation**: the document is parsed into a semantic tree of titled nodes, each with a page range and an LLM-generated summary.
- **Tree search**: at query time, an LLM reasons over the tree to identify and retrieve the most relevant nodes, incorporating full conversation history and domain context.
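
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not PageIndex's actual implementation: the `Node` fields, the `tree_search` function, and the sample report are all hypothetical, and `keyword_selector` is a deliberately crude stand-in for the LLM call that would decide which branches to open.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of the document tree (field names are illustrative)."""
    title: str
    start_page: int
    end_page: int
    summary: str
    children: List["Node"] = field(default_factory=list)


# A selector receives the query and sibling nodes and returns those worth opening.
Selector = Callable[[str, List[Node]], List[Node]]


def keyword_selector(query: str, nodes: List[Node]) -> List[Node]:
    """Stand-in for the LLM: keep nodes that share a word with the query."""
    words = set(query.lower().split())
    return [
        n for n in nodes
        if words & set((n.title + " " + n.summary).lower().split())
    ]


def tree_search(root: Node, query: str, select: Selector) -> List[Node]:
    """Descend from the root, following only the branches the selector picks."""
    hits: List[Node] = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if not node.children:
            hits.append(node)  # a leaf section is a retrieval result
            continue
        frontier.extend(select(query, node.children))
    return hits


# Hypothetical index for a financial report.
report = Node("Annual Report", 1, 120, "Full report", [
    Node("Financial Statements", 40, 90, "Audited statements and notes", [
        Node("Balance Sheet", 41, 44, "Assets, liabilities and equity"),
        Node("Cash Flow Statement", 45, 48,
             "Operating, investing and financing cash flows"),
    ]),
    Node("Risk Factors", 10, 39, "Market, credit and regulatory risk"),
])

for hit in tree_search(report, "financing cash statements", keyword_selector):
    print(f"{hit.title} (pages {hit.start_page}-{hit.end_page})")
```

Because the selector is passed in as a plain callable, the word-overlap stub could be swapped for a function that prompts a model with the sibling titles and summaries, leaving the traversal unchanged.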

## Architecture: No Vectors, No Chunks

Traditional RAG pipelines split documents into fixed-size chunks, embed them, and store them in a vector database. PageIndex discards all three of those steps. Documents are organized into natural sections that mirror the document's own structure. Retrieval is traceable — every answer cites the specific page and section from which it was drawn, making results interpretable rather than opaque. The open-source package supports standard PDF parsing and Markdown files; the cloud service adds enhanced OCR and a more robust tree-building pipeline for complex PDFs.

Key architectural properties:
- No vector database dependency
- No chunking — sections follow document structure
- Context-aware retrieval that incorporates conversation history
- Page and section references for full traceability
- Multi-LLM support via LiteLLM (OpenAI and other providers)
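
To make the context-aware and traceability properties concrete, the sketch below builds a tree-search prompt that includes conversation history, then resolves a model's reply back to page references. Everything here is an assumption for illustration: the `node_id`, `pages`, and `nodes` field names, the prompt wording, and the reply format are not taken from PageIndex, and the model response is hard-coded rather than fetched from an LLM.

```python
import json
from typing import Dict, List

# Hypothetical tree index; field names are assumptions for illustration.
tree = {
    "node_id": "0001", "title": "Annual Report", "pages": [1, 120],
    "nodes": [
        {"node_id": "0002", "title": "Risk Factors",
         "pages": [10, 39], "nodes": []},
        {"node_id": "0003", "title": "Financial Statements",
         "pages": [40, 90], "nodes": []},
    ],
}


def flatten(node: Dict) -> Dict[str, Dict]:
    """Map node_id -> node so a model's answer can be resolved to pages."""
    out = {node["node_id"]: node}
    for child in node["nodes"]:
        out.update(flatten(child))
    return out


def build_prompt(tree: Dict, query: str, history: List[str]) -> str:
    """Combine the outline, prior turns, and the question into one prompt."""
    return (
        "Document outline (JSON):\n" + json.dumps(tree, indent=2) + "\n\n"
        "Conversation so far:\n" + "\n".join(history) + "\n\n"
        f"Question: {query}\n"
        'Reply with JSON: {"node_ids": [...]} listing the relevant sections.'
    )


def cite(tree: Dict, reply: str) -> List[str]:
    """Turn the model's JSON reply into human-readable page references."""
    index = flatten(tree)
    ids = json.loads(reply)["node_ids"]
    return [
        f'{index[i]["title"]}, pp. {index[i]["pages"][0]}-{index[i]["pages"][1]}'
        for i in ids if i in index  # skip ids that are not in the tree
    ]


prompt = build_prompt(tree, "What was total debt?",
                      ["user: Summarise the 2023 report."])
sample_reply = '{"node_ids": ["0003"]}'  # hard-coded stand-in for an LLM reply
print(cite(tree, sample_reply))
```

Resolving node identifiers against the index, rather than trusting free-text citations, is one way to keep every reference tied to a section that actually exists in the document.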

## Deployment Options

PageIndex offers three deployment paths:
- **Self-hosted**: run the open-source Python package locally with standard PDF parsing; install via `pip` and point it at any PDF or Markdown file.
- **Cloud service**: production-grade pipeline with enhanced OCR, accessible via the PageIndex Chat platform, MCP integration, or REST API.
- **Enterprise**: private or on-premises deployment; contact the team for details.

The self-hosted path requires an LLM API key (e.g., OpenAI) and a few CLI commands. The cloud service is accessible immediately through the chat interface without any setup.

## Performance Signal: FinanceBench

The PageIndex team reports that Mafin 2.5 — a reasoning-based RAG system for financial document analysis powered by PageIndex — achieved 98.7% accuracy on the FinanceBench benchmark, which tests question answering over SEC filings and earnings disclosures. The team attributes this result to PageIndex's hierarchical indexing and reasoning-driven retrieval, which they claim significantly outperforms traditional vector-based RAG on this benchmark. Full benchmark results are published in the VectifyAI/Mafin2.5-FinanceBench GitHub repository.

## Update: Agentic Vectorless RAG and PageIndex File System

Recent updates to the project include two notable additions. The **PageIndex File System** extends the tree index to corpus-level search, allowing PageIndex to reason over millions of documents rather than a single file by adding a file-level tree layer above individual document trees. The **Agentic Vectorless RAG** example demonstrates an end-to-end agentic pipeline using the OpenAI Agents SDK with self-hosted PageIndex, providing a minimal but complete reference implementation.

The project is cited as: Mingtian Zhang, Yu Tang and PageIndex Team, "PageIndex: Next-Generation Vectorless, Reasoning-based RAG," PageIndex Blog, Sep 2025.
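
The idea of a file-level layer above per-document trees, as described in this section, can be sketched as a two-stage search: first choose files, then choose sections within them. This is a rough illustration under stated assumptions, not the PageIndex File System's real design. The corpus layout, file names, and `corpus_search` function are hypothetical, and `overlap_picker` is a word-overlap stub standing in for LLM reasoning.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical corpus: a short description per file sits above each file's
# section list (flattened to one level to keep the sketch small).
corpus: Dict[str, Dict] = {
    "acme_10k_2023.pdf": {
        "summary": "Acme annual report 2023",
        "sections": [("Risk Factors", 10, 39),
                     ("Financial Statements", 40, 90)],
    },
    "acme_handbook.pdf": {
        "summary": "Acme employee handbook",
        "sections": [("Leave Policy", 5, 12), ("Expenses", 13, 20)],
    },
}

# A picker returns the indexes of the candidate texts worth opening.
Picker = Callable[[str, List[str]], List[int]]


def overlap_picker(query: str, candidates: List[str]) -> List[int]:
    """Stand-in for the LLM: pick candidates sharing a word with the query."""
    words = set(query.lower().split())
    return [i for i, text in enumerate(candidates)
            if words & set(text.lower().split())]


def corpus_search(query: str, pick: Picker) -> List[Tuple[str, str, int, int]]:
    """Return (file, section, start_page, end_page) for each selected section."""
    results = []
    names = list(corpus)
    # Stage 1: choose files using the file-level layer only.
    for f in pick(query, [corpus[n]["summary"] for n in names]):
        sections = corpus[names[f]]["sections"]
        # Stage 2: choose sections inside each chosen file.
        for s in pick(query, [title for title, _, _ in sections]):
            title, start, end = sections[s]
            results.append((names[f], title, start, end))
    return results


print(corpus_search("annual report risk", overlap_picker))
```

The second stage only ever sees the sections of files selected in the first, which is what lets the amount of text considered per query stay small as the corpus grows.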

## Who It Is For

PageIndex targets developers and enterprises working with long, complex professional documents — financial reports, regulatory filings, legal manuals, academic textbooks, and technical documentation that exceeds LLM context windows. The chat platform serves non-technical users who need verifiable, source-grounded answers from uploaded documents. The MCP and API interfaces serve developers integrating document intelligence into their own applications or agent pipelines.

## Features
- Vectorless RAG — no vector database or embeddings required
- Hierarchical tree index generation from PDF and Markdown documents
- Reasoning-based retrieval via LLM tree search
- Context-aware retrieval incorporating conversation history
- Page and section references for full traceability
- Agentic vectorless RAG with OpenAI Agents SDK
- PageIndex File System for corpus-scale search over millions of documents
- Vision-based RAG over PDF page images (no OCR)
- Multi-LLM support via LiteLLM
- Cloud service with enhanced OCR and tree-building pipeline
- MCP integration for developer workflows
- REST API access
- Chat platform for non-technical users
- Enterprise private/on-prem deployment option

## Integrations
OpenAI, LiteLLM, OpenAI Agents SDK, MCP (Model Context Protocol), REST API

## Platforms
WEB, API, CLI, DEVELOPER_SDK

## Pricing
Open Source

## Links
- Website: https://pageindex.ai
- Documentation: https://docs.pageindex.ai
- Repository: https://github.com/VectifyAI/PageIndex
- EveryDev.ai: https://www.everydev.ai/tools/pageindex
