PageIndex
Vectorless, reasoning-based RAG system that builds hierarchical tree indexes from long documents and uses LLM reasoning for context-aware retrieval — no vector DB or chunking required.
At a Glance
Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.
Engagement
Available On
Alternatives
Listed May 2026
About PageIndex
PageIndex is an open-source RAG framework developed by Vectify AI that replaces vector similarity search with LLM-driven tree search over structured document indexes. It is available as a self-hosted Python package, a cloud chat platform, and an MCP/API service for developers and enterprises. The project is authored by a team of AI researchers from UCL and Oxford with backgrounds at Anthropic and UiPath.
What It Is
PageIndex is a vectorless, reasoning-based retrieval-augmented generation (RAG) system. Instead of embedding documents into vector space and retrieving by cosine similarity, it builds a hierarchical "table of contents" tree index from a PDF or Markdown document, then uses an LLM to reason over that tree — simulating how a human expert would navigate a complex document. The core insight is that similarity ≠ relevance: professional documents require multi-step reasoning to find the right section, not approximate nearest-neighbor lookup.
Retrieval works in two steps:
- Index generation: the document is parsed into a semantic tree of titled nodes, each with a page range and LLM-generated summary.
- Tree search: at query time, an LLM reasons over the tree to identify and retrieve the most relevant nodes, incorporating full conversation history and domain context.
Architecture: No Vectors, No Chunks
Traditional RAG pipelines split documents into fixed-size chunks, embed them, and store them in a vector database. PageIndex discards all three of those steps. Documents are organized into natural sections that mirror the document's own structure. Retrieval is traceable — every answer cites the specific page and section from which it was drawn, making results interpretable rather than opaque. The open-source package supports standard PDF parsing and Markdown files; the cloud service adds enhanced OCR and a more robust tree-building pipeline for complex PDFs.
Key architectural properties:
- No vector database dependency
- No chunking — sections follow document structure
- Context-aware retrieval that incorporates conversation history
- Page and section references for full traceability
- Multi-LLM support via LiteLLM (OpenAI, and other providers)
Deployment Options
PageIndex offers three deployment paths:
- Self-hosted: run the open-source Python package locally with standard PDF parsing; install via
pipand point at any PDF or Markdown file. - Cloud service: production-grade pipeline with enhanced OCR, accessible via the PageIndex Chat platform, MCP integration, or REST API.
- Enterprise: private or on-premises deployment; contact the team for details.
The self-hosted path requires an LLM API key (e.g., OpenAI) and a few CLI commands. The cloud service is accessible immediately through the chat interface without any setup.
Performance Signal: FinanceBench
The PageIndex team reports that Mafin 2.5 — a reasoning-based RAG system for financial document analysis powered by PageIndex — achieved 98.7% accuracy on the FinanceBench benchmark, which tests question answering over SEC filings and earnings disclosures. The team attributes this result to PageIndex's hierarchical indexing and reasoning-driven retrieval, which they claim significantly outperforms traditional vector-based RAG on this benchmark. Full benchmark results are published in the VectifyAI/Mafin2.5-FinanceBench GitHub repository.
Update: Agentic Vectorless RAG and PageIndex File System
Recent updates to the project include two notable additions. The PageIndex File System extends the tree index to corpus-level search, allowing PageIndex to reason over millions of documents rather than a single file by adding a file-level tree layer above individual document trees. The Agentic Vectorless RAG example demonstrates an end-to-end agentic pipeline using the OpenAI Agents SDK with self-hosted PageIndex, providing a minimal but complete reference implementation. The project is cited as: Mingtian Zhang, Yu Tang and PageIndex Team, "PageIndex: Next-Generation Vectorless, Reasoning-based RAG," PageIndex Blog, Sep 2025.
Who It Is For
PageIndex targets developers and enterprises working with long, complex professional documents — financial reports, regulatory filings, legal manuals, academic textbooks, and technical documentation that exceeds LLM context windows. The chat platform serves non-technical users who need verifiable, source-grounded answers from uploaded documents. The MCP and API interfaces serve developers integrating document intelligence into their own applications or agent pipelines.
Community Discussions
Be the first to start a conversation about PageIndex
Share your experience with PageIndex, ask questions, or help others learn from your insights.
Pricing
Open Source
Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.
- Full PageIndex source code under MIT License
- Standard PDF parsing
- Markdown document support
- Hierarchical tree index generation
- Reasoning-based retrieval
Capabilities
Key Features
- Vectorless RAG — no vector database or embeddings required
- Hierarchical tree index generation from PDF and Markdown documents
- Reasoning-based retrieval via LLM tree search
- Context-aware retrieval incorporating conversation history
- Page and section references for full traceability
- Agentic vectorless RAG with OpenAI Agents SDK
- PageIndex File System for corpus-scale search over millions of documents
- Vision-based RAG over PDF page images (no OCR)
- Multi-LLM support via LiteLLM
- Cloud service with enhanced OCR and tree-building pipeline
- MCP integration for developer workflows
- REST API access
- Chat platform for non-technical users
- Enterprise private/on-prem deployment option
