PageIndex

Name: PageIndex
Availability: OnlineOnly
Author: Vectify AI

Vectorless, reasoning-based RAG system that builds hierarchical tree indexes from long documents and uses LLM reasoning for context-aware retrieval — no vector DB or chunking required.

Visit Website

At a Glance

Pricing

Open Source

Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.

Engagement

Available On

Web

API

CLI

SDK

Vectify AIRuislip, United KingdomEst. 2023$1.42M raised

Listed May 2026

About PageIndex

PageIndex is an open-source RAG framework developed by Vectify AI that replaces vector similarity search with LLM-driven tree search over structured document indexes. It is available as a self-hosted Python package, a cloud chat platform, and an MCP/API service for developers and enterprises. The project is authored by a team of AI researchers from UCL and Oxford with backgrounds at Anthropic and UiPath.

What It Is

PageIndex is a vectorless, reasoning-based retrieval-augmented generation (RAG) system. Instead of embedding documents into vector space and retrieving by cosine similarity, it builds a hierarchical "table of contents" tree index from a PDF or Markdown document, then uses an LLM to reason over that tree — simulating how a human expert would navigate a complex document. The core insight is that similarity ≠ relevance: professional documents require multi-step reasoning to find the right section, not approximate nearest-neighbor lookup.

Retrieval works in two steps:

Index generation: the document is parsed into a semantic tree of titled nodes, each with a page range and LLM-generated summary.
Tree search: at query time, an LLM reasons over the tree to identify and retrieve the most relevant nodes, incorporating full conversation history and domain context.

Architecture: No Vectors, No Chunks

Traditional RAG pipelines split documents into fixed-size chunks, embed them, and store them in a vector database. PageIndex discards all three of those steps. Documents are organized into natural sections that mirror the document's own structure. Retrieval is traceable — every answer cites the specific page and section from which it was drawn, making results interpretable rather than opaque. The open-source package supports standard PDF parsing and Markdown files; the cloud service adds enhanced OCR and a more robust tree-building pipeline for complex PDFs.

Key architectural properties:

No vector database dependency
No chunking — sections follow document structure
Context-aware retrieval that incorporates conversation history
Page and section references for full traceability
Multi-LLM support via LiteLLM (OpenAI, and other providers)

Deployment Options

PageIndex offers three deployment paths:

Self-hosted: run the open-source Python package locally with standard PDF parsing; install via pip and point at any PDF or Markdown file.
Cloud service: production-grade pipeline with enhanced OCR, accessible via the PageIndex Chat platform, MCP integration, or REST API.
Enterprise: private or on-premises deployment; contact the team for details.

The self-hosted path requires an LLM API key (e.g., OpenAI) and a few CLI commands. The cloud service is accessible immediately through the chat interface without any setup.

Performance Signal: FinanceBench

The PageIndex team reports that Mafin 2.5 — a reasoning-based RAG system for financial document analysis powered by PageIndex — achieved 98.7% accuracy on the FinanceBench benchmark, which tests question answering over SEC filings and earnings disclosures. The team attributes this result to PageIndex's hierarchical indexing and reasoning-driven retrieval, which they claim significantly outperforms traditional vector-based RAG on this benchmark. Full benchmark results are published in the VectifyAI/Mafin2.5-FinanceBench GitHub repository.

Update: Agentic Vectorless RAG and PageIndex File System

Recent updates to the project include two notable additions. The PageIndex File System extends the tree index to corpus-level search, allowing PageIndex to reason over millions of documents rather than a single file by adding a file-level tree layer above individual document trees. The Agentic Vectorless RAG example demonstrates an end-to-end agentic pipeline using the OpenAI Agents SDK with self-hosted PageIndex, providing a minimal but complete reference implementation. The project is cited as: Mingtian Zhang, Yu Tang and PageIndex Team, "PageIndex: Next-Generation Vectorless, Reasoning-based RAG," PageIndex Blog, Sep 2025.

Who It Is For

PageIndex targets developers and enterprises working with long, complex professional documents — financial reports, regulatory filings, legal manuals, academic textbooks, and technical documentation that exceeds LLM context windows. The chat platform serves non-technical users who need verifiable, source-grounded answers from uploaded documents. The MCP and API interfaces serve developers integrating document intelligence into their own applications or agent pipelines.

Community Discussions

Be the first to start a conversation about PageIndex

Share your experience with PageIndex, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.

Full PageIndex source code under MIT License
Standard PDF parsing
Markdown document support
Hierarchical tree index generation
Reasoning-based retrieval

Capabilities

Key Features

Vectorless RAG — no vector database or embeddings required
Hierarchical tree index generation from PDF and Markdown documents
Reasoning-based retrieval via LLM tree search
Context-aware retrieval incorporating conversation history
Page and section references for full traceability
Agentic vectorless RAG with OpenAI Agents SDK
PageIndex File System for corpus-scale search over millions of documents
Vision-based RAG over PDF page images (no OCR)
Multi-LLM support via LiteLLM
Cloud service with enhanced OCR and tree-building pipeline
MCP integration for developer workflows
REST API access
Chat platform for non-technical users
Enterprise private/on-prem deployment option

Integrations

OpenAI

LiteLLM

OpenAI Agents SDK

MCP (Model Context Protocol)

REST API

API Available

View Docs

Back to all tools Suggest an edit

PageIndex

Retrieval-Augmented Generation

Vectorless, reasoning-based RAG system that builds hierarchical tree indexes from long documents and uses LLM reasoning for context-aware retrieval — no vector DB or chunking required.

Visit Website

At a Glance

Pricing

Open Source

Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.

Engagement

ratings

discussions

8views

Available On

Web

API

CLI

SDK

Resources

Website Docs GitHub llms.txt

Topics

Retrieval-Augmented Generation AI Development Libraries Document Management

Alternatives

RAGFlow Agentset Haystack

Developer

Vectify AIRuislip, United KingdomEst. 2023$1.42M raised

Listed May 2026

About PageIndex

What It Is

Retrieval works in two steps:

Index generation: the document is parsed into a semantic tree of titled nodes, each with a page range and LLM-generated summary.
Tree search: at query time, an LLM reasons over the tree to identify and retrieve the most relevant nodes, incorporating full conversation history and domain context.

Architecture: No Vectors, No Chunks

Key architectural properties:

No vector database dependency
No chunking — sections follow document structure
Context-aware retrieval that incorporates conversation history
Page and section references for full traceability
Multi-LLM support via LiteLLM (OpenAI, and other providers)

Deployment Options

PageIndex offers three deployment paths:

Self-hosted: run the open-source Python package locally with standard PDF parsing; install via pip and point at any PDF or Markdown file.
Cloud service: production-grade pipeline with enhanced OCR, accessible via the PageIndex Chat platform, MCP integration, or REST API.
Enterprise: private or on-premises deployment; contact the team for details.

The self-hosted path requires an LLM API key (e.g., OpenAI) and a few CLI commands. The cloud service is accessible immediately through the chat interface without any setup.

Performance Signal: FinanceBench

Update: Agentic Vectorless RAG and PageIndex File System

Who It Is For

Community Discussions

Be the first to start a conversation about PageIndex

Share your experience with PageIndex, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Self-hosted open-source package available under the MIT License. Free to use, modify, and distribute.

Full PageIndex source code under MIT License
Standard PDF parsing
Markdown document support
Hierarchical tree index generation
Reasoning-based retrieval

Capabilities

Key Features

Vectorless RAG — no vector database or embeddings required
Hierarchical tree index generation from PDF and Markdown documents
Reasoning-based retrieval via LLM tree search
Context-aware retrieval incorporating conversation history
Page and section references for full traceability
Agentic vectorless RAG with OpenAI Agents SDK
PageIndex File System for corpus-scale search over millions of documents
Vision-based RAG over PDF page images (no OCR)
Multi-LLM support via LiteLLM
Cloud service with enhanced OCR and tree-building pipeline
MCP integration for developer workflows
REST API access
Chat platform for non-technical users
Enterprise private/on-prem deployment option

Integrations

OpenAI

LiteLLM

OpenAI Agents SDK

MCP (Model Context Protocol)

REST API

API Available

View Docs

Back to all tools Suggest an edit