PageIndex: How Vectorless Reasoning-Based RAG Eliminates Vector DB Complexity & Boosts Retrieval Accuracy

Every data scientist who has built a traditional RAG (Retrieval-Augmented Generation) pipeline knows the ritual: dump your documents into chunks, generate embeddings, store them in ChromaDB or Pinecone, and hope the cosine similarity scores bring back what you actually need. Then comes the endless tuning — adjusting chunk sizes, tweaking embedding models, fusing BM25 with vectors, chasing that elusive balance between precision and recall. And even then, when a user asks “What were the Q3 risk factors for our derivatives portfolio?”, the system might return passages about marketing budgets from an unrelated section because they shared similar vocabulary. Vector similarity does not equal relevance.

This is the fundamental problem that PageIndex, an open-source project by Vectify AI, solves by completely rethinking how document retrieval works. With 30,297 GitHub stars and growing at an astonishing rate of 4,250 stars per week, PageIndex takes a radically different approach: instead of converting text into dense vector embeddings, it builds a hierarchical tree index of your documents and uses LLM reasoning to navigate that tree — mimicking how human experts extract knowledge from complex reports. The result is retrieval that is explainable, traceable, context-aware, and achieves state-of-the-art results including 98.7% accuracy on the FinanceBench benchmark.

Built on the philosophy that similarity ≠ relevance and that relevance requires reasoning, PageIndex represents a paradigm shift from approximate vector search to exact, reasoning-driven document navigation. Whether you’re analyzing SEC filings, reviewing legal contracts, scanning academic papers, or debugging technical manuals, this article will show you how PageIndex transforms the entire RAG landscape.


What Is PageIndex?

PageIndex is a vectorless, reasoning-based RAG system that replaces the traditional vector database pipeline with a document-aware, structure-preserving approach. Instead of fragmenting your PDFs into arbitrary chunks and embedding them into high-dimensional space, PageIndex constructs a semantic tree index — essentially an intelligent table of contents — that mirrors the logical structure of your document.

The core insight behind PageIndex draws inspiration from AlphaGo’s Monte Carlo Tree Search (MCTS). Just as AlphaGo explored a branching tree of possible moves to find the optimal path to victory, PageIndex explores a branching tree of document sections to find the optimal path to the relevant information. This “tree search” approach means the system doesn’t just match keywords or find similar vectors — it reasons through the document hierarchy to understand exactly which pages contain the answer to your question.

Traditional RAG vs. PageIndex: A Fundamental Difference

Traditional RAG operates on a simple principle: break text apart, embed it, and retrieve via nearest-neighbor search. PageIndex flips this entirely:

| Aspect | Traditional RAG (ChromaDB/FAISS/Pinecone) | PageIndex |
| --- | --- | --- |
| Index Type | Dense vector embeddings | Hierarchical tree structure |
| Document Unit | Artificial chunks (500-1000 tokens) | Natural document sections |
| Retrieval Method | Cosine similarity / ANN search | LLM reasoning over tree structure |
| Explainability | Opaque ("vibe retrieval") | Full traceability with page references |
| Context Awareness | Static retrieval per query | Adapts to conversation history |
| Human-like Navigation | No | Yes — simulates expert reading |

When a user queries a 500-page financial report, a traditional RAG system might retrieve the top-5 most similar chunks based on embedding proximity. But those chunks could span dozens of unrelated pages, and there is no way to know whether the most relevant section was included in the top-5 candidates. With PageIndex, the LLM first examines the tree index, identifies which branches are most likely to contain the answer, and then traverses down only the relevant branches — just like a finance analyst would flip through a report to find the right chapter.

The implications for accuracy, speed, and cost are profound. By narrowing the search to relevant sections early, PageIndex reduces unnecessary token consumption while dramatically improving retrieval quality.


Core Features

PageIndex offers a suite of features designed specifically to address the limitations of vector-based RAG systems:

1. No Vector Database Needed

Unlike traditional RAG pipelines that require setting up and maintaining vector databases like ChromaDB, FAISS, Pinecone, or Weaviate, PageIndex eliminates the need for any specialized vector infrastructure. Your documents are processed directly by the LLM using their natural structure. This simplifies your deployment stack significantly — you need only an LLM API key and a Python environment. There are no vector indexes to rebuild, no dimensionality settings to tune, and no embedding model updates to synchronize with your indexed documents.

2. No Chunking

Chunking is arguably the most painful decision in any RAG implementation. Too small, and you lose context; too large, and you drown the LLM in irrelevant text. PageIndex circumvents this problem entirely by organizing documents into natural sections based on their inherent structure. Chapters, subsections, headings, and logical groupings become the indexing units — not arbitrary token boundaries. This preserves semantic coherence and ensures that retrieved sections contain complete, self-contained information.
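
To make the idea concrete, here is a minimal sketch of structure-aware splitting, in the spirit of PageIndex's Markdown handling but not its actual implementation: sections are cut at heading boundaries rather than at a fixed token count.

import re

def split_markdown_sections(text: str) -> list[dict]:
    """Split a Markdown document at heading boundaries, not token counts.

    Each node keeps its heading, level, and the full body beneath it,
    so retrieved units stay semantically self-contained.
    """
    sections, current = [], {"title": "(preamble)", "level": 0, "body": []}
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.+)", line)
        if m:
            sections.append(current)
            current = {"title": m.group(2), "level": len(m.group(1)), "body": []}
        else:
            current["body"].append(line)
    sections.append(current)
    return sections

doc = "# Report\nIntro.\n## Risks\nRate sensitivity...\n## Outlook\nGuidance..."
for node in split_markdown_sections(doc):
    print(node["level"], node["title"], "|", " ".join(node["body"]).strip())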

3. Better Explainability and Traceability

One of the most criticized aspects of vector-based RAG is its opacity. When the system returns five relevant-looking chunks, developers often cannot explain why those specific chunks were chosen beyond “they had high cosine similarity.” PageIndex provides full traceability: every retrieval decision can be traced through the reasoning steps that led the LLM to select particular nodes in the tree. Results include precise page numbers and section references, making it trivial to verify that the retrieved content genuinely answers the query.
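
The project does not mandate a single trace format, but conceptually every answer carries the chain of node decisions behind it. A hypothetical trace record (the field names are ours, not PageIndex's API) might look like this:

# Hypothetical structure: illustrates the kind of audit trail that
# reasoning-based retrieval makes possible; not an actual PageIndex object.
trace = {
    "query": "What were the Q3 risk factors for the derivatives portfolio?",
    "reasoning_path": [
        {"node_id": "0002", "title": "Risk Factors",
         "decision": "descend", "why": "section title matches the query topic"},
        {"node_id": "0005", "title": "Derivatives and Interest Rate Risk",
         "decision": "select", "why": "summary mentions Q3 derivatives exposure"},
    ],
    "retrieved_pages": "41-44",  # page-level reference for verification
}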

4. Context-Aware Retrieval

Traditional RAG treats each query in isolation. Even if you have a multi-turn conversation, the retrieval step typically doesn’t adapt based on earlier exchanges. PageIndex explicitly incorporates conversation history and domain knowledge into its reasoning process. If your second question follows up on something discussed in the first exchange, the retrieval engine understands the evolving context and adjusts its search accordingly. This makes PageIndex particularly powerful for conversational QA scenarios where meaning shifts across turns.
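
As a rough sketch of how this can work, the node-selection prompt simply carries prior turns alongside the tree index; the prompt wording below is ours, not the project's:

def build_selection_prompt(tree_json: str, history: list[dict], query: str) -> str:
    """Compose a node-selection prompt that includes prior conversation turns.

    With earlier exchanges visible, a follow-up such as "and for Q4?" can be
    resolved against the topic established in the previous turn.
    """
    turns = "\n".join(f"{t['role']}: {t['content']}" for t in history)
    return (
        "Document tree index:\n" + tree_json + "\n\n"
        "Conversation so far:\n" + turns + "\n\n"
        f"Current question: {query}\n"
        "Return the node_ids most likely to contain the answer, with reasons."
    )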

5. Human-Like Retrieval

The name “PageIndex” is deliberate — it evokes the act of flipping through pages and finding what you need through intuition and expertise. PageIndex simulates exactly this behavior: the LLM reads the tree index, forms hypotheses about where information lives, tests those hypotheses by navigating deeper into the tree, and refines its search iteratively. This human-like navigation pattern has proven remarkably effective for domain-specific tasks requiring deep analytical reasoning.

6. Financial Benchmark Leadership

PageIndex powers Mafin 2.5, a reasoning-based RAG system that achieved a groundbreaking 98.7% accuracy on the FinanceBench benchmark — a rigorous evaluation suite for financial document analysis. This state-of-the-art result significantly outperforms traditional vector-based RAG systems on tasks involving SEC filings, earnings reports, and regulatory disclosures. The FinanceBench leadership demonstrates that reasoning-based retrieval excels in domains where precision and accuracy are non-negotiable.


How PageIndex Works

Understanding PageIndex’s architecture requires looking at the two-phase process that underlies every retrieval operation: tree index generation followed by reasoning-based retrieval.

Phase 1: Generating the Tree Structure

When you provide a PDF document to PageIndex, the system processes it through the following pipeline:

PDF Input → Text Extraction → Section Detection → LLM Analysis → Tree Index Output
  1. Text Extraction: The PDF is parsed into raw text. PageIndex uses standard PDF parsing to extract text, headers, and structural elements from each page.

  2. Section Detection: The system analyzes the document layout to identify natural divisions — chapters, sections, subsections, lists, tables, and figures. For Markdown files, it uses heading markers (#, ##, ###) to determine structural levels.

  3. LLM-Powered Node Creation: An LLM examines each identified section and generates three critical pieces of metadata:

    • Title: A concise label for the section
    • Summary: A brief description of what the section contains
    • Page Range: The start and end page indices
  4. Hierarchical Assembly: Sections are nested into a parent-child tree structure. Chapter-level sections become root nodes, subsections become children, and so on. Each node carries its own metadata and may contain further child nodes.

Here is an example of the resulting tree structure:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve assesses overall financial stability...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring framework evaluates systemic risks..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated with international partners..."
    }
  ]
}

This JSON structure serves as the “table of contents” that drives the entire retrieval process. The tree captures the semantic relationships between sections — parent nodes represent broader topics, while child nodes drill down into specific subtopics.

Phase 2: Reasoning-Based Two-Step Retrieval

Once the tree index exists, answering questions becomes a reasoning exercise rather than a similarity search:

Step 1 — Tree Traversal: When a query arrives (e.g., “What did the Federal Reserve do regarding international cooperation in 2023?”), the LLM first reads the tree index. It reasons about which nodes are most relevant, effectively simulating how an expert would skim a table of contents before deciding where to look. The LLM selects promising nodes and descends the tree recursively until it reaches leaf nodes containing the target content.

Step 2 — Content Retrieval: Once the relevant leaf node(s) are identified, PageIndex fetches the actual text content from the specified page ranges. This two-step approach means the LLM never needs to process irrelevant content — it narrows the search intelligently before fetching any text.

The beauty of this approach lies in its recursive refinement. The LLM doesn’t make a single binary decision — it continuously reassesses its hypotheses as it traverses the tree. If a child node seems irrelevant, the reasoning engine backtracks and explores sibling nodes. This iterative deepening mirrors how a skilled analyst would work through a document.
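
A compressed sketch of that loop, assuming a LiteLLM-style chat client and glossing over error handling (this is our simplification, not the repository's code):

import json
from litellm import completion  # any OpenAI-compatible client works similarly

def traverse(tree: dict, query: str, model: str = "gpt-4o") -> list[dict]:
    """Descend the tree by asking the LLM which child nodes to explore."""
    frontier, leaves = [tree], []
    while frontier:
        node = frontier.pop()
        children = node.get("nodes", [])
        if not children:               # leaf reached: candidate answer section
            leaves.append(node)
            continue
        menu = [{"node_id": c["node_id"], "title": c["title"],
                 "summary": c.get("summary", "")} for c in children]
        resp = completion(model=model, messages=[{
            "role": "user",
            "content": f"Question: {query}\nSections: {json.dumps(menu)}\n"
                       "Reply with a JSON list of relevant node_ids only."}])
        # Optimistic parse: real code should validate the model's output.
        picked = json.loads(resp.choices[0].message.content)
        frontier.extend(c for c in children if c["node_id"] in picked)
    return leaves

Only selected branches are expanded, so pages outside them are never fetched; a production traversal would also revisit sibling nodes when a chosen branch turns out to be a dead end, as described above.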

The File-System Level Tree Layer

For scenarios involving millions of documents, PageIndex extends its tree architecture to the file-system level. This file-level tree layer allows PageIndex to reason over entire corpora, not just individual documents. Each document maintains its own internal tree, and these trees are organized under a filesystem directory structure — creating a global search space that scales to massive document collections without losing the benefits of structured, reasoning-based retrieval.
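
A minimal sketch of the idea, assuming each document's tree index has been saved as a .json file inside the corpus directory (the on-disk layout here is our assumption):

import json
from pathlib import Path

def build_corpus_tree(root: str) -> dict:
    """Mirror a directory of indexed documents as a single global tree.

    Directories become parent nodes; each saved per-document tree index
    becomes a subtree, so the same traversal logic works corpus-wide.
    """
    node = {"title": Path(root).name, "nodes": []}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            node["nodes"].append(build_corpus_tree(str(entry)))
        elif entry.suffix == ".json":
            with open(entry) as f:
                node["nodes"].append(json.load(f))
    return node

corpus_tree = build_corpus_tree("indexed_docs")  # reason over it like any tree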


Getting Started with PageIndex

Getting started with PageIndex is straightforward. The setup requires minimal dependencies and works with any OpenAI-compatible API provider through LiteLLM integration.

Step 1: Installation

First, clone the repository and install dependencies:

git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip install -r requirements.txt

Step 2: Configure Your LLM API Key

Create a .env file in the root directory with your LLM API key. PageIndex uses LiteLLM for multi-provider support, meaning you can use OpenAI, Anthropic, Google Gemini, or any other provider compatible with LiteLLM’s unified interface:

OPENAI_API_KEY=your_api_key_here

Or for other providers:

ANTHROPIC_API_KEY=your_anthropic_key_here
GEMINI_API_KEY=your_gemini_key_here

Step 3: Run PageIndex on Your Document

For PDF documents:

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

For Markdown documents:

python3 run_pageindex.py --md_path /path/to/your/document.md

Optional Parameters

You can fine-tune the indexing process with several command-line arguments:

python3 run_pageindex.py \
  --pdf_path /path/to/your/document.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 20 \
  --max-pages-per-node 10 \
  --max-tokens-per-node 20000 \
  --if-add-node-id yes \
  --if-add-node-summary yes \
  --if-add-doc-description yes

Parameter breakdown:

| Parameter | Default | Description |
| --- | --- | --- |
| --model | gpt-4o-2024-11-20 | LLM model used for tree generation and reasoning |
| --toc-check-pages | 20 | Number of initial pages checked for existing table of contents |
| --max-pages-per-node | 10 | Maximum pages allowed per tree node before splitting |
| --max-tokens-per-node | 20000 | Maximum tokens per tree node |
| --if-add-node-id | yes | Whether to assign unique IDs to tree nodes |
| --if-add-node-summary | yes | Whether to generate summaries for each node |
| --if-add-doc-description | yes | Whether to add a general document description |

Viewing the Generated Index

After running the command, you’ll receive a JSON output showing the generated tree structure. Examine it to verify that the hierarchical organization matches the document’s logical flow. Review examples of generated tree structures in the repository’s examples/documents/results directory.
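
A quick way to sanity-check the result is to load the JSON and print the hierarchy as an indented outline; the path below is illustrative, so point it at whatever file your run produced:

import json

def print_outline(node: dict, depth: int = 0) -> None:
    """Print each node's title and page range, indented by tree depth."""
    pages = f"pp. {node.get('start_index')}-{node.get('end_index')}"
    print("  " * depth + f"{node.get('title')} ({pages})")
    for child in node.get("nodes", []):
        print_outline(child, depth + 1)

with open("results/your_document_structure.json") as f:  # illustrative path
    tree = json.load(f)
for root in (tree if isinstance(tree, list) else [tree]):
    print_outline(root)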


Agentic RAG Example with OpenAI Agents SDK

PageIndex shines brightest when integrated into an agentic workflow. The examples/agentic_vectorless_rag_demo.py file demonstrates a complete, end-to-end document QA agent powered by the OpenAI Agents SDK.

Setting Up the Agentic Demo

First, install the optional OpenAI Agents SDK dependency:

pip3 install openai-agents

Then run the demo:

python3 examples/agentic_vectorless_rag_demo.py

The demo loads an attention-residuals PDF, generates its tree index, and creates an agent capable of answering questions about the document through tool-use reasoning.

Understanding the Agent Architecture

The demo agent defines three tools (a sketch of possible tool declarations follows the list):

  • get_document(): Returns document metadata (status, page count, name, description)
  • get_document_structure(): Returns the full tree structure index for identifying relevant page ranges
  • get_page_content(pages): Retrieves text content from specific pages using tight ranges (e.g., "5-7" for pages 5 to 7, "3,8" for pages 3 and 8)
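
A condensed sketch of how these tools might be declared with the SDK's function_tool decorator; the document accessors (doc_meta, doc_tree, read_pages) are hypothetical stand-ins for the demo's internals:

from agents import Agent, function_tool  # from the openai-agents package

from my_doc_store import doc_meta, doc_tree, read_pages  # hypothetical helpers

AGENT_SYSTEM_PROMPT = "..."  # the reasoning protocol shown below

@function_tool
def get_document() -> dict:
    """Return document metadata: status, page count, name, description."""
    return doc_meta()

@function_tool
def get_document_structure() -> dict:
    """Return the full tree index for locating relevant page ranges."""
    return doc_tree()

@function_tool
def get_page_content(pages: str) -> str:
    """Return text for a tight page spec such as "5-7" or "3,8"."""
    return read_pages(pages)

agent = Agent(
    name="PageIndex QA",
    instructions=AGENT_SYSTEM_PROMPT,
    tools=[get_document, get_document_structure, get_page_content],
)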

The agent follows a strict reasoning protocol:

AGENT_SYSTEM_PROMPT = """
You are PageIndex, a document QA assistant.
TOOL USE:
- Call get_document() first to confirm status and page/line count.
- Call get_document_structure() to identify relevant page ranges.
- Call get_page_content(pages="5-7") with tight ranges; never fetch the whole document.
- Before each tool call, output one short sentence explaining the reason.
Answer based only on tool output. Be concise.
"""

This prompt enforces disciplined tool usage. The agent must first inspect the document metadata, then examine the tree structure, then fetch only the narrowest possible page range. At no point does it waste tokens retrieving irrelevant content.

Real-Agent Interaction Pattern

When you ask a question, here’s what happens step-by-step:

User: "What are residual connections and why are they important?"

Agent reasoning:
→ Calls get_document() — confirms document has 18 pages
→ Calls get_document_structure() — identifies nodes covering "attention mechanisms"
and "residual connections" on pages 3-8
→ Calls get_page_content(pages="3-8") — fetches targeted content
→ Synthesizes answer from retrieved sections only

This demonstrates the core advantage of agentic vectorless RAG: the agent decides what to read based on the tree structure, rather than blindly loading pre-extracted chunks. The reasoning loop produces precise, well-sourced answers while minimizing token consumption.


Deployment Options

PageIndex supports multiple deployment strategies depending on your scale, privacy requirements, and operational needs:

Self-Hosted (Open Source)

Run PageIndex locally using the open-source repository. This option provides full control over processing, ideal for development, research, or privacy-sensitive environments:

git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip install -r requirements.txt
python3 run_pageindex.py --pdf_path your_document.pdf

The self-hosted version uses standard PDF parsing and your own LLM API keys. It’s free, fully auditable, and suitable for most individual and team use cases.

Cloud Service (MCP + API)

For production workloads with enhanced capabilities, Vectify AI offers a cloud service featuring:

  • Enhanced OCR for complex PDF layouts, scanned documents, and image-heavy content
  • Improved tree building with advanced structural analysis
  • Optimized retrieval tuned for accuracy and speed

Access the cloud service through the hosted API or the MCP integration offered by Vectify AI. The service handles the heavy lifting of document processing, freeing you to focus on building applications rather than managing infrastructure.

Enterprise Deployment

For organizations requiring private or on-premise deployment, PageIndex offers enterprise-grade solutions. Contact Vectify AI through their contact form or schedule a demo to discuss custom deployment architectures, including dedicated infrastructure, SLA guarantees, and compliance certifications.


Comparison with Traditional RAG Systems

To understand where PageIndex fits in the broader ecosystem, compare it against the dominant approaches in the vector-based RAG world:

| Feature | PageIndex | ChromaDB | FAISS | Pinecone |
| --- | --- | --- | --- | --- |
| Index Type | Hierarchical tree structure | Dense vectors (HNSW) | Binary/scalar vectors (IVF/PQ) | Managed dense vectors |
| Vector DB Required | No | Yes | Yes | Yes (managed) |
| Document Chunking | No — natural sections | Yes — required | Yes — required | Yes — required |
| Retrieval Mechanism | LLM reasoning over tree | Cosine similarity | Approximate NN search | Cosine similarity |
| Explainability | Full traceability | Opaque similarity scores | Opaque | Opaque |
| Context Awareness | Multi-turn aware | Single-query | Single-query | Single-query |
| Human-like Navigation | Simulates expert reading | No | No | No |
| Max Document Scale | Millions (filesystem tree) | Hundreds of thousands | Billions | Hundreds of millions |
| Setup Complexity | Low (Python script) | Medium (DB config) | High (tuning params) | Medium (cloud console) |
| Cost per Query | Tokens for reasoning | Minimal | Minimal | Cloud pricing |
| License | Open source (MIT) | Apache 2.0 | BSD | Commercial |
| Multi-Provider LLM | Via LiteLLM | N/A (embedding-dependent) | N/A | N/A |

Key takeaways from this comparison:

  1. No Infrastructure Overhead: PageIndex eliminates the entire vector database layer — no Docker containers, no managed service subscriptions, no index rebuilds after document updates.

  2. Accuracy Through Reasoning: Where vector systems optimize for embedding-space proximity, PageIndex optimizes for semantic correctness through deliberate reasoning. The FinanceBench results validate this approach.

  3. Scalability Parity: The filesystem-level tree layer enables PageIndex to handle millions of documents with complexity comparable to optimized vector indexes, while maintaining the interpretability advantages of tree search.

  4. Flexibility: LiteLLM integration means you’re not locked into any single LLM provider. Switch between OpenAI, Anthropic, or open-weight models without changing your PageIndex configuration, as the sketch below shows.
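
For example, with LiteLLM the provider is just a prefix on the model string, so swapping backends is a one-line change (the model names below are examples):

from litellm import completion

messages = [{"role": "user", "content": "Which tree nodes cover liquidity risk?"}]

# Identical call shape across providers; only the model string changes.
resp_openai = completion(model="gpt-4o", messages=messages)
resp_claude = completion(model="anthropic/claude-3-5-sonnet-20241022",
                         messages=messages)

print(resp_openai.choices[0].message.content)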


Real-World Use Cases

PageIndex’s reasoning-based approach excels in domains where documents demand careful, structured analysis:

Financial Analysis

PageIndex’s 98.7% FinanceBench accuracy isn’t incidental — it demonstrates why reasoning-based retrieval matters for financial document analysis. SEC filings, 10-K annual reports, earnings call transcripts, and regulatory disclosures contain nuanced relationships between data points spread across hundreds of pages. A question about “material risks related to interest rate sensitivity in the second quarter” requires the system to understand temporal references, cross-reference quarterly data, and distinguish between forward-looking statements and historical facts. Vector similarity alone struggles with this depth of reasoning. PageIndex’s tree traversal naturally captures these relationships.

Legal Document Review

Legal professionals routinely analyze contracts, court opinions, and regulatory documents spanning thousands of pages. The ability to trace a retrieval decision back to a specific clause, section, or paragraph — with page-level precision — is invaluable for legal due diligence, contract review, and precedent research. PageIndex’s explainability feature means lawyers can verify that retrieved passages truly support their legal arguments.

Academic Paper Analysis

Researchers working with arXiv papers, journal articles, and dissertation repositories benefit from PageIndex’s section-aware retrieval. Unlike vector search that might mix methodology sections with literature reviews, PageIndex’s hierarchical index preserves the distinction between abstract, introduction, methods, results, and conclusion — ensuring accurate retrieval for academic queries. The examples/documents directory includes attention mechanism papers demonstrating this capability.

Technical Documentation & Knowledge Bases

Enterprise knowledge bases filled with API documentation, troubleshooting guides, and architectural decisions require retrieval that respects document topology. PageIndex can index entire documentation sets using the filesystem tree layer, allowing users to navigate from broad topic areas down to specific code examples or configuration parameters with the same precision as an experienced developer browsing documentation.


Limitations and Considerations

While PageIndex offers compelling advantages, it’s important to understand its current limitations:

Latency Trade-Off

Tree index generation requires LLM inference — each document must pass through an LLM to build its hierarchical structure. For very large document batches, this upfront cost can exceed the time and expense of vector indexing. However, the index is built once and queried many times, so the amortized cost is favorable for repeatedly accessed documents.
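
A back-of-envelope comparison makes the amortization concrete; every number below is a made-up placeholder, so substitute your own token counts and prices:

# Illustrative only: plug in your own volumes and per-token pricing.
index_tokens = 150_000   # one-time LLM pass to build the tree
query_tokens = 3_000     # per-query reasoning over the tree index
price_per_1k = 0.005     # assumed blended $/1K tokens

def total_cost(n_queries: int) -> float:
    """One-time indexing cost plus per-query reasoning cost."""
    return (index_tokens + n_queries * query_tokens) * price_per_1k / 1000

for n in (1, 10, 100, 1000):
    print(f"{n:>5} queries -> ${total_cost(n):.2f} total, "
          f"${total_cost(n) / n:.3f} per query")

Under these assumed numbers, the per-query cost falls from about $0.77 at one query toward the roughly $0.015 floor set by per-query reasoning tokens, which is why frequently queried documents benefit most.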

Dependency on LLM Quality

Since PageIndex relies on LLM reasoning throughout its pipeline, the quality of responses depends on the underlying model. While LiteLLM integration allows switching between models (including local/open-weight alternatives), weaker models may produce less accurate tree structures or poorer reasoning during retrieval.

Image and Complex Layout Handling

The self-hosted version uses standard PDF parsing, which works well for text-heavy documents but may struggle with highly formatted PDFs containing complex tables, charts, or mixed media. For such cases, the cloud service’s enhanced OCR pipeline is recommended.

Dependence on Document Structure

PageIndex is best suited for structured, professional documents where section boundaries are meaningful. For ad-hoc text corpora without clear hierarchical structure, vector-based approaches may still offer practical advantages. The two paradigms can complement each other in hybrid architectures.

Emerging Technology

PageIndex is actively evolving with 283+ commits and rapid community adoption. While the core features are mature, edge cases and novel document types may surface unanticipated challenges. Teams adopting PageIndex should monitor the release notes and participate in the community for the latest developments.


Conclusion

PageIndex represents a fundamental rethink of how we retrieve information from documents. By replacing vector embeddings with hierarchical tree indexing and approximate nearest-neighbor search with deliberate LLM reasoning, it achieves results that challenge the assumptions underlying decades of IR research. The 98.7% FinanceBench accuracy, the human-like navigation patterns, and the full traceability of retrieval decisions demonstrate that reasoning-based retrieval is not just a theoretical alternative — it’s a practical, high-performance solution for real-world document intelligence.

As the AI industry matures, tools like PageIndex remind us that better retrieval doesn’t always mean more complex models or larger vector indexes. Sometimes, the most powerful advancement is a simpler idea executed brilliantly: build a map of your document, then reason your way through it just like a human would. With its MIT license, growing community of over 30,000 stars, and multi-LLM flexibility through LiteLLM integration, PageIndex is positioned to reshape how organizations think about document search, knowledge management, and RAG-based AI applications.

Whether you’re building a financial analysis platform, a legal research tool, an academic search engine, or simply want to stop fighting with chunk-size hyperparameters in your next RAG project, PageIndex offers a refreshingly principled alternative that puts reasoning above approximation.



Last updated: 2026-05-09. PageIndex is actively developed by Vectify AI; check the official repository for the latest features, releases, and community contributions.