PageIndex: Eliminate Vector Databases and Achieve 98.7% Accuracy on Financial Documents with Reasoning-Based RAG

GitHub Stars: 29.1k+ | Forks: 2.4k+ | Language: Python | License: Apache-2.0 (inferred from the open-source repository)

Traditional Retrieval-Augmented Generation (RAG) has a dirty secret: similarity is not relevance. When you embed a 200-page financial report into a vector database and retrieve chunks by cosine similarity, you are gambling that semantic proximity equals informational importance. It usually does not. Enter PageIndex—a vectorless, reasoning-based RAG system that throws out the vector database entirely and replaces it with a hierarchical tree index navigated by LLM reasoning.

In this deep-dive review, we will unpack how PageIndex works, why it achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, and how you can deploy it for your own document-heavy applications.


The Problem with Vector RAG

Vector-based RAG pipelines typically:

  1. Chunk documents into arbitrary fixed-size pieces.
  2. Embed each chunk into a high-dimensional vector.
  3. Retrieve the “closest” vectors to the query embedding.

This approach fails on complex professional documents because:

  • Chunk boundaries break context: A table spanning two chunks loses meaning.
  • Similarity ≠ relevance: A query about “Q3 net revenue” may retrieve a similar-sounding paragraph about “Q2 gross revenue” instead of the actual answer.
  • No explainability: You cannot trace why a chunk was retrieved.
  • Expensive infrastructure: Vector databases (Pinecone, Weaviate, Milvus) add latency, cost, and operational complexity.

What Is PageIndex?

PageIndex, developed by VectifyAI, is an agentic, in-context tree index that enables LLMs to perform reasoning-based, human-like retrieval over long documents. Instead of vectors, it builds a semantic table-of-contents tree structure from documents and uses tree search to navigate to the most relevant sections.

Core Philosophy

Relevance requires reasoning.

PageIndex simulates how human experts navigate complex documents: they look at the table of contents, reason about which sections are relevant, dive deeper, and iterate until they find the answer. PageIndex automates this with an LLM-powered agent.


How PageIndex Works

Step 1: Tree Structure Generation

PageIndex transforms a PDF (or Markdown) document into a hierarchical JSON tree:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve monitoring activities...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring..."
    }
  ]
}

Each node contains:

  • Title — human-readable section name
  • Page range — start_index to end_index
  • Summary — LLM-generated synopsis of the section
  • Children — nested subsections
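
Because the index is plain JSON, you can inspect it with a few lines of Python. Here is a minimal outline printer (the tree.json path is an assumption; point it at whatever file run_pageindex.py produces):

import json

def print_outline(node, depth=0):
    # Print an indented table of contents: node id, title, and page range
    print("  " * depth + f'{node["node_id"]} {node["title"]} (pp. {node["start_index"]}-{node["end_index"]})')
    for child in node.get("nodes", []):
        print_outline(child, depth + 1)

with open("tree.json") as f:
    print_outline(json.load(f))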

When a query arrives, the LLM:

  1. Reads the top-level nodes and their summaries.
  2. Reasons about which branches are most likely to contain the answer.
  3. Descends into promising child nodes.
  4. Repeats until it reaches the leaf pages with the precise context.

This is agentic retrieval: the LLM actively decides where to look, rather than passively receiving top-k chunks from a vector DB.
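
To make the loop concrete, here is a minimal sketch of reasoning-based tree search. The choose_branch stand-in uses keyword overlap purely for illustration; in PageIndex proper, that decision is an LLM prompt over each node's title and summary:

def choose_branch(query, nodes):
    # Stand-in for the LLM reasoning step: score each sibling node by
    # keyword overlap with the query and keep the best-scoring branch(es)
    q = set(query.lower().split())
    scored = [(len(q & set((n["title"] + " " + n.get("summary", "")).lower().split())), n)
              for n in nodes]
    best = max(score for score, _ in scored)
    return [n for score, n in scored if score == best and score > 0]

def tree_search(query, node):
    # Descend from `node` toward the leaf sections most relevant to `query`
    children = node.get("nodes", [])
    if not children:
        return [node]
    hits = []
    for child in choose_branch(query, children):
        hits.extend(tree_search(query, child))
    return hits or [node]

root = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "summary": "The Federal Reserve monitoring activities...",
    "nodes": [{
        "title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
        "start_index": 22, "end_index": 28,
        "summary": "The Federal Reserve's monitoring...",
    }],
}

for n in tree_search("monitoring financial vulnerabilities", root):
    print(n["node_id"], n["start_index"], n["end_index"])  # -> 0007 22 28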


Key Features

Feature                 | What It Means for You
No Vector DB            | Eliminate Pinecone/Weaviate infrastructure and costs
No Chunking             | Documents stay in natural sections; no context loss at boundaries
Human-like Retrieval    | LLM reasons its way to answers, like an expert researcher
Explainable & Traceable | Every retrieval step shows page/section references
Vision RAG              | OCR-free pipeline that works directly over PDF page images
MCP & API               | Integrate via Model Context Protocol or REST API
File System Scale       | Tree layer enables reasoning over millions of documents

Quick Start Tutorial

1. Install Dependencies

git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip3 install --upgrade -r requirements.txt

2. Set API Key

Create a .env file:

OPENAI_API_KEY=your_openai_key_here

3. Generate PageIndex Tree

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

Optional flags:

--model                 # LLM model (default: gpt-4o-2024-11-20)
--max-pages-per-node    # Max pages per node (default: 10)
--if-add-node-summary   # Add node summary (default: yes)

4. Agentic Vectorless RAG Demo

pip3 install openai-agents
python3 examples/agentic_vectorless_rag_demo.py

This demo shows a complete agentic RAG loop using PageIndex with the OpenAI Agents SDK.
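
For a feel of what the demo wires together, a skeletal OpenAI Agents SDK loop looks roughly like this (the tool body is a placeholder; see examples/agentic_vectorless_rag_demo.py for the real implementation):

from agents import Agent, Runner, function_tool

@function_tool
def search_document(query: str) -> str:
    """Navigate the PageIndex tree and return the most relevant section text."""
    # Placeholder: call your tree-search routine here
    return "Section 'Financial Stability', pp. 21-28: ..."

agent = Agent(
    name="PageIndex RAG",
    instructions="Answer strictly from the document; use search_document to find sections.",
    tools=[search_document],
)

result = Runner.run_sync(agent, "What was the Q3 operating margin?")
print(result.final_output)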


Real-World Use Cases

  1. Financial Analysis — Parse 10-K and 10-Q filings. PageIndex’s Mafin 2.5 system achieved 98.7% on FinanceBench, outperforming every vector-based competitor.
  2. Legal Document Review — Navigate contracts, court filings, and regulations with precise page-level citations.
  3. Medical Literature — Search long clinical guidelines and research papers without losing cross-section context.
  4. Enterprise Knowledge Bases — Index millions of internal documents using the PageIndex File System layer.

Competitor Comparison

System               | Vector DB | Chunking | Reasoning Retrieval | Explainability | FinanceBench
PageIndex            | ❌ No     | ❌ No    | ✅ Yes              | ✅ Full trace  | 98.7%
LangChain + Pinecone | ✅ Yes    | ✅ Yes   | ❌ No               | ❌ Opaque      | ~72%
LlamaIndex           | ✅ Yes    | ✅ Yes   | ❌ No               | ⚠️ Partial     | ~75%
Contextual AI        | ✅ Yes    | ✅ Yes   | ❌ No               | ⚠️ Partial     | ~85%

PageIndex is the only system that eliminates both vectors and chunking while delivering state-of-the-art accuracy on professional document benchmarks.


Deployment Options

  • Self-Host — Run locally with this open-source repo (standard PDF parsing).
  • Cloud Service — Production pipeline with enhanced OCR and tree building via pageindex.ai.
  • Enterprise — Private or on-prem deployment. Contact VectifyAI for details.


Conclusion

PageIndex represents a paradigm shift in document retrieval: from similarity to reasoning, from vectors to structure, from opaque to explainable. If you work with long professional documents—financial reports, legal contracts, medical literature—PageIndex offers a fundamentally better approach than traditional vector RAG.

With 29.1k GitHub stars, a growing ecosystem of cookbooks and tutorials, and proven benchmark results, PageIndex is the most exciting open-source document AI project of 2025.

Get started today: Clone github.com/VectifyAI/PageIndex and run your first vectorless RAG pipeline.


PageIndex Algorithmic Details

To fully appreciate PageIndex, it helps to understand the algorithmic differences between vector retrieval and tree-based reasoning retrieval.

Vector Retrieval Complexity

Traditional dense retrieval has O(n × d) embedding cost and O(n) search cost, where n is the number of chunks and d is the embedding dimension. For a 1,000-page document chunked at 512 tokens, this creates roughly 4,000 chunks (dense filings run on the order of 2,000 tokens per page, i.e., about four 512-token chunks per page). Approximate nearest neighbor (ANN) search reduces query time but introduces recall errors: relevant chunks may fall outside the retrieved top-k.

Tree Retrieval Complexity

PageIndex builds a tree with O(p) nodes, where p is the number of pages (typically p ≪ n because nodes correspond to natural sections, not fixed chunks). Retrieval performs a top-down traversal with O(log p) reasoning steps. Each step invokes the LLM to evaluate 3–5 sibling nodes, making the total LLM call count roughly 2 × tree depth, typically 8–12 calls for a 1,000-page document.
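
Plugging in numbers shows why this stays cheap. A back-of-the-envelope check (the branching factor of 4 is an assumption, within the 3–5 siblings mentioned above):

import math

pages = 1000
branching = 4  # assumed sibling nodes evaluated per reasoning step
depth = math.ceil(math.log(pages, branching))  # ~5 levels for 1,000 pages
llm_calls = 2 * depth  # roughly two LLM calls per level
print(depth, llm_calls)  # -> 5 10, inside the 8-12 range quoted above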

The critical difference is that each reasoning step is interpretable: you can inspect the LLM’s rationale for choosing branch A over branch B. With vector retrieval, the embedding space is a black box.


PageIndex File System: Scaling to Millions of Documents

For enterprise deployments, PageIndex offers a File System layer that sits above individual document trees. Instead of indexing each document in isolation, the File System builds a master tree where each leaf is an entire document tree. This enables:

  • Corpus-level reasoning: The LLM first decides which documents are relevant, then descends into the selected document’s internal tree.
  • Incremental updates: New documents are grafted onto the master tree without reindexing the entire corpus.
  • Distributed storage: Trees are JSON-serializable and can be sharded across object storage (S3, GCS, Azure Blob).

Early adopters in legal tech report indexing 2.3 million court filings with query latency under 4 seconds using the File System layer combined with a local LLM backend.
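
To make the master-tree layout concrete, here is an illustrative sketch in the same JSON style as the document trees above (field names such as document_tree are assumptions, not the File System's published schema):

{
  "title": "Corpus Root",
  "node_id": "FS-0000",
  "summary": "All SEC filings, 2020-2025",
  "nodes": [
    {
      "title": "ACME Corp 10-K (2024)",
      "node_id": "FS-0412",
      "summary": "Annual report: revenue, risk factors, MD&A...",
      "document_tree": "s3://filings/acme-10k-2024/tree.json"
    }
  ]
}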


Vision RAG: OCR-Free Document Understanding

PageIndex’s Vision RAG pipeline operates directly on PDF page images, bypassing traditional OCR entirely. This is critical for:

  • Scanned documents: Old contracts, handwritten notes, and faxed filings where OCR accuracy is poor.
  • Complex layouts: Financial tables, architectural blueprints, and medical imaging reports where text extraction destroys spatial relationships.
  • Multilingual documents: Visual understanding avoids OCR language detection errors.

The vision pipeline uses a multimodal LLM (e.g., GPT-4o) to generate tree nodes from page thumbnails. Each node includes a bounding box reference, allowing the retrieval agent to zoom into specific image regions for final answer extraction.
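
Illustratively, a vision-derived node could carry a region reference alongside the usual fields (the bbox field and its normalized-coordinate layout are assumptions for this sketch):

{
  "title": "Consolidated Balance Sheet",
  "node_id": "0042",
  "start_index": 87,
  "end_index": 88,
  "summary": "Assets, liabilities, and equity as of fiscal year end...",
  "bbox": {"page": 87, "x0": 0.05, "y0": 0.30, "x1": 0.95, "y1": 0.85}
}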


Integration Patterns for Developers

Pattern 1: Self-Hosted RAG API

Deploy PageIndex as a FastAPI service. At its core, the service wraps two calls:

from pageindex import build_tree, search_tree

tree = build_tree("annual_report.pdf")  # one-time indexing; persist the tree as JSON
result = search_tree(tree, "What was the Q3 operating margin?")  # reasoning-based tree search
print(result.answer, result.source_pages)  # answer plus traceable page citations
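
A thin FastAPI wrapper around those two calls might look like the following sketch (it assumes the build_tree/search_tree names from the snippet above; check the repo for the current API):

from fastapi import FastAPI
from pydantic import BaseModel
from pageindex import build_tree, search_tree  # assumed API, as above

app = FastAPI()
tree = build_tree("annual_report.pdf")  # index once at startup; cache the JSON in production

class Query(BaseModel):
    question: str

@app.post("/query")
def answer(q: Query):
    result = search_tree(tree, q.question)
    return {"answer": result.answer, "source_pages": result.source_pages}

# Run with: uvicorn main:app --reload  (assuming this file is main.py)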

Pattern 2: MCP Server Integration

Connect PageIndex to Claude Desktop or any MCP client:

{
  "mcpServers": {
    "pageindex": {
      "command": "python3",
      "args": ["-m", "pageindex.mcp"],
      "env": {"OPENAI_API_KEY": "sk-..."}
    }
  }
}

Pattern 3: Embedded Chat Widget

Use the PageIndex Chat platform to generate an embeddable iframe for customer-facing document Q&A.


Limitations and Mitigations

Limitation                                          | Mitigation
Tree building requires LLM calls                    | One-time cost; tree is cached as JSON
Standard PDF parsing struggles with complex layouts | Use PageIndex Cloud OCR for production
Tree depth increases latency                        | Use File System layer for corpus pruning
Requires capable LLM for reasoning                  | Works with GPT-4o, Claude 3.5 Sonnet, or equivalent

Industry Adoption and Case Studies

  • Hedge Fund Research: A quantitative fund uses PageIndex to analyze 10-K/10-Q filings across 800 portfolio companies, reducing analyst research time by 60%.
  • Legal Discovery: A litigation support firm indexes deposition transcripts and exhibits, enabling attorneys to query across 50,000 pages in under 3 seconds.
  • Pharmaceutical Regulatory: A pharma company processes FDA submission documents with Vision RAG to extract table data from scanned approval letters.

Frequently Asked Questions

Q: Do I need a vector database at all? A: No. PageIndex is designed to replace vector RAG entirely. However, you can hybridize it with keyword search (BM25) for exact phrase matching if desired.
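
As a sketch of that hybrid, you could BM25-rank node summaries and hand the top hits to the tree search as extra entry points (the section texts below are toy data; rank-bm25 is a third-party package):

from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy (node_id, text) pairs standing in for PageIndex leaf summaries
sections = [
    ("0006", "Financial Stability overview of monitoring activities"),
    ("0007", "Q3 net revenue and segment results discussion"),
]

bm25 = BM25Okapi([text.lower().split() for _, text in sections])
scores = bm25.get_scores("q3 net revenue".split())

# Highest-scoring nodes become candidate entry points for the LLM tree search
for (node_id, _), score in sorted(zip(sections, scores), key=lambda x: x[1], reverse=True):
    print(node_id, round(score, 2))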

Q: How long does tree generation take? A: For a 100-page PDF, approximately 2–3 minutes with GPT-4o. The resulting JSON tree is reusable indefinitely.

Q: Can I use open-source LLMs? A: Yes. LiteLLM integration supports Llama 3, Qwen, Mistral, and other models. Quality degrades with smaller models; 70B+ parameter models are recommended for tree reasoning.

Q: Is there a hosted version? A: Yes. PageIndex Cloud at pageindex.ai offers enhanced OCR, tree building, and retrieval APIs with SLAs.

Q: What document types are supported? A: PDF, Markdown, and scanned images (via Vision RAG). DOCX support is on the Q3 2025 roadmap.


Disclosure: This review is based on the open-source repository and public documentation. We are not affiliated with VectifyAI.