# PageIndex: Vectorless RAG System — Document Retrieval Without a Vector Database

PageIndex is a vectorless, reasoning-driven RAG system open-sourced by VectifyAI (29K+ GitHub stars). It achieves human-like retrieval by building document tree structures, reaching 98.7% accuracy on FinanceBench.
## What is PageIndex?

PageIndex is an open-source RAG (Retrieval-Augmented Generation) system developed by VectifyAI that rethinks traditional document retrieval. Instead of relying on a vector database, PageIndex takes a reasoning-driven approach, achieving human-like retrieval by constructing hierarchical tree structures of documents.
- 🌲 Tree Structure Index — Organizes documents like a table of contents
- 🧠 Reasoning-Driven Retrieval — LLM reasoning instead of vector similarity
- ❌ No Vector Database Required — Eliminates expensive vector storage costs
- ❌ No Chunking Needed — Preserves natural document structure
- 📊 98.7% Accuracy — SOTA on FinanceBench benchmark
GitHub: https://github.com/VectifyAI/PageIndex
Stars: 29,202+ | Language: Python | License: Apache-2.0
## Why Traditional RAG Isn’t Good Enough

### Problems with Traditional Vector RAG
| Problem | Explanation |
|---|---|
| Similarity ≠ Relevance | Vector search finds semantically similar content, which is not necessarily the content that actually answers the question |
| Chunking Destroys Structure | Forced chunking cuts across a document's logical structure |
| Black-Box Retrieval | Vector search is unexplainable; there is no way to trace why a given result was returned |
| High Costs | Maintaining a vector database incurs significant storage and computation costs |
| Poor Performance on Long Documents | Retrieval accuracy is low for long professional documents (financial reports, legal filings) |
### PageIndex’s Solution

PageIndex mimics how human experts read documents:

1. First scan the table of contents structure (the tree index)
2. Reason about which chapters should contain the answer, based on the question
3. Dive deep into the relevant chapters
## Core Technical Principles

### 1. Document Tree Structure Generation

PageIndex converts PDFs into hierarchical tree structures, where each node carries a title, a page range (`start_index`/`end_index`), and a summary:
```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve monitors financial vulnerabilities...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31
    }
  ]
}
```
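Because the node schema above is plain JSON, it can be loaded and traversed with a few lines of Python. The `walk` helper below is illustrative, not part of the PageIndex API:

```python
import json

# The example tree from above, as produced by the indexing step.
doc = json.loads("""
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve monitors financial vulnerabilities...",
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
     "start_index": 22, "end_index": 28},
    {"title": "Domestic and International Cooperation", "node_id": "0008",
     "start_index": 28, "end_index": 31}
  ]
}
""")

def walk(node, depth=0):
    """Yield (depth, node_id, title, start_page, end_page) for every node."""
    yield depth, node["node_id"], node["title"], node["start_index"], node["end_index"]
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)

for depth, node_id, title, start, end in walk(doc):
    print("  " * depth + f"[{node_id}] {title} (pages {start}-{end})")
```

Printing the tree this way reproduces exactly the "table of contents" view a human reader would scan first.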
### 2. Reasoning-Driven Tree Search

When a user asks a question, the LLM will:

1. Understand the question — analyze the query intent
2. Traverse the tree structure — reason about which nodes likely contain the answer
3. Deep-dive into relevant nodes — search the candidate nodes for the specific information
4. Return results — with citation sources (page numbers, chapters)
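These steps can be sketched as a recursive descent over the tree, with a toy keyword-overlap scorer standing in for the LLM's relevance reasoning. Names like `score_relevance` and `tree_search` are illustrative, not the PageIndex API:

```python
def score_relevance(question, node):
    """Stand-in for the LLM relevance judgment: crude keyword overlap."""
    text = (node["title"] + " " + node.get("summary", "")).lower()
    return sum(1 for word in question.lower().split() if word in text)

def tree_search(question, node, threshold=1):
    """Descend only into children that look relevant; leaves become
    candidate answers carrying their page range as the citation source."""
    children = node.get("nodes", [])
    if not children:
        return [(node["node_id"], node["title"], node["start_index"], node["end_index"])]
    hits = []
    for child in children:
        if score_relevance(question, child) >= threshold:
            hits.extend(tree_search(question, child, threshold))
    return hits

# Demo on the example tree from the previous section.
tree = {
    "title": "Financial Stability", "node_id": "0006", "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation", "node_id": "0008",
         "start_index": 28, "end_index": 31},
    ],
}
hits = tree_search("Which vulnerabilities are being monitored?", tree)
```

Here only the "Monitoring Financial Vulnerabilities" branch clears the relevance threshold, so the result is a single candidate node whose page range (22–28) doubles as the citation.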
### 3. Monte Carlo Tree Search, Inspired by AlphaGo

PageIndex draws inspiration from AlphaGo's use of tree search algorithms:

1. Selection — choose the most promising nodes
2. Expansion — expand their child nodes
3. Evaluation — the LLM evaluates node relevance
4. Backpropagation — update node weights
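A toy rendering of the four phases over a document tree is below. This is a generic MCTS sketch with a keyword-overlap stub in place of the LLM evaluator, not the actual PageIndex implementation:

```python
import math

class SearchNode:
    """Wraps a document-tree node with MCTS statistics."""
    def __init__(self, doc_node, parent=None):
        self.doc_node = doc_node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated relevance reward

    def ucb(self, c=1.4):
        # Upper confidence bound: trade off exploitation vs. exploration.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def evaluate(question, doc_node):
    """Stub for the LLM relevance judgment: keyword overlap in [0, 1]."""
    words = question.lower().split()
    title = doc_node["title"].lower()
    return sum(w in title for w in words) / max(len(words), 1)

def mcts(question, root_doc, iterations=30):
    root = SearchNode(root_doc)
    for _ in range(iterations):
        node = root
        # 1. Selection: follow the highest-UCB child down to a leaf.
        while node.children:
            node = max(node.children, key=SearchNode.ucb)
        # 2. Expansion: add the document node's children to the search tree.
        for child_doc in node.doc_node.get("nodes", []):
            node.children.append(SearchNode(child_doc, parent=node))
        # 3. Evaluation: the (stubbed) LLM scores the node's relevance.
        reward = evaluate(question, node.doc_node)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited child of the root is the most promising section.
    return max(root.children, key=lambda n: n.visits).doc_node if root.children else root_doc

tree = {"title": "Financial Stability", "node_id": "0006",
        "nodes": [{"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"},
                  {"title": "Domestic and International Cooperation", "node_id": "0008"}]}
best = mcts("monitoring financial vulnerabilities", tree)
```

On this sample tree the search concentrates its visits on the "Monitoring Financial Vulnerabilities" node, since it alone earns a nonzero reward from the evaluator.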
## Quick Start

### Installation

```bash
pip install pageindex
```
### Basic Usage

```python
from pageindex import PageIndex

# Initialize
pi = PageIndex()

# Load a PDF and build its tree index
pi.load_pdf("financial_report.pdf")

# Query with reasoning-driven retrieval
result = pi.query("What are the main risks mentioned in Q3?")
print(result.answer)
print(result.sources)  # Citation sources (page numbers, chapters)
```
### Advanced Configuration

```python
# Custom LLM settings
pi = PageIndex(
    llm="gpt-4",
    temperature=0.1,
    max_depth=5,  # Tree search depth
)

# Batch processing
pi.load_pdfs(["report1.pdf", "report2.pdf", "report3.pdf"])
results = pi.batch_query([
    "What is the revenue growth?",
    "What are the risk factors?",
    "What is the cash flow situation?",
])
```
## Performance Benchmarks

### FinanceBench Test Results
| Model | Accuracy | Notes |
|---|---|---|
| PageIndex + GPT-4 | 98.7% | SOTA |
| PageIndex + Claude-3 | 97.2% | Excellent |
| Traditional RAG + GPT-4 | 82.1% | Baseline |
| Traditional RAG + Claude-3 | 79.5% | Baseline |
### Comparison with Vector RAG
| Metric | PageIndex | Traditional Vector RAG |
|---|---|---|
| Indexing Speed | 3x faster (no embedding step) | Slower; requires embedding computation |
| Storage Cost | 90% reduction | Vector storage is expensive |
| Retrieval Accuracy | 98.7% | ~80% |
| Explainability | ✅ Citation sources | ❌ Black box |
| Long Document Support | ✅ Native | ❌ Requires chunking |
## Use Cases

### Financial Analysis
- Annual report analysis
- Prospectus review
- Risk assessment
- Compliance checking
### Legal Document Review
- Contract analysis
- Case law research
- Regulatory document review
- Due diligence
### Medical Literature
- Clinical trial reports
- Drug instructions
- Medical guidelines
- Research paper review
### Technical Documentation
- API documentation
- System architecture docs
- Operation manuals
- Technical specifications
## Architecture Design

```text
┌─────────────────────────────────────────┐
│               User Query                │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│        Query Understanding (LLM)        │
│  - Intent analysis                      │
│  - Keyword extraction                   │
│  - Question classification              │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│        Tree Structure Traversal         │
│  - Node relevance scoring               │
│  - Pruning optimization                 │
│  - Multi-path exploration               │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│            Content Retrieval            │
│  - Precise positioning                  │
│  - Context expansion                    │
│  - Source marking                       │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│         Answer Generation (LLM)         │
│  - Information synthesis                │
│  - Answer structuring                   │
│  - Citation addition                    │
└─────────────────────────────────────────┘
```
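The processing stages in the diagram can be chained into a miniature end-to-end pipeline. Everything below is illustrative — function names are hypothetical and both LLM stages are stubbed — not the PageIndex API:

```python
def understand_query(query):
    """Query Understanding: intent analysis reduced to keyword extraction."""
    return [w.strip("?.,").lower() for w in query.split()]

def traverse_tree(keywords, node, hits=None):
    """Tree Structure Traversal: keep nodes whose titles mention a keyword."""
    if hits is None:
        hits = []
    if any(k in node["title"].lower() for k in keywords):
        hits.append(node)
    for child in node.get("nodes", []):
        traverse_tree(keywords, child, hits)
    return hits

def retrieve_content(hits, pages):
    """Content Retrieval: fetch the page span each selected node points at."""
    return [(n["node_id"], " ".join(pages[n["start_index"]:n["end_index"]]))
            for n in hits]

def generate_answer(query, passages):
    """Answer Generation: stubbed synthesis that attaches citation sources."""
    return {"answer": f"(synthesized from {len(passages)} passage(s))",
            "sources": [node_id for node_id, _ in passages]}

# Demo with a tiny tree and fake page texts.
tree = {"title": "Financial Stability", "node_id": "0006",
        "start_index": 0, "end_index": 1,
        "nodes": [{"title": "Monitoring Financial Vulnerabilities",
                   "node_id": "0007", "start_index": 1, "end_index": 2}]}
pages = ["Overview of financial stability.", "Vulnerability monitoring details."]

query = "What about vulnerabilities?"
result = generate_answer(query, retrieve_content(
    traverse_tree(understand_query(query), tree), pages))
```

Every answer carries the `node_id`s (and hence page ranges) it was built from, which is what makes the retrieval path explainable end to end.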
## Community & Ecosystem
- GitHub Stars: 29,202+
- Contributors: 50+
- Release Cycle: Weekly updates
- Community: Active Discord channel
## Related Projects
- VectifyAI: Commercial version providing enterprise-grade support
- PageIndex Hub: Community-contributed document templates
- PageIndex CLI: Command-line tool for batch processing
## Summary
PageIndex represents a paradigm shift in document retrieval:
- No vector database required, dramatically reducing costs
- Reasoning-driven retrieval, more aligned with human thinking
- Explainable results, every answer has traceable sources
- High accuracy, reaching 98.7% on professional benchmarks
For workloads that involve processing large volumes of professional documents (finance, law, medicine), PageIndex is an option worth prioritizing.