
PageIndex: Vectorless RAG System for Document Retrieval Without a Vector Database

PageIndex is a vectorless, reasoning-driven RAG system open-sourced by VectifyAI. With more than 29K GitHub stars, it achieves human-like retrieval through document tree structures and reaches 98.7% accuracy on FinanceBench.

Language: Python
Application domain: LLM frameworks


What is PageIndex? #

PageIndex is an open-source RAG (Retrieval-Augmented Generation) system developed by VectifyAI that fundamentally changes traditional document retrieval. Unlike conventional vector databases, PageIndex uses a reasoning-driven approach, achieving human-like retrieval by constructing hierarchical tree structures of documents.

  • 🌲 Tree Structure Index — Organizes documents like a table of contents
  • 🧠 Reasoning-Driven Retrieval — LLM reasoning instead of vector similarity
  • No Vector Database Required — Eliminates expensive vector storage costs
  • No Chunking Needed — Preserves natural document structure
  • 📊 98.7% Accuracy — SOTA on FinanceBench benchmark

GitHub: https://github.com/VectifyAI/PageIndex
Stars: 29,202+ | Language: Python | License: Apache-2.0


Why Traditional RAG Isn’t Good Enough #

Problems with Traditional Vector RAG #

| Problem | Explanation |
| --- | --- |
| Similarity ≠ Relevance | Vector search finds semantically similar content, which is not necessarily truly relevant. |
| Chunking Destroys Structure | Forced chunking cuts across a document's logical structure. |
| Black Box Retrieval | Vector search is unexplainable; there is no way to trace why a given result was returned. |
| High Costs | Maintaining a vector database requires expensive storage and computation. |
| Poor Performance on Long Documents | Retrieval accuracy is low for long professional documents (financial reports, legal filings). |

PageIndex’s Solution #

PageIndex mimics how human experts read documents:

  1. First look at the table of contents structure (tree index)
  2. Reason which chapters should contain the answer based on the question
  3. Deep dive into relevant chapters

Core Technical Principles #

1. Document Tree Structure Generation #

PageIndex converts PDFs into hierarchical tree structures:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve monitors financial vulnerabilities...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31
    }
  ]
}
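
The node layout above can be mirrored with a small recursive data class. The following is a minimal sketch for working with such a tree in plain Python; the `TreeNode` class and its methods are illustrative, not part of the PageIndex API:

```python
import json
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One node of the document tree, mirroring the JSON layout above."""
    title: str
    node_id: str
    start_index: int
    end_index: int
    summary: str = ""
    nodes: list["TreeNode"] = field(default_factory=list)

    @classmethod
    def from_dict(cls, d: dict) -> "TreeNode":
        # Recursively build children from the nested "nodes" array.
        return cls(
            title=d["title"],
            node_id=d["node_id"],
            start_index=d["start_index"],
            end_index=d["end_index"],
            summary=d.get("summary", ""),
            nodes=[cls.from_dict(c) for c in d.get("nodes", [])],
        )

raw = """
{"title": "Financial Stability", "node_id": "0006",
 "start_index": 21, "end_index": 22,
 "nodes": [{"title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007", "start_index": 22, "end_index": 28}]}
"""
root = TreeNode.from_dict(json.loads(raw))
print(root.nodes[0].title)  # Monitoring Financial Vulnerabilities
```

Keeping `start_index`/`end_index` as page ranges (rather than storing text in the node) is what lets the index stay small: the tree is a table of contents, not a copy of the document.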

2. Reasoning-Driven Retrieval #

When a user asks a question, the LLM will:

  1. Understand the Question — Analyze query intent
  2. Traverse Tree Structure — Reason which nodes likely contain the answer
  3. Deep Dive into Relevant Nodes — Search for specific information in candidate nodes
  4. Return Results — With citation sources (page numbers, chapters)
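
These four steps can be sketched as a recursive traversal in which an LLM call scores each node. In the sketch below, `llm_score` is a simple keyword-overlap stand-in for the real model call, and all names are illustrative rather than PageIndex's actual API:

```python
def llm_score(question: str, node: dict) -> int:
    # Stand-in for an LLM relevance judgment: count question words that
    # appear in the node's title or summary.
    text = (node.get("title", "") + " " + node.get("summary", "")).lower()
    return sum(word in text for word in question.lower().split())

def retrieve(question: str, node: dict) -> list[dict]:
    # Steps 2-3: reason over children, descend into the best-scoring branch.
    children = node.get("nodes", [])
    if not children:
        # Leaf node: return it with its page range as a citation source.
        return [{"node_id": node["node_id"],
                 "pages": (node["start_index"], node["end_index"])}]
    scored = [(llm_score(question, c), c) for c in children]
    best = max(score for score, _ in scored)
    if best == 0:
        return []  # no child looks relevant
    hits = []
    for score, child in scored:
        if score == best:
            hits.extend(retrieve(question, child))
    return hits

tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 31,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}
print(retrieve("Which vulnerabilities are being monitored?", tree))
# → [{'node_id': '0007', 'pages': (22, 28)}]
```

Because every hit carries a node ID and a page range, the final answer can cite exactly where it came from, which is the explainability property the vector-search approach lacks.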

3. Monte Carlo Tree Search Inspired by AlphaGo #

PageIndex draws inspiration from AlphaGo, using tree search algorithms:

  • Selection — Choose the most promising nodes
  • Expansion — Expand child nodes
  • Evaluation — LLM evaluates node relevance
  • Backpropagation — Update node weights
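
A toy version of that loop, using the classic UCB selection rule and a noisy stand-in for the LLM evaluation step, might look like this (the tree, scores, and constants are all illustrative, not taken from the PageIndex codebase):

```python
import math
import random

random.seed(0)

# Toy document tree: node -> children. Leaves carry a hidden relevance
# score that a real system would obtain from an LLM judging the section.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": []}
relevance = {"A1": 0.9, "A2": 0.2, "B": 0.1}

visits = {n: 0 for n in ["root", "A", "B", "A1", "A2"]}
value = {n: 0.0 for n in visits}

def ucb(node: str, parent: str) -> float:
    # Selection rule: exploit high-value nodes, explore rarely visited ones.
    if visits[node] == 0:
        return float("inf")
    return value[node] / visits[node] + 1.4 * math.sqrt(
        math.log(visits[parent]) / visits[node])

for _ in range(200):
    # Selection/expansion: walk down by UCB until reaching a leaf.
    path = ["root"]
    while children.get(path[-1]):
        parent = path[-1]
        path.append(max(children[parent], key=lambda c: ucb(c, parent)))
    # Evaluation: noisy relevance score (stand-in for an LLM judgment).
    reward = relevance[path[-1]] + random.uniform(-0.05, 0.05)
    # Backpropagation: update visit counts and value along the path.
    for n in path:
        visits[n] += 1
        value[n] += reward

# The most-visited leaf should be the most relevant section.
print(max(relevance, key=lambda n: visits[n]))
```

The payoff is that search effort concentrates on the branches an evaluator rates highly, without exhaustively reading every section, which is exactly what makes the approach tractable for long documents.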

Quick Start #

Installation #

pip install pageindex

Basic Usage #

from pageindex import PageIndex

# Initialize
pi = PageIndex()

# Load PDF
pi.load_pdf("financial_report.pdf")

# Query
result = pi.query("What are the main risks mentioned in Q3?")
print(result.answer)
print(result.sources)  # Citation sources

Advanced Configuration #

# Custom LLM
pi = PageIndex(
    llm="gpt-4",
    temperature=0.1,
    max_depth=5  # Tree search depth
)

# Batch processing
pi.load_pdfs(["report1.pdf", "report2.pdf", "report3.pdf"])
results = pi.batch_query([
    "What is the revenue growth?",
    "What are the risk factors?",
    "What is the cash flow situation?"
])

Performance Benchmarks #

FinanceBench Test Results #

| Model | Accuracy | Notes |
| --- | --- | --- |
| PageIndex + GPT-4 | 98.7% | SOTA |
| PageIndex + Claude-3 | 97.2% | Excellent |
| Traditional RAG + GPT-4 | 82.1% | Baseline |
| Traditional RAG + Claude-3 | 79.5% | Baseline |

Comparison with Vector RAG #

| Metric | PageIndex | Traditional Vector RAG |
| --- | --- | --- |
| Indexing Speed | 3x faster | Requires embedding computation |
| Storage Cost | 90% reduction | Vector storage is expensive |
| Retrieval Accuracy | 98.7% | ~80% |
| Explainability | ✅ Citation sources | ❌ Black box |
| Long Document Support | ✅ Native | ❌ Requires chunking |

Use Cases #

Financial Analysis #

  • Annual report analysis
  • Prospectus review
  • Risk assessment
  • Compliance checking

Legal Documents #

  • Contract analysis
  • Case law research
  • Regulatory document review
  • Due diligence

Medical Literature #

  • Clinical trial reports
  • Drug instructions
  • Medical guidelines
  • Research paper review

Technical Documentation #

  • API documentation
  • System architecture docs
  • Operation manuals
  • Technical specifications

Architecture Design #

┌─────────────────────────────────────────┐
│           User Query                     │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│      Query Understanding (LLM)           │
│  - Intent analysis                       │
│  - Keyword extraction                    │
│  - Question classification               │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│      Tree Structure Traversal            │
│  - Node relevance scoring                │
│  - Pruning optimization                 │
│  - Multi-path exploration                │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│      Content Retrieval                   │
│  - Precise positioning                   │
│  - Context expansion                     │
│  - Source marking                        │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│      Answer Generation (LLM)             │
│  - Information synthesis                │
│  - Answer structuring                    │
│  - Citation addition                     │
└─────────────────────────────────────────┘
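
The four stages above map naturally onto a pipeline of functions. Here is a minimal sketch with keyword matching standing in for the two LLM stages; every function name and data structure is illustrative, not PageIndex's actual interface:

```python
def understand(query: str) -> dict:
    # Stage 1: intent analysis and keyword extraction (an LLM in practice).
    return {"query": query, "keywords": query.lower().split()}

def traverse(parsed: dict, tree: dict) -> list[dict]:
    # Stage 2: walk the tree, keeping nodes whose title matches a keyword.
    hits, stack = [], [tree]
    while stack:
        node = stack.pop()
        if any(k in node["title"].lower() for k in parsed["keywords"]):
            hits.append(node)
        stack.extend(node.get("nodes", []))
    return hits

def retrieve_content(hits: list[dict], pages: dict) -> list[str]:
    # Stage 3: pull the page range for each matched node.
    return [" ".join(pages[p] for p in range(n["start_index"], n["end_index"]))
            for n in hits]

def answer(parsed: dict, passages: list[str]) -> dict:
    # Stage 4: synthesize an answer with citations (an LLM in practice).
    return {"answer": passages[0] if passages else "No relevant section found.",
            "sources": len(passages)}

tree = {"title": "Financial Stability", "start_index": 0, "end_index": 1,
        "nodes": [{"title": "Monitoring Financial Vulnerabilities",
                   "start_index": 1, "end_index": 2, "nodes": []}]}
pages = {0: "Overview of financial stability.",
         1: "Details on monitored vulnerabilities."}

parsed = understand("vulnerabilities")
result = answer(parsed, retrieve_content(traverse(parsed, tree), pages))
print(result["answer"])  # Details on monitored vulnerabilities.
```

Each stage consumes the previous stage's output, so swapping the keyword stubs for real LLM calls changes only the two stage functions, not the pipeline's shape.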

Community & Ecosystem #

  • GitHub Stars: 29,202+
  • Contributors: 50+
  • Release Cycle: Weekly updates
  • Community: Active Discord channel
  • VectifyAI: Commercial version providing enterprise-grade support
  • PageIndex Hub: Community-contributed document templates
  • PageIndex CLI: Command-line tool for batch processing

Summary #

PageIndex represents a paradigm shift in document retrieval:

  • No vector database required, dramatically reducing costs
  • Reasoning-driven retrieval, more aligned with human thinking
  • Explainable results, every answer has traceable sources
  • High accuracy, reaching 98.7% on professional benchmarks

For scenarios requiring processing large volumes of professional documents (finance, law, medicine), PageIndex is an option worth prioritizing.


Published Friday, May 15, 2026 · Last updated Friday, May 15, 2026