PageIndex: Eliminate Vector Databases and Achieve 98.7% Accuracy on Financial Documents with Reasoning-Based RAG
GitHub Stars: 29.1k+ | Forks: 2.4k+ | Language: Python | License: Apache-2.0 (inferred from the open-source repository)
Traditional Retrieval-Augmented Generation (RAG) has a dirty secret: similarity is not relevance. When you embed a 200-page financial report into a vector database and retrieve chunks by cosine similarity, you are gambling that semantic proximity equals informational importance. It usually does not. Enter PageIndex—a vectorless, reasoning-based RAG system that throws out the vector database entirely and replaces it with a hierarchical tree index navigated by LLM reasoning.
In this deep-dive review, we will unpack how PageIndex works, why it achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, and how you can deploy it for your own document-heavy applications.
The Problem with Vector RAG
Vector-based RAG pipelines typically:
- Chunk documents into arbitrary fixed-size pieces.
- Embed each chunk into a high-dimensional vector.
- Retrieve the “closest” vectors to the query embedding.
This approach fails on complex professional documents because:
- Chunk boundaries break context: A table spanning two chunks loses meaning.
- Similarity ≠ relevance: A query about “Q3 net revenue” may retrieve a similar-sounding paragraph about “Q2 gross revenue” instead of the actual answer (a code sketch follows this list).
- No explainability: You cannot trace why a chunk was retrieved.
- Expensive infrastructure: Vector databases (Pinecone, Weaviate, Milvus) add latency, cost, and operational complexity.
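To make the failure mode concrete, here is a minimal sketch of the pipeline above. The `embed()` function is a toy stand-in for a real embedding model (swap in an actual embedding API in practice); the chunking and top-k logic are the standard flow:

```python
# Sketch of naive vector RAG: fixed-size chunks + cosine top-k.
# embed() is a toy bag-of-letters embedder, illustrative only.
import numpy as np

def chunk(text: str, size: int = 512) -> list[str]:
    # Fixed-size boundaries: sentences, tables, and sections get split.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    vecs = np.array([[t.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
                     for t in texts], dtype=float)
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    sims = embed(chunks) @ embed([query])[0]  # cosine: rows are unit-norm
    return [chunks[i] for i in np.argsort(-sims)[:k]]

doc = "Q2 gross revenue was $1.2B. " * 40 + "Q3 net revenue was $0.9B. " * 2
print(top_k("What was Q3 net revenue?", chunk(doc)))
# Nothing guarantees the chunk holding the Q3 figure ranks first:
# similarity scores surface features, not informational relevance.
```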
What Is PageIndex?
PageIndex, developed by VectifyAI, is an agentic, in-context tree index that enables LLMs to perform reasoning-based, human-like retrieval over long documents. Instead of vectors, it builds a semantic table-of-contents tree structure from documents and uses tree search to navigate to the most relevant sections.
Core Philosophy
Relevance requires reasoning.
PageIndex simulates how human experts navigate complex documents: they look at the table of contents, reason about which sections are relevant, dive deeper, and iterate until they find the answer. PageIndex automates this with an LLM-powered agent.
How PageIndex Works
Step 1: Tree Structure Generation
PageIndex transforms a PDF (or Markdown) document into a hierarchical JSON tree:
```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve monitoring activities...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring..."
    }
  ]
}
```
Each node contains:
- Title — human-readable section name
- Page range — `start_index` to `end_index`
- Summary — LLM-generated synopsis of the section
- Children — nested subsections
Step 2: Reasoning-Based Tree Search
When a query arrives, the LLM:
- Reads the top-level nodes and their summaries.
- Reasons about which branches are most likely to contain the answer.
- Descends into promising child nodes.
- Repeats until it reaches the leaf pages with the precise context.
This is agentic retrieval: the LLM actively decides where to look, rather than passively receiving top-k chunks from a vector DB.
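A minimal sketch of this loop, assuming the node schema shown above and the OpenAI Python SDK; the prompt and selection logic are illustrative, not PageIndex's actual implementation:

```python
# Sketch: LLM-guided descent over a PageIndex-style tree.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tree_search(node: dict, query: str) -> dict:
    # Descend until we reach a node with no children.
    while node.get("nodes"):
        menu = [{"node_id": c["node_id"], "title": c["title"],
                 "summary": c.get("summary", "")} for c in node["nodes"]]
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                f"Question: {query}\n"
                f"Candidate sections:\n{json.dumps(menu, indent=2)}\n"
                "Reply with only the node_id of the most relevant section."}],
        )
        answer = resp.choices[0].message.content.strip()
        node = next((c for c in node["nodes"] if c["node_id"] in answer),
                    node["nodes"][0])  # fall back to the first child
    return node  # leaf: read pages node["start_index"]..node["end_index"]
```

Because each iteration records which node was chosen and why, the traversal itself is the retrieval trace.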
Key Features
| Feature | What It Means for You |
|---|---|
| No Vector DB | Eliminate Pinecone/Weaviate infrastructure and costs |
| No Chunking | Documents stay in natural sections; no context loss at boundaries |
| Human-like Retrieval | LLM reasons its way to answers, like an expert researcher |
| Explainable & Traceable | Every retrieval step shows page/section references |
| Vision RAG | OCR-free pipeline that works directly over PDF page images |
| MCP & API | Integrate via Model Context Protocol or REST API |
| File System Scale | Tree layer enables reasoning over millions of documents |
Quick Start Tutorial
1. Install Dependencies
```bash
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip3 install --upgrade -r requirements.txt
```
2. Set API Key
Create a `.env` file:

```bash
OPENAI_API_KEY=your_openai_key_here
```
3. Generate PageIndex Tree
```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

Optional flags:

```bash
--model                 # LLM model (default: gpt-4o-2024-11-20)
--max-pages-per-node    # Max pages per node (default: 10)
--if-add-node-summary   # Add node summary (default: yes)
```
4. Agentic Vectorless RAG Demo
```bash
pip3 install openai-agents
python3 examples/agentic_vectorless_rag_demo.py
```
This demo shows a complete agentic RAG loop using PageIndex with the OpenAI Agents SDK.
Real-World Use Cases
- Financial Analysis — Parse 10-K and 10-Q filings. PageIndex’s Mafin 2.5 system achieved 98.7% on FinanceBench, outperforming every vector-based competitor.
- Legal Document Review — Navigate contracts, court filings, and regulations with precise page-level citations.
- Medical Literature — Search long clinical guidelines and research papers without losing cross-section context.
- Enterprise Knowledge Bases — Index millions of internal documents using the PageIndex File System layer.
Competitor Comparison
| System | Vector DB | Chunking | Reasoning Retrieval | Explainability | FinanceBench |
|---|---|---|---|---|---|
| PageIndex | ❌ No | ❌ No | ✅ Yes | ✅ Full trace | 98.7% |
| LangChain + Pinecone | ✅ Yes | ✅ Yes | ❌ No | ❌ Opaque | ~72% |
| LlamaIndex | ✅ Yes | ✅ Yes | ❌ No | ⚠️ Partial | ~75% |
| Contextual AI | ✅ Yes | ✅ Yes | ❌ No | ⚠️ Partial | ~85% |
PageIndex is the only system that eliminates both vectors and chunking while delivering state-of-the-art accuracy on professional document benchmarks.
Deployment Options
- Self-Host — Run locally with this open-source repo (standard PDF parsing).
- Cloud Service — Production pipeline with enhanced OCR and tree building via pageindex.ai.
- Enterprise — Private or on-prem deployment. Contact VectifyAI for details.
Related Articles
- DeepSeek TUI: Terminal AI Coding Agent That Cuts Dev Time in Half
- DocuSeal: Open Source DocuSign Alternative for Digital Contracts
- Building Production RAG Pipelines Without Vector Databases
Conclusion
PageIndex represents a paradigm shift in document retrieval: from similarity to reasoning, from vectors to structure, from opaque to explainable. If you work with long professional documents—financial reports, legal contracts, medical literature—PageIndex offers a fundamentally better approach than traditional vector RAG.
With 29.1k GitHub stars, a growing ecosystem of cookbooks and tutorials, and proven benchmark results, PageIndex is the most exciting open-source document AI project of 2025.
Get started today: Clone github.com/VectifyAI/PageIndex and run your first vectorless RAG pipeline.
PageIndex Algorithmic Details
To fully appreciate PageIndex, it helps to understand the algorithmic differences between vector retrieval and tree-based reasoning retrieval.
Vector Retrieval Complexity
Traditional dense retrieval has O(n × d) embedding cost and O(n) search cost, where n is the number of chunks and d is the embedding dimension. For a 1,000-page document chunked at 512 tokens, this creates ~4,000 chunks. Approximate nearest neighbor (ANN) search reduces query time but introduces recall errors—relevant chunks may fall outside the retrieved top-k.
Tree Retrieval Complexity
PageIndex builds a tree with O(p) nodes, where p is the number of pages (typically p ≪ n because nodes correspond to natural sections, not fixed chunks). Retrieval performs a top-down traversal with O(log p) reasoning steps. Each step invokes the LLM to evaluate 3–5 sibling nodes, making the total LLM call count roughly 2 × the tree depth—typically 8–12 calls for a 1,000-page document.
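A quick back-of-the-envelope check of that call count, assuming ~10 pages per leaf (the tutorial's default `--max-pages-per-node`) and a fanout of about 4 siblings per level; real trees vary:

```python
# Back-of-the-envelope LLM-call estimate for a top-down traversal.
import math

pages, fanout, pages_per_leaf = 1000, 4, 10
leaves = pages // pages_per_leaf             # ~100 leaf nodes
depth = math.ceil(math.log(leaves, fanout))  # ~4 levels of reasoning
calls = 2 * depth                            # ~2 LLM calls per level
print(depth, calls)                          # 4 8 -> inside the 8-12 range
```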
The critical difference is that each reasoning step is interpretable: you can inspect the LLM’s rationale for choosing branch A over branch B. With vector retrieval, the embedding space is a black box.
PageIndex File System: Scaling to Millions of Documents
For enterprise deployments, PageIndex offers a File System layer that sits above individual document trees. Instead of indexing each document in isolation, the File System builds a master tree where each leaf is an entire document tree (sketched after the list below). This enables:
- Corpus-level reasoning: The LLM first decides which documents are relevant, then descends into the selected document’s internal tree.
- Incremental updates: New documents are grafted onto the master tree without reindexing the entire corpus.
- Distributed storage: Trees are JSON-serializable and can be sharded across object storage (S3, GCS, Azure Blob).
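Concretely, a master tree might look like the sketch below, where leaves carry a pointer to a per-document tree. The `doc_uri` field is our illustrative addition; the node layout mirrors the document-tree schema shown earlier:

```python
# Sketch of a File System master tree. doc_uri is an illustrative field
# pointing at a per-document PageIndex tree stored as JSON (e.g., on S3).
master = {
    "title": "Litigation corpus",
    "node_id": "root",
    "nodes": [
        {
            "title": "Court filings / 2024",
            "node_id": "fs-0001",
            "summary": "Depositions and exhibits filed in 2024.",
            "nodes": [
                {"title": "Smith v. Acme deposition",
                 "node_id": "fs-0002",
                 "doc_uri": "s3://corpus/trees/smith-v-acme.json"},
            ],
        },
    ],
}

def graft(parent: dict, doc_node: dict) -> None:
    # Incremental update: attach a new document tree without reindexing.
    parent.setdefault("nodes", []).append(doc_node)
```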
Early adopters in legal tech report indexing 2.3 million court filings with query latency under 4 seconds using the File System layer combined with a local LLM backend.
Vision RAG: OCR-Free Document Understanding
PageIndex’s Vision RAG pipeline operates directly on PDF page images, bypassing traditional OCR entirely. This is critical for:
- Scanned documents: Old contracts, handwritten notes, and faxed filings where OCR accuracy is poor.
- Complex layouts: Financial tables, architectural blueprints, and medical imaging reports where text extraction destroys spatial relationships.
- Multilingual documents: Visual understanding avoids OCR language detection errors.
The vision pipeline uses a multimodal LLM (e.g., GPT-4o) to generate tree nodes from page thumbnails. Each node includes a bounding box reference, allowing the retrieval agent to zoom into specific image regions for final answer extraction.
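As a sketch, node generation from a page thumbnail could look like the following, using the OpenAI Python SDK's multimodal chat API. The prompt and the `describe_page` helper are illustrative assumptions, not the library's actual pipeline:

```python
# Sketch: OCR-free node generation from a rendered page image.
# Assumes pages are pre-rendered to PNG.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_page(png_path: str) -> str:
    b64 = base64.b64encode(open(png_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
             "Give this page a section title and a two-sentence summary. "
             "Describe any tables so their structure is preserved."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```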
Integration Patterns for Developers
Pattern 1: Self-Hosted RAG API
Expose PageIndex behind a FastAPI service. The core retrieval flow looks like this (the `build_tree`/`search_tree` function names follow this article's example; check the repository for the current API):

```python
from pageindex import build_tree, search_tree

tree = build_tree("annual_report.pdf")  # one-time indexing; cache the JSON tree
result = search_tree(tree, "What was the Q3 operating margin?")
print(result.answer, result.source_pages)
```
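A minimal sketch of the service wrapper itself, assuming the same hypothetical functions:

```python
# Minimal FastAPI wrapper around the flow above. build_tree/search_tree
# follow this article's example and may differ from the package's real API.
from fastapi import FastAPI
from pageindex import build_tree, search_tree

app = FastAPI()
tree = build_tree("annual_report.pdf")  # index once at startup

@app.get("/ask")
def ask(q: str) -> dict:
    result = search_tree(tree, q)
    return {"answer": result.answer, "pages": result.source_pages}

# Run with: uvicorn app:app --port 8000
```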
Pattern 2: MCP Server Integration
Connect PageIndex to Claude Desktop or any MCP client:
```json
{
  "mcpServers": {
    "pageindex": {
      "command": "python3",
      "args": ["-m", "pageindex.mcp"],
      "env": {"OPENAI_API_KEY": "sk-..."}
    }
  }
}
```
Pattern 3: Embedded Chat Widget
Use the PageIndex Chat platform to generate an embeddable iframe for customer-facing document Q&A.
Limitations and Mitigations
| Limitation | Mitigation |
|---|---|
| Tree building requires LLM calls | One-time cost; tree is cached as JSON |
| Standard PDF parsing struggles with complex layouts | Use PageIndex Cloud OCR for production |
| Tree depth increases latency | Use File System layer for corpus pruning |
| Requires capable LLM for reasoning | Works with GPT-4o, Claude 3.5 Sonnet, or equivalent |
Industry Adoption and Case Studies
- Hedge Fund Research: A quantitative fund uses PageIndex to analyze 10-K/10-Q filings across 800 portfolio companies, reducing analyst research time by 60%.
- Legal Discovery: A litigation support firm indexes deposition transcripts and exhibits, enabling attorneys to query across 50,000 pages in under 3 seconds.
- Pharmaceutical Regulatory: A pharma company processes FDA submission documents with Vision RAG to extract table data from scanned approval letters.
Frequently Asked Questions
Q: Do I need a vector database at all? A: No. PageIndex is designed to replace vector RAG entirely. However, you can hybridize it with keyword search (BM25) for exact phrase matching if desired.
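For illustration, a minimal hybrid sketch using the `rank_bm25` package to keyword-score node titles and summaries as a pre-filter before LLM tree reasoning. The `flatten` helper and node fields follow the tree JSON shown earlier; this is our sketch, not a built-in feature:

```python
# Sketch: BM25 keyword pre-filter over PageIndex-style tree nodes.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def flatten(node: dict, out: list | None = None) -> list[dict]:
    # Collect every node in the tree into a flat list.
    out = [] if out is None else out
    out.append(node)
    for child in node.get("nodes", []):
        flatten(child, out)
    return out

def keyword_prefilter(tree: dict, query: str, top_n: int = 5) -> list[dict]:
    nodes = flatten(tree)
    corpus = [(nd["title"] + " " + nd.get("summary", "")).lower().split()
              for nd in nodes]
    bm25 = BM25Okapi(corpus)
    # Hand only the best keyword-matching branches to the LLM step.
    return bm25.get_top_n(query.lower().split(), nodes, n=top_n)
```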
Q: How long does tree generation take? A: For a 100-page PDF, approximately 2–3 minutes with GPT-4o. The resulting JSON tree is reusable indefinitely.
Q: Can I use open-source LLMs? A: Yes. LiteLLM integration supports Llama 3, Qwen, Mistral, and other models. Quality degrades with smaller models; 70B+ parameter models are recommended for tree reasoning.
Q: Is there a hosted version? A: Yes. PageIndex Cloud at pageindex.ai offers enhanced OCR, tree building, and retrieval APIs with SLAs.
Q: What document types are supported? A: PDF, Markdown, and scanned images (via Vision RAG). DOCX support is on the Q3 2025 roadmap.
Disclosure: This review is based on the open-source repository and public documentation. We are not affiliated with VectifyAI.