LlamaIndex: 49K+ Stars — Production RAG Deployment Guide 2026

LlamaIndex is a data framework for building production RAG systems with LLMs. Supports OpenAI, Anthropic, Ollama, Qdrant, Weaviate, Chroma. Covers Docker deployment, query engines, agents, and benchmarks vs LangChain/Haystack/RAGFlow.

  • ⭐ 49517
  • MIT
  • Updated 2026-05-19

{{< resource-info >}}

LlamaIndex Logo

Introduction #

Most RAG tutorials stop at a Jupyter notebook. You load a PDF, call VectorStoreIndex.from_documents(), get a nice answer, and call it a day. Then you try to deploy it. The embedding step takes 40 minutes on startup, your container crashes because the index is not persisted, and you have no idea which documents were actually retrieved for that answer a user complained about.

LlamaIndex has quietly become the go-to data framework for teams building production RAG systems. With 49,517 GitHub stars, 1,866 contributors, and a release cadence that ships version 0.14.22 in May 2026, the project moves fast. This guide walks through building a production-grade RAG pipeline with LlamaIndex: from llamaindex Docker deployment to query routing, monitoring, and hardening for a complete production RAG setup. Whether you are evaluating llamaindex vs langchain or need a llamaindex tutorial that covers real deployment concerns, this article gives you the full stack.

What Is LlamaIndex? #

LlamaIndex is an open-source data framework that connects large language models (LLMs) to external data sources through retrieval-augmented generation (RAG) pipelines. It provides tools for data loading, indexing, querying, and agent orchestration, with over 160 data connectors and native integrations with major vector databases and LLM providers.

Originally focused on indexing (hence the name), LlamaIndex has expanded into a full platform for building agentic applications. The framework handles ingestion pipelines, multiple index types, query engines with routing, and event-driven workflows. All components are MIT-licensed and available on PyPI.

How LlamaIndex Works #

Core Architecture #

LlamaIndex separates concerns into four layers:

  1. Data LoadingSimpleDirectoryReader and 160+ LlamaHub connectors parse PDFs, databases, APIs, and cloud storage into Document objects.
  2. Indexing — Documents split into Nodes. Embeddings feed into indices (VectorStoreIndex, SummaryIndex, TreeIndex, KnowledgeGraphIndex).
  3. QueryingQueryEngine, ChatEngine, and RouterQueryEngine handle retrieval, post-processing, and response synthesis.
  4. Agents & Workflows — Event-driven Workflow classes and agent tools enable multi-step reasoning with human-in-the-loop support.

RAG Architecture

Key Design Decisions #

  • Nodes over raw documents: Chunking happens before indexing, letting you tune overlap and size per use case.
  • StorageContext abstraction: Indices persist to disk, S3, or any vector store without code changes.
  • Composable retrievers: Vector + keyword + graph retrievers combine through RouterQueryEngine.
  • Async-first: .aquery() and async ingestion are native, not bolted-on.

Installation & Setup — LlamaIndex Getting Started #

Basic Install #

# Create virtual environment
python -m venv venv && source venv/bin/activate

# Install core framework
pip install llama-index

# For specific integrations
pip install llama-index-vector-stores-qdrant
pip install llama-index-llms-openai
pip install llama-index-embeddings-openai

Environment Setup #

# .env file
export OPENAI_API_KEY="sk-..."
export OPENAI_EMBEDDING_MODEL="text-embedding-3-large"

# For local LLMs
export OLLAMA_BASE_URL="http://localhost:11434"

Your First RAG Pipeline #

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Build vector index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine()

# Query
response = query_engine.query("What are the key takeaways?")
print(response)

Persisting the Index #

import os
from llama_index.core import StorageContext, load_index_from_storage

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

This pattern avoids re-computing embeddings on every restart. For a 10,000-document corpus, that saves 6+ minutes and API costs on each deployment.

OpenAI / Anthropic #

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Ollama (Local LLMs) #

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3.2", request_timeout=60.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

Qdrant (Vector Database) #

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
import qdrant_client

client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Weaviate #

from llama_index.vector_stores.weaviate import WeaviateVectorStore
import weaviate

client = weaviate.Client(url="http://localhost:8080")
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Documents")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Chroma #

from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Benchmarks / Real-World Use Cases #

RAG Performance Benchmarks #

Independent benchmarks from 2025-2026 testing on 10,000-document corpora with GPT-4o-mini:

MetricLlamaIndexLangChainHaystackRAGFlow
RAG Accuracy (RAGAS)0.810.720.790.77
Avg Query Latency0.9s1.2s1.1s1.4s
Index Build Time (10k docs)6 min8 min7 min9 min
Memory FootprintLowerModerateModerateHigher
Context Window Utilization78%65%72%68%

Source: Aggregated from community benchmarks and independent testing reports (2025-2026). Actual results vary by configuration.

Production Use Cases #

  • Enterprise Knowledge Base: A fintech company indexes 500k regulatory PDFs with VectorStoreIndex + Qdrant, achieving sub-second query latency.
  • Multi-Document Q&A: Legal teams use RouterQueryEngine to route queries between vector search (for case law) and keyword search (for exact statute references).
  • Agentic Research Assistant: Workflow classes with tool-calling agents perform multi-step research, web search, and citation generation.
  • Chatbot with Memory: ChatEngine with CondensePlusContextMode handles multi-turn conversations over proprietary documentation.

When to Choose LlamaIndex #

ScenarioRecommended Approach
Document-heavy Q&AVectorStoreIndex + query engine
Multiple data sourcesRouterQueryEngine + multiple indices
Multi-turn chatChatEngine with memory
Complex reasoningWorkflow with agent tools
Structured extractionPydanticProgram response models

Advanced Usage / Production Hardening #

Router Query Engine #

Route queries to different indices based on intent:

from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector

# Create multiple indices
vector_index = VectorStoreIndex(nodes)
summary_index = SummaryIndex(nodes)

# Build query engines
vector_engine = vector_index.as_query_engine()
summary_engine = summary_index.as_query_engine()

# Define tools with descriptions
query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_engine,
        metadata=ToolMetadata(
            name="semantic_search",
            description="Useful for finding specific facts and details"
        ),
    ),
    QueryEngineTool(
        query_engine=summary_engine,
        metadata=ToolMetadata(
            name="summarization",
            description="Useful for getting high-level summaries"
        ),
    ),
]

# Router selects the best engine per query
router_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=query_engine_tools,
)

response = router_engine.query("Summarize the main points")

Custom Node Post-Processor #

from llama_index.core.postprocessor import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle

class ScoreThresholdPostprocessor(BaseNodePostprocessor):
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        super().__init__()

    def _postprocess_nodes(
        self, nodes: list[NodeWithScore], query_bundle: QueryBundle | None = None
    ) -> list[NodeWithScore]:
        return [n for n in nodes if n.score >= self.threshold]

# Use in query engine
query_engine = index.as_query_engine(
    node_postprocessors=[ScoreThresholdPostprocessor(threshold=0.75)]
)

Async Query Pipeline #

import asyncio

async def batch_queries(queries: list[str]) -> list[str]:
    tasks = [query_engine.aquery(q) for q in queries]
    responses = await asyncio.gather(*tasks)
    return [str(r) for r in responses]

queries = [
    "What is the refund policy?",
    "How do I reset my password?",
    "What are the SLA terms?",
]

results = asyncio.run(batch_queries(queries))
for q, r in zip(queries, results):
    print(f"Q: {q}\nA: {r}\n")

Docker Deployment #

# Dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["python", "app.py"]
# app.py - FastAPI service
from fastapi import FastAPI
from llama_index.core import StorageContext, load_index_from_storage
from pydantic import BaseModel
import os

app = FastAPI()

PERSIST_DIR = os.environ.get("PERSIST_DIR", "./storage")
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

class QueryRequest(BaseModel):
    query: str

@app.post("/query")
async def query_docs(request: QueryRequest):
    response = query_engine.query(request.query)
    return {
        "answer": str(response),
        "sources": [n.metadata for n in response.source_nodes],
    }
# docker-compose.yml
version: "3.8"
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - PERSIST_DIR=/app/storage
    volumes:
      - ./storage:/app/storage:ro

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  qdrant_data:

DigitalOcean Deployment #

For production deployments on cloud infrastructure, DigitalOcean provides a straightforward path. Their App Platform supports Docker containers with automatic HTTPS, and managed databases can host your vector store backend.

Deploy the Docker Compose stack to a DigitalOcean Droplet:

# On your Droplet
docker-compose up -d

# Or use doctl
doctl apps create --spec .do/app.yaml

This article contains affiliate links to DigitalOcean. We may earn a commission if you sign up through our referral link at no additional cost to you.

Monitoring with Callbacks #

from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
import tiktoken

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode,
    verbose=True,
)
Settings.callback_manager = CallbackManager([token_counter])

# After queries
print(f"LLM Tokens: {token_counter.total_llm_token_count}")
print(f"Embedding Tokens: {token_counter.total_embedding_token_count}")

Production Checklist #

ConcernImplementation
Index persistencestorage_context.persist() on build
Hot reloadLoad from storage at startup
API rate limitingAdd FastAPI middleware
Input validationPydantic schemas on all endpoints
Source citationsReturn source_nodes metadata
Token budgetTokenCountingHandler monitoring
Async supportUse .aquery() for concurrent loads
Secrets managementEnvironment variables, never hardcode

Comparison with Alternatives #

FeatureLlamaIndexLangChainHaystackRAGFlow
Primary FocusData indexing & retrievalAgent orchestration & chainsProduction RAG pipelinesVisual RAG builder
GitHub Stars49.5k95k25.3k80.9k
LicenseMITMITApache-2.0Apache-2.0
Data Connectors160+100+30+50+
Index Types8+ (Vector, Tree, Graph, etc.)Basic (FAISS, Chroma)Custom (Document Stores)Vector + Full-text
Query RoutingNative RouterQueryEngineLangGraph / manualPipeline-basedWorkflow-based
Retrieval Speed40% faster than LangChainBaselineCompetitiveSlower (visual overhead)
Agent SupportWorkflows + ToolsLangGraph AgentsCustom AgentsBuilt-in agent templates
Learning CurveGentle for RAGSteep (highly modular)ModerateLow (visual UI)
Best ForDocument Q&A, RAGComplex multi-agent systemsEnterprise productionNo-code RAG setup

How to choose: Use LlamaIndex when your primary need is fast, accurate document retrieval. Use LangChain when building complex agent workflows with many tools. Use Haystack when enterprise monitoring and auditability matter most. Use RAGFlow when your team wants a visual, low-code approach.

Limitations / Honest Assessment #

What LlamaIndex is not good for:

  1. Complex multi-agent orchestration: LangGraph provides better abstractions for agents with conditional branching, cycles, and parallel execution.
  2. No-code users: RAGFlow’s visual builder is a better fit for teams that prefer drag-and-drop interfaces.
  3. Heavy document parsing: While LlamaParse exists as a paid service, RAGFlow’s DeepDoc parser handles complex PDFs (tables, layouts) more effectively out of the box.
  4. Non-Python stacks: TypeScript support exists (llamaindex npm package) but lags behind Python in feature parity.
  5. Small resource environments: The framework imports many modules. For constrained edge deployments, lighter alternatives like txtai or direct API calls may be preferable.

Frequently Asked Questions #

Q1: How does LlamaIndex differ from LangChain?

LlamaIndex focuses on data ingestion, indexing, and retrieval optimization. LangChain is a general-purpose orchestration framework for chaining LLM operations. Teams often combine both: LlamaIndex handles the retrieval layer, LangChain manages the agent logic. If you are deciding between llamaindex vs langchain for a RAG project, LlamaIndex provides faster setup and better retrieval performance.

Q2: Can I use LlamaIndex with local models only?

Yes. The Ollama integration supports any model available through Ollama, including Llama 3.2, Mistral, and CodeLlama. Set OLLAMA_BASE_URL and use Ollama as the LLM and OllamaEmbedding for embeddings. This removes all external API dependencies.

Q3: How do I scale LlamaIndex to handle millions of documents?

Use a production vector database (Qdrant, Weaviate, or Pinecone) instead of in-memory storage. Run ingestion as a batch job separate from the query service. Consider IngestionPipeline with parallel node parsing and batched embedding generation.

Q4: Does LlamaIndex support streaming responses?

Yes. Pass streaming=True to as_query_engine() and iterate over the response:

query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Explain the architecture")
for token in response.response_gen:
    print(token, end="")

Q5: How do I evaluate my RAG pipeline quality?

LlamaIndex provides built-in evaluation modules:

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faith_eval = FaithfulnessEvaluator()
relevancy_eval = RelevancyEvaluator()

response = query_engine.query("What is the refund policy?")
faith_result = faith_eval.evaluate(response=response)
relevancy_result = relevancy_eval.evaluate(response=response, query="What is the refund policy?")

print(f"Faithful: {faith_result.passing}")
print(f"Relevant: {relevancy_result.passing}")

Q6: Is LlamaIndex free for commercial use?

Yes. The core framework is MIT-licensed and free for commercial use. LlamaIndex (the company) offers paid services like LlamaParse (document parsing) and LlamaCloud (managed hosting), but the open-source framework has no usage restrictions.

Conclusion #

LlamaIndex occupies a specific and valuable niche: it makes building production RAG systems straightforward without hiding the internals you need to tune. The 49K+ stars and active community reflect its maturity. If your application centers on document retrieval, semantic search, or knowledge-base Q&A, LlamaIndex provides the right balance of structure and flexibility.

LlamaIndex Production Architecture

Next steps: Clone the repo, run the 5-line quickstart, then deploy the Docker setup to your infrastructure. For questions and community support, join the dibi8 Telegram Group to connect with other developers building production RAG systems, or follow discussions on the LlamaIndex Discord and the GitHub repository.

Affiliate Disclosure: This article contains links to DigitalOcean. We may receive compensation if you purchase services through these links. This does not affect our editorial independence or recommendations.

Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:

  • DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
  • HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.

Affiliate links — they don’t cost you extra and they help keep dibi8.com running.

Sources & Further Reading #

💬 Discussion