MinerU — the open-source document parsing engine that turned 70,600 GitHub stars in just over a year.

In the age of AI coding agents and RAG pipelines, the biggest bottleneck isn’t model capability — it’s data quality. You can have the most powerful LLM in the world, but if your input documents are messy PDFs with broken tables, unreadable formulas, and garbled headers, the output will be garbage.

Enter MinerU, a document parsing tool from OpenDataLab (the team behind InternLM) that converts PDF, DOCX, PPTX, XLSX, images, and web pages into clean, structured Markdown and JSON — specifically designed for downstream LLM, RAG, and Agent workflows.

With 70,600+ GitHub stars, 5,900 forks, and 960 stars gained today alone, MinerU has quickly become the go-to open-source solution for document-to-LLM-pipeline conversion.

What Is MinerU? #

MinerU was born during the pre-training process of InternLM, a Chinese large language model. The team noticed that converting scientific papers and technical documents into machine-readable format was a massive pain point — especially for formulas, tables, and complex layouts.

So they built MinerU.

Unlike traditional PDF parsers that just extract raw text, MinerU understands document structure:

Removes headers, footers, footnotes, and page numbers that break semantic coherence
Preserves reading order for single-column, multi-column, and complex layouts
Converts formulas to LaTeX, tables to HTML
Detects scanned and garbled PDFs and automatically enables OCR
Supports 109 languages for OCR recognition
Outputs multimodal Markdown, NLP Markdown, and JSON sorted by reading order

The result is document content that AI agents and LLMs can actually understand.

Installation and Setup #

MinerU offers multiple installation paths depending on your needs:

pip (Recommended for most users) #

pip install mineru

Docker #

docker pull mineru/mineru:latest
docker run --gpus all -v $(pwd):/data mineru/mineru:latest

Local Development #

git clone https://github.com/opendatalab/MinerU.git
cd MinerU
pip install -e .

MinerU supports both CPU-only and GPU-accelerated inference. For GPU acceleration, install with CUDA support:

pip install mineru[cuda]

On macOS with Apple Silicon, MinerU leverages MPS (Metal Performance Shaders) for acceleration.

Three Parsing Engines #

MinerU provides three different parsing backends, each optimized for different scenarios:

1. Pipeline Backend (Fast & Stable) #

The pipeline backend is the default choice for most users. It’s fast, stable, and produces no hallucinations. It runs efficiently on CPU and is ideal for batch processing.

mineru ./input.pdf -o ./output/

Best for: High-volume document processing, CI/CD pipelines, CPU-only environments.

2. VLM Engine (High Accuracy) #

The VLM (Vision Language Model) engine uses MinerU’s proprietary MinerU2.5-Pro-2604-1.2B model for state-of-the-art parsing accuracy. It excels on complex documents with mixed layouts, handwritten text, and dense formulas.

mineru ./complex.pdf -o ./output/ --engine vlm-engine

Best for: Complex scientific papers, scanned documents, handwritten content, multi-language OCR.

3. Hybrid Engine (Balanced) #

The hybrid engine combines native text extraction with VLM-based analysis. Starting from version 3.3, it includes an effort parameter with medium and high levels:

Medium effort: 35-220% faster than high, with only 0.13-point accuracy drop on OmniDocBench
High effort: Maximum accuracy with image analysis support

mineru ./document.pdf -o ./output/ --engine hybrid-engine --effort medium

Best for: Production workloads where you need to balance speed and accuracy.

Supported Formats #

MinerU supports a comprehensive range of input formats:

Format	Support Level	Notes
PDF	Native	Text PDFs, scanned PDFs, garbled PDFs
DOCX	Native	Full structural preservation
PPTX	Native	Slides, layouts, embedded content
XLSX	Native	Tables, formulas, charts
Images	OCR	JPEG, PNG, TIFF, BMP, and more
Web Pages	Scraping	Full-page conversion to Markdown

The native DOCX, PPTX, and XLSX support (added in version 3.1.0) means MinerU can parse these formats directly without converting to PDF first — a significant speed improvement over traditional workflows.

OCR: 109 Languages #

MinerU’s OCR engine supports 109 languages and was upgraded to PP-OCRv6 in version 3.4, improving accuracy by approximately 11% on the OmniDocBench v1.6 benchmark.

Starting from version 3.4, MinerU simplified the OCR language configuration. Instead of selecting individual languages (Japanese, Traditional Chinese, English, Latin), all scenarios now route through the optimized ch OCR model, reducing configuration complexity while improving accuracy.

# Automatic OCR detection (recommended)
mineru ./scanned.pdf -o ./output/

# Force OCR on a specific document
mineru ./document.pdf -o ./output/ --ocr

Real-World Use Cases #

RAG Pipeline Data Preparation #

MinerU is purpose-built for RAG (Retrieval-Augmented Generation) workflows. By converting documents into structured Markdown with preserved reading order, headings, and semantic structure, RAG systems can chunk and embed documents far more effectively.

import mineru as mu

result = mu.parse("./research_paper.pdf")
# result.markdown: Clean Markdown for embedding
# result.json: Structured JSON for retrieval
# result.layout: Visual layout for debugging

AI Agent Knowledge Base #

For AI agents that need to reason over documents, MinerU provides the cleanest possible input. The removal of headers, footers, and page numbers ensures agents don’t get confused by repetitive page artifacts.

Training Data Production #

MinerU was originally built to support InternLM’s pre-training pipeline. Its ability to convert complex scientific papers into clean text makes it invaluable for producing high-quality training data for domain-specific LLMs.

Batch Document Processing #

With support for multi-threaded concurrent inference and streaming writes to disk, MinerU can process thousands of documents without manual splitting. The sliding-window mechanism reduces peak memory usage in long-document scenarios.

Integration Ecosystem #

MinerU integrates with virtually every major AI framework:

Framework	Integration
LangChain	Native document loader
LlamaIndex	Document parser integration
RAGFlow	Built-in parser
RAG-Anything	Pipeline integration
Flowise	Visual workflow support
Dify	Document processing node
FastGPT	Native support

MCP Server #

MinerU also provides an MCP Server for integration with AI coding tools like Cursor, Claude Desktop, and Windsurf. This allows you to parse documents directly within your coding agent’s workflow.

# Start the MCP server
mineru-mcp-server

Performance Benchmarks #

MinerU’s pipeline backend achieves a score of 86.2 on OmniDocBench v1.5, surpassing the accuracy of the previous-generation VLM model MinerU2.0-2505-0.9B.

The Hybrid engine with effort=medium delivers:

~80% faster for text PDF scenarios on Linux
~90% faster for text PDF scenarios on Windows
~220% faster for text PDF scenarios on macOS
Only 0.13-point accuracy drop compared to effort=high

Why MinerU Stands Out #

Built for AI, not humans. Unlike general-purpose PDF converters, MinerU’s output is optimized for LLM consumption — preserving semantic structure while removing noise that confuses AI models.
Multi-format native support. DOCX, PPTX, XLSX, PDF, images, and web pages — all parsed natively without format conversion intermediaries.
Production-ready at scale. Multi-threaded concurrency, GPU/CPU flexibility, Docker deployment, and a robust API/CLI/router architecture make MinerU suitable for enterprise workloads.
Open license. Moved from AGPLv3 to a custom Apache 2.0-based license in version 3.1.0, significantly reducing adoption barriers for both community and commercial use.
Domestic AI chip support. Compatible with 10+ Chinese AI accelerators including Ascend, Cambricon, Enflame, Moore Threads, and others.

Getting Started #

The easiest way to try MinerU is through the online web application, which offers the same functionality as the desktop client with no installation required.

For developers, the Colab notebook provides a quick interactive demo.

# Install and parse your first document
pip install mineru
mineru ./my-document.pdf -o ./output/ --format markdown

Limitations #

Complex layouts: While MinerU handles most document types well, extremely complex layouts (multi-column scientific papers with interleaved figures and tables) may still require manual review.
GPU requirement for VLM: The VLM engine requires significant VRAM (8GB+ recommended) for optimal performance.
Learning curve: The three-engine architecture and various configuration options can be overwhelming for beginners.

Conclusion #

MinerU represents a significant step forward in document-to-LLM conversion. By combining native multi-format parsing, 109-language OCR, and AI-optimized output formatting, it fills a critical gap in the modern AI data pipeline.

Whether you’re building a RAG system, training a domain-specific LLM, or creating a knowledge base for AI agents, MinerU provides the clean, structured input that makes these systems actually work.

With 70,600+ stars, an active development team, and growing framework integrations, MinerU is positioned to become the de facto standard for open-source document parsing in the AI era.

Resources #

GitHub repository: https://github.com/opendatalab/MinerU
Documentation: https://opendatalab.github.io/MinerU/
Online demo: https://mineru.net/OpenSourceTools/Extractor
Technical report (v3.1): https://arxiv.org/abs/2604.04771
Technical report (v2.5): https://arxiv.org/abs/2509.22186
Technical report (v2.0): https://arxiv.org/abs/2409.18839
HuggingFace demo: https://huggingface.co/spaces/opendatalab/MinerU
ModelScope demo: https://www.modelscope.cn/studios/OpenDataLab/MinerU
Discord community: https://discord.gg/Tdedn9GTXq

Disclosure: This article contains affiliate links. If you sign up through our links, we may earn a small commission at no additional cost to you. This helps support independent tech journalism and keeps resources like dibi8.com free and ad-free.

MinerU: 70.6K Stars — Convert Any Document to LLM-Ready Markdown

What Is MinerU? #

Installation and Setup #

pip (Recommended for most users) #

Docker #

Local Development #

Three Parsing Engines #

1. Pipeline Backend (Fast & Stable) #

2. VLM Engine (High Accuracy) #

3. Hybrid Engine (Balanced) #

Supported Formats #

OCR: 109 Languages #

Real-World Use Cases #

RAG Pipeline Data Preparation #

AI Agent Knowledge Base #

Training Data Production #

Batch Document Processing #

Integration Ecosystem #

MCP Server #

Performance Benchmarks #

Why MinerU Stands Out #

Getting Started #

Limitations #

Conclusion #

Resources #

📦 Featured in collections

💬 Discussion

What Is MinerU? #

Installation and Setup #

pip (Recommended for most users) #

Docker #

Local Development #

Three Parsing Engines #

1. Pipeline Backend (Fast & Stable) #

2. VLM Engine (High Accuracy) #

3. Hybrid Engine (Balanced) #

Supported Formats #

OCR: 109 Languages #

Real-World Use Cases #

RAG Pipeline Data Preparation #

AI Agent Knowledge Base #

Training Data Production #

Batch Document Processing #

Integration Ecosystem #

MCP Server #

Performance Benchmarks #

Why MinerU Stands Out #

Getting Started #

Limitations #

Conclusion #

Resources #

🔗 Related Resources

📦 Featured in collections

💬 Discussion