description: "Learn how to use Microsoft's MarkItDown to convert PDFs, Word docs, images, HTML, PPTX, and more into clean Markdown. Step-by-step installation, usage examples, Python API, AI pipeline integration, benchmarks, and comparisons with Pandoc, Calibre, and LibreOffice."
tags: ['converter', 'file', 'guide', 'markdown', 'open-source', 'reference', 'self-hosted', 'tutorial']
date: 2026-06-10
slug: "microsoft-markitdown-file-to-markdown-converter-cli"
category: dev-utils
github_repo: "https://github.com/microsoft/markitdown"
license: MIT
lang: zh
faqs:

Introduction #

In today's data-driven world, the ability to convert documents into structured, readable, and portable formats is more critical than ever. Whether you are building a Retrieval-Augmented Generation (RAG) pipeline, ingesting documents into an AI knowledge base, or simply trying to extract clean text from complex PDFs, a reliable tool that converts common file formats to Markdown is invaluable. Microsoft MarkItDown is an open-source Python tool built for exactly this purpose: it turns PDFs, Word documents, PowerPoint presentations, images, HTML pages, spreadsheets, and ZIP archives into clean, consistent Markdown output.

MarkItDown is developed and maintained by Microsoft and distributed under the permissive MIT License. It provides both a command-line interface and a Python library, making it equally suitable for interactive use and automated pipelines. It supports more than 20 file formats and requires no configuration, which has made it a go-to tool for developers, researchers, and data scientists who process documents at scale. With over 149,000 GitHub stars, it is one of the most widely adopted document conversion tools in the open-source ecosystem.

What Is MarkItDown? #

MarkItDown is a Python-based command-line tool and library developed by Microsoft that converts files of virtually any common format into Markdown text. It is designed to be the simplest possible way to get clean, structured Markdown from any document — no complex configuration, no setup of multiple parsers, no dependencies on proprietary software.

Key capabilities include:

Multi-format support — Converts PDF, DOCX, PPTX, XLSX, HTML, XML, EPUB, JPEG, PNG, BMP, TIFF, WAV, MP3, ZIP archives, and more
Zero-configuration — No setup required; works out of the box
CLI and Python API — Use as a command-line tool or integrate into Python applications
Batch processing — Process entire directories or ZIP archives in one command
Image OCR — Extract text from images using Tesseract OCR (optional dependency)
MIT licensed — Free for personal, commercial, and enterprise use
Plugin architecture — Extend with custom parsers or community plugins

The tool is particularly popular in the AI/ML community because Markdown is one of the most LLM-friendly formats. By converting documents to Markdown, you make them immediately usable by large language models, embedding pipelines, and vector databases. As document processing becomes a cornerstone of modern AI workflows, MarkItDown provides the universal bridge between proprietary file formats and open, text-based representations.

How MarkItDown Works #

MarkItDown operates on a straightforward principle: detect the file type, apply the appropriate parser, and produce clean Markdown. The tool identifies each input by its extension and MIME type and uses that to choose a conversion approach. For structured formats like DOCX and HTML, it parses the document markup directly. For PDFs, it uses text extraction libraries that handle complex layouts, tables, and multi-column documents.

For PDF files, MarkItDown extracts text while preserving the document’s visual structure — headings become # headings, lists become - bullets, tables are converted to Markdown table syntax, and hyperlinks are preserved. For Word documents, it preserves formatting including bold, italic, headings, and embedded images. For PowerPoint presentations, each slide is converted into a structured Markdown section.

The tool also handles image files through OCR. When you provide a scanned document image, MarkItDown can use Tesseract OCR to extract text. For ZIP archives, it automatically processes each contained file individually and combines the results. The conversion pipeline works as follows:

File detection — The tool identifies the file type by extension and MIME type
Parser selection — The appropriate converter plugin is selected based on file type
Content extraction — Raw text and metadata are extracted from the file
Markdown formatting — Extracted content is formatted into clean, consistent Markdown
Output generation — Markdown text is returned as a string or written to a file
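The five steps above can be sketched as a small dispatch table. This is an illustrative toy, not MarkItDown's internal code: `MIME_BY_EXTENSION`, `CONVERTERS`, `convert_plain_text`, `convert_csv`, and `to_markdown` are hypothetical names, and only two formats are handled.

```python
import csv
import io
from pathlib import Path

# Hypothetical extension-to-MIME table; a real detector covers far more types.
MIME_BY_EXTENSION = {".txt": "text/plain", ".csv": "text/csv"}


def convert_plain_text(data: bytes) -> str:
    """Plain text is already valid Markdown, so just decode it."""
    return data.decode("utf-8", errors="replace")


def convert_csv(data: bytes) -> str:
    """Render CSV rows as a Markdown table, using the first row as the header."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    if not rows:
        return ""
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join("---" for _ in rows[0]) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


# Registry that maps a detected MIME type to the converter handling it.
CONVERTERS = {"text/plain": convert_plain_text, "text/csv": convert_csv}


def to_markdown(filename: str, data: bytes) -> str:
    # Step 1: detect the file type from the extension.
    mime_type = MIME_BY_EXTENSION.get(Path(filename).suffix.lower())
    # Step 2: select the converter registered for that type.
    converter = CONVERTERS.get(mime_type)
    if converter is None:
        raise ValueError(f"no converter registered for {filename!r}")
    # Steps 3-5: extract the content, format it, and return the Markdown string.
    return converter(data)


print(to_markdown("prices.csv", b"item,price\npen,2\n"))
```

Keeping detection and conversion in separate tables means a new format only needs one new function and one registry entry.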

Installation & Setup #

MarkItDown is distributed as a Python package on PyPI, so installation with pip is straightforward. The commands below follow the project's official documentation.

Install via pip (Core Package) #

pip install 'markitdown[all]'

This installs the core MarkItDown package together with all of its optional dependency groups, such as mammoth (DOCX), python-pptx (PPTX), openpyxl (XLSX), and pdfminer.six (PDF), so that every built-in converter is available.

Verify Installation #

markitdown --version

A successful installation will print the current version number, such as markitdown, version 0.0.1a2.
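The same check can be made from inside Python, which is handy in setup scripts. The sketch below uses only the standard library and assumes the distribution name is `markitdown`, matching the pip command above; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str = "markitdown"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "markitdown is not installed in this environment")
```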

Install Selective Dependencies #

For environments where you only need specific format support:

pip install 'markitdown[pdf, docx, pptx]'

This installs only the dependencies needed for PDF, DOCX, and PPTX conversion, keeping the installation lightweight.
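With a selective install it can be useful to confirm which formats are actually usable before a batch job starts. The sketch below probes for importable modules with the standard library. The module names in `FORMAT_MODULES` are assumptions about which library backs each format and may differ between MarkItDown versions; `available_formats` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from format to the top-level module that backs it.
FORMAT_MODULES = {"pdf": "pdfminer", "docx": "mammoth", "pptx": "pptx"}


def available_formats(format_modules=FORMAT_MODULES):
    """Report, per format, whether its backing module can be imported."""
    return {
        fmt: find_spec(module) is not None
        for fmt, module in format_modules.items()
    }


for fmt, ok in available_formats().items():
    print(f"{fmt}: {'available' if ok else 'missing'}")
```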

Install Plugins #

MarkItDown supports plugin extensions for additional functionality:

markitdown --list-plugins
markitdown --use-plugins path-to-file.pdf

For OCR support on scanned images:

pip install markitdown-ocr

For Azure Document Intelligence integration:

pip install 'markitdown[az-doc-intel]'

Install from Source #

git clone git@github.com:microsoft/markitdown.git && cd markitdown && pip install -e 'packages/markitdown[all]'

Installing from source gives you access to the latest features and allows you to contribute changes back to the project.

Basic Usage Examples #

Convert a PDF to Markdown #

markitdown path-to-file.pdf > document.md

MarkItDown reads the PDF and writes Markdown to stdout, and the shell redirection (>) saves that output to document.md. The conversion preserves headings, lists, tables, and hyperlinks.

Convert with Explicit Output File #

markitdown path-to-file.pdf -o document.md

Using the -o flag, you specify the output file directly without shell redirection. This is useful in scripts where the output path may be dynamic.

Pipe Through Standard Input #

cat path-to-file.pdf | markitdown

MarkItDown can read from stdin, enabling creative pipeline compositions. For example, you can download a file and convert it in a single command:

curl -sL https://example.com/document.pdf | markitdown

Convert a Word Document #

markitdown report.docx > report.md

Word documents are converted with full formatting — headings, bold, italic, lists, tables, and embedded images are all preserved in the Markdown output.

Convert a PowerPoint Presentation #

markitdown presentation.pptx > slides.md

Each slide in the presentation is converted into a separate Markdown section with slide title, content, and speaker notes.
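Downstream code often needs the slides one at a time, for example to index each slide separately. The sketch below splits converted output on a per-slide marker. It assumes each slide is introduced by an HTML comment of the form `<!-- Slide number: N -->`; check the output of your installed version before relying on that. `split_slides` is a hypothetical helper.

```python
import re

# Assumed per-slide marker in the converted Markdown.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map each slide number to the Markdown that follows its marker."""
    # With one capture group, re.split yields
    # [preamble, "1", body1, "2", body2, ...].
    parts = SLIDE_MARKER.split(markdown)
    return {
        int(number): body.strip()
        for number, body in zip(parts[1::2], parts[2::2])
    }


sample = (
    "<!-- Slide number: 1 -->\n# Title\nIntro\n\n"
    "<!-- Slide number: 2 -->\n# Agenda\n- a\n"
)
for number, body in split_slides(sample).items():
    print(number, repr(body))
```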

Convert an Excel Spreadsheet #

markitdown data.xlsx > data.md

Tables in spreadsheets are converted to Markdown table format, with each sheet getting its own section.
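If you need the cell values back as data, the Markdown tables can be parsed again. The sketch below assumes each sheet appears as a `## SheetName` heading followed by a pipe-delimited table, which is an assumption about the output layout. `parse_sheets` is a hypothetical helper, and it does not handle escaped pipe characters inside cells.

```python
def parse_sheets(markdown: str) -> dict[str, list[list[str]]]:
    """Collect the table rows found under each '## ' sheet heading."""
    sheets: dict[str, list[list[str]]] = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None and line.startswith("|"):
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            # Skip the header separator row, e.g. "| --- | :---: |".
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            sheets[current].append(cells)
    return sheets


example = (
    "## Sheet1\n| name | qty |\n| --- | --- |\n| apple | 3 |\n\n"
    "## Sheet2\n| x |\n| --- |\n| 1 |\n"
)
print(parse_sheets(example))
```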

Convert an Image (OCR) #

markitdown scan.png > scan.md

For OCR to work, you need Tesseract installed on your system and the markitdown-ocr plugin:

sudo apt-get install tesseract-ocr
pip install markitdown-ocr

Process an Entire Directory #

markitdown ./documents/ -o ./output/

This recursively processes all supported files in the documents directory and saves the Markdown output to the output directory.
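When you need more control over which files are picked up and where the results land, the same directory walk can be done in Python. The sketch below is one possible approach and mirrors the source tree under the output directory. `SUPPORTED`, `plan_outputs`, and `convert_all` are hypothetical names, and the conversion itself is passed in as a callable so the planning logic can be tested without MarkItDown installed.

```python
from pathlib import Path

# Hypothetical set of extensions to pick up; adjust to your needs.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def plan_outputs(src_dir, out_dir, extensions=SUPPORTED):
    """Pair every supported file under src_dir with a mirrored .md path."""
    src, out = Path(src_dir), Path(out_dir)
    plan = []
    for path in sorted(src.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            target = out / path.relative_to(src).with_suffix(".md")
            plan.append((path, target))
    return plan


def convert_all(plan, convert):
    """Call convert(path) -> str for each planned file and write the result."""
    for source, target in plan:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(source), encoding="utf-8")
```

With MarkItDown installed, `convert` could be `lambda p: md.convert(str(p)).text_content` for an `md = markitdown.MarkItDown()` instance. Note that two inputs differing only in extension (report.pdf and report.docx) map to the same report.md, so the later one overwrites the earlier.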

Using MarkItDown as a Python Library #

Beyond the CLI, MarkItDown provides a clean Python API for integrating into your applications.

Basic Python Usage #

import markitdown

md = markitdown.MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Converting from a File Object #

import markitdown

md = markitdown.MarkItDown()
with open("report.docx", "rb") as f:
    result = md.convert(f)
    print(result.text_content)

Accessing Metadata #

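import markitdown

md = markitdown.MarkItDown()
result = md.convert("document.pdf")
# The result carries the document title (None when no title was detected)
# alongside the converted Markdown text.
print(result.title)
print(result.text_content)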

Batch Processing with Python #

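import glob
import os

import markitdown

md = markitdown.MarkItDown()
# Find every PDF under docs/, including nested directories.
files = glob.glob("docs/**/*.pdf", recursive=True)
for filepath in files:
    result = md.convert(filepath)
    # Write the Markdown next to the source file, swapping the extension.
    output_path = os.path.splitext(filepath)[0] + ".md"
    # Use an explicit encoding so non-ASCII text is written consistently.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    print(f"Converted: {filepath} -> {output_path}")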

Customizing the Converter #

import markitdown

# enable_plugins controls whether installed third-party plugins are loaded.
md = markitdown.MarkItDown(enable_plugins=False)
result = md.convert("document.pdf")
print(result.text_content)

Integration with AI Pipelines #

RAG Pipeline Integration #

One of the most powerful use cases for MarkItDown is preparing documents for Retrieval-Augmented Generation pipelines. Here is a complete example:

import markitdown
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_documents(directory):
    md = markitdown.MarkItDown()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    documents = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if os.path.isfile(filepath):
            result = md.convert(filepath)
            if result:
                chunks = splitter.split_text(result.text_content)
                for i, chunk in enumerate(chunks):
                    documents.append({
                        "source": filename,
                        "chunk_index": i,
                        "content": chunk
                    })
    return documents

docs = ingest_documents("./knowledge_base")
print(f"Processed {len(docs)} document chunks")
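Because the converted text is Markdown, chunks can also follow the document's own headings instead of fixed character windows. The sketch below is a dependency-free alternative to the splitter used above, not a replacement recommended by either project. `split_by_headings` is a hypothetical helper, and it does not special-case `#` lines inside code blocks.

```python
import re

# Matches the start of any ATX heading line ("# " through "###### ").
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_headings(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split at heading lines, then hard-wrap sections longer than max_chars."""
    starts = [match.start() for match in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    ends = starts[1:] + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(starts, ends)]
    chunks = []
    for section in filter(None, sections):
        for i in range(0, len(section), max_chars):
            chunks.append(section[i:i + max_chars])
    return chunks


text = "intro\n\n# A\nalpha\n\n## B\n" + "x" * 25
for chunk in split_by_headings(text, max_chars=20):
    print(repr(chunk))
```

Each chunk then starts with its own heading where one exists, which tends to give retrieved passages more context than an arbitrary cut.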

Automated Document Processing Script #

#!/bin/bash
# process_uploads.sh — Process all uploaded documents daily

MARKDOWN_DIR="/var/markdown"
UPLOAD_DIR="/var/uploads"

mkdir -p "$MARKDOWN_DIR"

for file in "$UPLOAD_DIR"/*.pdf "$UPLOAD_DIR"/*.docx "$UPLOAD_DIR"/*.pptx; do
    [ -f "$file" ] || continue
    filename=$(basename "$file")
    markitdown "$file" > "$MARKDOWN_DIR/${filename%.*}.md"
    echo "Converted: $file"
done

AI Agent Document Ingestion #

import markitdown

def prepare_document_for_llm(filepath, max_tokens=4000):
    md = markitdown.MarkItDown()
    result = md.convert(filepath)

    if result:
        content = result.text_content[:max_tokens * 4]
        return {
            "status": "success",
            "content": content,
            "tokens_estimated": len(content) // 4,
            "format": "markdown"
        }
    return {"status": "error", "message": "Conversion failed"}
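The slice in the function above can cut a sentence or table row in half. A small refinement, sketched below, is to back up to the last paragraph break that fits in the budget. It keeps the same rough estimate of four characters per token, which is a heuristic rather than a tokenizer; `truncate_at_paragraph` is a hypothetical helper.

```python
def truncate_at_paragraph(text: str, max_tokens: int = 4000,
                          chars_per_token: int = 4) -> str:
    """Trim text to the character budget, preferring a paragraph boundary."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    # Look for the last blank line that lies entirely within the budget.
    cut = text.rfind("\n\n", 0, budget)
    # Fall back to a hard cut when no paragraph break fits.
    return text[:cut] if cut > 0 else text[:budget]


sample_text = "para one\n\npara two\n\npara three"
print(repr(truncate_at_paragraph(sample_text, max_tokens=5)))
```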

Benchmarks & Real-World Use Cases #

Conversion Speed by Format #
