markitdown: Convert Any Document to Markdown for LLMs
Your RAG pipeline's retrieval accuracy lives or dies by what you feed it. A PDF dropped into a context window as raw bytes, or a PPTX file the LLM has never seen before — neither works. What you actually need is clean, structured text that preserves the document's hierarchy while stripping its binary overhead.
That's the problem Microsoft's markitdown solves. Released in late 2024 and now sitting at over 117,000 GitHub stars (v0.1.5, February 2026), it converts PDFs, DOCX, PPTX, XLSX, HTML, CSV, images, audio, and YouTube transcripts into plain Markdown in a single function call. No configuration, no model downloads for basic usage, and a three-line Python API.
Why Document Format Conversion Matters for LLMs
When you build a retrieval-augmented generation system, the quality of your retrieved context determines everything downstream. LLMs don't struggle with length — they struggle with structure. A 30-slide PPTX that gets passed as raw XML bytes means the model is fighting binary noise instead of reasoning about content. A 50-page PDF rendered by a naive text extractor loses table structure, heading hierarchy, and reading order.
The fix is converting documents to Markdown before they enter your pipeline. Markdown preserves structure that LLMs understand natively: # headings signal document sections, pipe tables stay tabular, bullet lists stay lists, and links stay actionable. Effloow Lab tested this directly and confirmed that a 29KB PPTX file converted to 289 bytes of clean Markdown — a 99% size reduction — while retaining all the content an LLM needs to reason about it.
markitdown exists specifically at this preprocessing step. It is not a PDF renderer or a document viewer. It is a document-to-Markdown converter for LLM pipelines.
What markitdown Actually Does
markitdown takes a file path (or URL, or stream) and returns a DocumentConverterResult with two fields:
- text_content — the Markdown string
- title — the document title, when available (extracted from HTML <head> or document metadata)
Under the hood, each format has a dedicated converter:
- HTML: uses markdownify to walk the DOM and convert tags to Markdown equivalents
- PPTX: reads slide XML via python-pptx and iterates slides
- XLSX: reads sheets via openpyxl and renders them as pipe tables
- DOCX: uses mammoth to convert Word paragraph styles to Markdown headings and inline formatting
- PDF: uses pdfplumber (which wraps pdfminer.six) for text-layer extraction
- Images: EXIF metadata extraction, optional OCR via Azure AI Document Intelligence
- Audio: transcription via speechrecognition
- YouTube: transcript fetching via youtube-transcript-api
No machine-learning models are downloaded for core usage. The [all] extras bundle the format-specific libraries, but even without them, HTML and plain-text formats work immediately after install.
Installation
Base install — HTML, plain text, CSV, and URL fetching:
pip install markitdown
Full install — all formats including PDF, DOCX, PPTX, XLSX, audio, and YouTube:
pip install "markitdown[all]"
Selective installs — only what you need:
pip install "markitdown[pdf]" # PDF via pdfplumber
pip install "markitdown[docx]" # DOCX via mammoth
pip install "markitdown[pptx]" # PPTX via python-pptx
pip install "markitdown[xlsx]" # XLSX via openpyxl + xlrd
pip install "markitdown[az-doc-intel]" # Azure AI Document Intelligence
Effloow Lab tested the full install on macOS (Apple Silicon, Python 3.12) and confirmed a clean install in under two minutes.
Format-by-Format Guide
Here is what you get from each format, based on Effloow Lab's sandbox PoC runs.
HTML
HTML conversion is markitdown's strongest mode. The markdownify library handles the DOM traversal:
- <h1>–<h6> → # headings (matching level)
- <table> → GitHub-style pipe tables
- <a href> → [text](url), links fully preserved
- <strong> / <b> → **bold**
- <blockquote> → > quoted text
- <ul> / <li> → * item
- <head><title> → extracted as result.title, not included in the body
Size in our test: 1,091B raw HTML → 709B Markdown (35% smaller).
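To make the mapping concrete, here is a toy converter built on Python's stdlib html.parser that handles just headings, bold, links, and list items. It is an illustration of the tag mapping, not a substitute for markdownify; the names MiniMarkdown and to_markdown are ours, not markitdown APIs.

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter covering a few of the mappings above."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            # <h2> becomes "## ", matching the heading level
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in {"strong", "b"}:
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n* ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.out.append("\n")
        elif tag in {"strong", "b"}:
            self.out.append("**")
        elif tag == "a":
            self.out.append(f"]({self._href})")

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

Feeding it `<h2>API</h2><p>See <a href='https://x.test'>docs</a></p>` yields `## API` followed by `See [docs](https://x.test)`, the same shape markitdown produces for real pages.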
PPTX
Each slide becomes a Markdown block preceded by <!-- Slide number: N -->:
<!-- Slide number: 1 -->
# AI Document Processing Pipeline
Step 1: Ingest documents
Step 2: Convert with markitdown
Step 3: Chunk for RAG
Step 4: Embed and store in vector DB
Speaker notes are not included by default. If your slides rely on notes for context, you lose that. Size reduction in our test: 29,233B → 289B (99% smaller).
XLSX
Each sheet becomes a section with the sheet name as an H2 heading, followed by a pipe table:
## Tool Comparison
| Tool | Stars | License |
| --- | --- | --- |
| markitdown | 117K+ | MIT |
Size: 5,098B → 386B (92% smaller).
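To illustrate the exact shape markitdown emits for a sheet, here is a minimal stdlib sketch; the rows_to_pipe_table helper is ours for illustration, not part of markitdown.

```python
def rows_to_pipe_table(rows: list[list[str]]) -> str:
    """Render a header row plus data rows as a GitHub-style pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        # Separator row: one "---" cell per column
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```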
DOCX
Heading paragraph styles map to Markdown heading levels. Bold runs become **text**. Tables have basic support. Complex layouts may not preserve order correctly, but for standard business documents the output is clean.
PDF
Text-layer PDFs extract cleanly. Complex multi-column layouts may lose reading order because pdfplumber emits text in positional order, not visual flow order. Scanned PDFs produce no useful output without OCR — for those, you need the Azure AI Document Intelligence integration.
JSON
JSON files are returned as-is — the same content, not a Markdown representation. This is a design decision: JSON is already structured text, so markitdown passes it through unchanged. If you need JSON fields converted to Markdown prose, you'll need to handle that yourself.
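If you do need that transformation, a small stdlib helper is enough for flat JSON. This sketch (json_to_markdown is our name, not a markitdown API) renders top-level fields as a Markdown bullet list:

```python
import json

def json_to_markdown(payload: str) -> str:
    """Render top-level JSON fields as Markdown bullets,
    the step markitdown deliberately leaves to you."""
    data = json.loads(payload)
    return "\n".join(f"* **{key}**: {value}" for key, value in data.items())
```

Nested objects and arrays would need recursion; this only covers the flat case.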
Python API
from markitdown import MarkItDown

md = MarkItDown()

# Convert a local file
result = md.convert("quarterly_report.pdf")
print(result.text_content)  # Clean Markdown
print(result.title)         # Document title if available

# Convert a URL
result = md.convert_url("https://example.com/docs/api-reference")
print(result.text_content)

# Convert a stream
with open("presentation.pptx", "rb") as f:
    result = md.convert_stream(f, file_extension=".pptx")
The convert() method accepts file paths, URLs (which it fetches internally), and file-like objects. Format detection uses the file extension and magika (a content-type classifier).
Building a RAG Preprocessing Pipeline
Here is a typical integration pattern where markitdown sits at the ingestion step:
from markitdown import MarkItDown
from pathlib import Path

def ingest_documents(doc_dir: str) -> list[dict]:
    md = MarkItDown()
    documents = []
    for path in Path(doc_dir).rglob("*"):
        if path.suffix in {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv"}:
            result = md.convert(str(path))
            documents.append({
                "source": str(path),
                "title": result.title or path.name,
                "content": result.text_content,
                "format": path.suffix,
            })
    return documents

# Then chunk, embed, and store
docs = ingest_documents("./knowledge-base/")
This pipeline handles heterogeneous document collections without format-specific branching logic. You get Markdown out regardless of whether the input was a PDF, a spreadsheet, or a presentation.
For chunking after conversion, heading boundaries (## Section) work well as natural split points — a pattern that works better than character-count chunking because it preserves semantic units. This is covered in more detail in our vector database guide.
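A minimal heading-based splitter needs only the stdlib. This sketch (chunk_by_headings is our helper, not part of markitdown) cuts converted Markdown right before each ## heading so every chunk stays a semantic unit:

```python
import re

def chunk_by_headings(markdown: str, level: int = 2) -> list[str]:
    """Split Markdown at heading boundaries of the given level,
    keeping each heading with the text that follows it."""
    marker = "#" * level + " "
    # Zero-width split: cut right before each line starting with the marker
    pattern = re.compile(rf"^(?={re.escape(marker)})", flags=re.MULTILINE)
    return [chunk.strip() for chunk in pattern.split(markdown) if chunk.strip()]
```

Anything before the first heading becomes its own chunk, which is usually what you want for document intros.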
MCP Server Mode
Alongside v0.1.5, the project added a Model Context Protocol server, shipped as the companion package markitdown-mcp, which means markitdown can now run as a tool that LLM clients like Claude Desktop discover and invoke directly:
pip install markitdown-mcp
markitdown-mcp
Once running, MCP clients can call the conversion tool without writing any Python code. The document conversion happens server-side, and the client receives clean Markdown. This pairs well with the broader MCP ecosystem — markitdown becomes one more tool in an agent's toolbelt.
markitdown vs the Field
| Library | Best For | PDF Quality | Install Size | License |
|---|---|---|---|---|
| markitdown | LLM preprocessing, simple docs | Text-layer only | ~251MB (all) | MIT |
| Docling (IBM) | Research, exact table preservation | Excellent (TableFormer) | ~1,032MB | MIT |
| Unstructured | Enterprise, mixed document types | 88%+ reliability | ~146MB | Apache 2.0 |
| PyPDF | Simple PDF text only | Basic | Small | BSD |
| Kreuzberg | High-volume, edge environments | Good | ~71MB | MIT |
Unstructured takes the enterprise slot because it holds up better across unusual document layouts. But markitdown wins on simplicity and speed for the common case: well-structured business documents, web pages, spreadsheets, and presentations.
Docling deserves mention for one specific case: if you're extracting data from academic papers, financial tables, or any document where table fidelity is critical, Docling's TableFormer model preserves structure that markitdown loses. The tradeoff is a 1GB+ model download and much slower processing.
Azure AI Document Intelligence Integration
For scanned PDFs or complex image-heavy documents, markitdown integrates with Azure AI Document Intelligence:
from markitdown import MarkItDown

# Requires the [az-doc-intel] extra; credentials resolve via
# azure-identity (DefaultAzureCredential) by default
md = MarkItDown(docintel_endpoint="https://your-resource.cognitiveservices.azure.com/")
result = md.convert("scanned_contract.pdf")
With the endpoint configured, that instance routes conversions through Azure's cloud OCR instead of the local pdfplumber path. Keep a second, plain MarkItDown() instance for documents that don't need OCR, so you only pay Azure API costs for the documents that do.
Common Mistakes
Expecting JSON transformation. JSON files are returned unchanged. markitdown is not a JSON-to-prose converter.
Assuming scanned PDFs work out of the box. They don't. Install the [az-doc-intel] extra and configure a client, or pre-process scans with an OCR tool before passing to markitdown.
Using [all] in production containers. The full install pulls in audio transcription and Azure SDKs you may not need, adding hundreds of megabytes to your container image. Use selective extras: pip install "markitdown[pdf,docx,pptx,xlsx]" for document-only pipelines.
Ignoring speaker notes. PPTX conversion skips speaker notes. If the notes contain important context (common in technical presentations), extract them separately with python-pptx before conversion.
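Notes live inside the .pptx zip under ppt/notesSlides/, so you can pull them with the stdlib alone if you'd rather not add python-pptx. This simplified sketch (extract_speaker_notes is our helper) collects every text run per notes slide; real decks nest runs inside shapes, so treat the ordering as approximate:

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace used for text runs (<a:t>) inside notes slides
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

def extract_speaker_notes(pptx) -> dict[int, str]:
    """Collect speaker-note text per slide from a .pptx (a zip of XML parts)."""
    notes = {}
    with zipfile.ZipFile(pptx) as z:
        for name in z.namelist():
            m = re.fullmatch(r"ppt/notesSlides/notesSlide(\d+)\.xml", name)
            if not m:
                continue
            root = ET.fromstring(z.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            notes[int(m.group(1))] = " ".join(runs)
    return notes
```

Merge the returned dict with markitdown's slide output by slide number to keep notes next to their slides.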
Passing large files directly to the LLM. Even after conversion, a 500-page PDF becomes a very long Markdown document. markitdown converts; chunking is your responsibility. Always chunk after conversion.
FAQ
Q: Does markitdown support streaming for large files?
The convert_stream() method accepts file-like objects, but processing is not incremental — markitdown reads the entire document, converts it, and returns the result. For very large files, this happens in memory. There is no chunk-streaming mode.
Q: Can markitdown handle password-protected PDFs?
No. Password-protected PDFs will raise an exception. You need to decrypt them first (tools like pikepdf can handle this) before passing to markitdown.
Q: Is markitdown thread-safe?
The MarkItDown instance is stateless after initialization, so you can share one instance across threads safely. Each convert() call is independent.
Q: Does the MCP server work with Claude Code?
Yes. Claude Code supports MCP servers. Once the markitdown-mcp server is running, you can add it to your Claude Code MCP config and invoke document conversion as a tool in your sessions.
Q: What happens with embedded images in DOCX or PPTX?
By default, embedded images are skipped entirely — only text content is extracted. With the Azure AI Document Intelligence integration, images in documents can be analyzed and described. Without it, you lose embedded chart content, diagrams, and image-only slides.
Key Takeaways
markitdown is the right tool when your goal is "get clean Markdown out of this document, fast." It ships with a minimal footprint for basic formats, handles the common enterprise document types with [all], and its v0.1.5 MCP server makes it natively accessible to any LLM client that speaks the Model Context Protocol.
The limits are real: JSON is a pass-through, scanned PDFs need Azure help, and complex multi-column layouts may shuffle reading order. For those edge cases, Docling or Unstructured are the right picks.
Effloow Lab ran a full sandbox PoC (see data/lab-runs/microsoft-markitdown-document-processing-llm-guide-2026.md) and confirmed clean installs and conversions for HTML, XLSX, PPTX, DOCX, CSV, and text-layer PDFs on Apple Silicon with Python 3.12.
For teams building RAG pipelines and agentic systems that handle diverse document types, markitdown's three-line API and broad format coverage make it a strong default choice for the ingestion layer. Pair it with a vector database (see our vector database comparison) and observability via Langfuse for a complete preprocessing stack.
markitdown is the fastest path from "document collection" to "LLM-ready Markdown" for standard business formats. Install with pip install "markitdown[all]", pass any file, get clean Markdown. Skip it when you need perfect table fidelity or scanned PDF OCR without Azure — those cases belong to Docling or Unstructured respectively.