markitdown: Convert Any Document to Markdown for LLMs
Your RAG pipeline's retrieval accuracy lives or dies by what you feed it. A PDF dropped into a context window as raw bytes, or a PPTX file the LLM has never seen before — neither works. What you actually need is clean, structured text that preserves the document's hierarchy while stripping its binary overhead.
That's the problem Microsoft's markitdown solves. Released in late 2024 and now sitting at over 117,000 GitHub stars (v0.1.5, February 2026), it converts PDFs, DOCX, PPTX, XLSX, HTML, CSV, images, audio, and YouTube transcripts into plain Markdown in a single function call. No configuration, no model downloads for basic usage, and a three-line Python API.
Why Document Format Conversion Matters for LLMs
When you build a retrieval-augmented generation system, the quality of your retrieved context determines everything downstream. LLMs don't struggle with length — they struggle with structure. A 30-slide PPTX that gets passed as raw XML bytes means the model is fighting binary noise instead of reasoning about content. A 50-page PDF rendered by a naive text extractor loses table structure, heading hierarchy, and reading order.
The fix is converting documents to Markdown before they enter your pipeline. Markdown preserves structure that LLMs understand natively: # headings signal document sections, pipe tables stay tabular, bullet lists stay lists, and links stay actionable. Effloow Lab tested this directly and confirmed that a 29KB PPTX file converted to 289 bytes of clean Markdown — a 99% size reduction — while retaining all the content an LLM needs to reason about it.
markitdown exists specifically at this preprocessing step. It is not a PDF renderer or a document viewer. It is a document-to-Markdown converter for LLM pipelines.
What markitdown Actually Does
markitdown takes a file path (or URL, or stream) and returns a DocumentConverterResult with two fields:
- text_content — the Markdown string
- title — the document title, when available (extracted from HTML <head> or document metadata)
Under the hood, each format has a dedicated converter:
- HTML: uses markdownify to walk the DOM and convert tags to Markdown equivalents
- PPTX: reads slide XML via python-pptx and iterates slides
- XLSX: reads sheets via openpyxl and renders them as pipe tables
- DOCX: uses mammoth to convert Word paragraph styles to Markdown headings and inline formatting
- PDF: uses pdfplumber (which wraps pdfminer.six) for text-layer extraction
- Images: EXIF metadata extraction, optional OCR via Azure AI Document Intelligence
- Audio: transcription via speechrecognition
- YouTube: transcript fetching via youtube-transcript-api
No machine-learning models are downloaded for core usage. The [all] extras bundle the format-specific libraries, but even without them, HTML and plain-text formats work immediately after install.
Installation
Base install — HTML, plain text, CSV, and URL fetching:
pip install markitdown
Full install — all formats including PDF, DOCX, PPTX, XLSX, audio, and YouTube:
pip install "markitdown[all]"
Selective installs — only what you need:
pip install "markitdown[pdf]" # PDF via pdfplumber
pip install "markitdown[docx]" # DOCX via mammoth
pip install "markitdown[pptx]" # PPTX via python-pptx
pip install "markitdown[xlsx]" # XLSX via openpyxl + xlrd
pip install "markitdown[az-doc-intel]" # Azure AI Document Intelligence
Effloow Lab tested the full install on macOS (Apple Silicon, Python 3.12) and confirmed a clean install in under two minutes.
Format-by-Format Guide
Here is what you get from each format, based on Effloow Lab's sandbox PoC runs.
HTML
HTML conversion is markitdown's strongest mode. The markdownify library handles the DOM traversal:
- <h1>–<h6> → # headings (matching level)
- <table> → GitHub-style pipe tables
- <a href> → [text](url), links fully preserved
- <strong> / <b> → **bold**
- <blockquote> → > quoted text
- <ul> / <li> → * item
- <head><title> → extracted as result.title, not included in the body
Size in our test: 1,091B raw HTML → 709B Markdown (35% smaller).
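To make the mapping concrete, here is a toy converter built on Python's stdlib html.parser that handles just headings, bold, links, and list items. It is an illustration of the tag mapping, not a substitute for markdownify; the names MiniMarkdown and to_markdown are ours, not markitdown APIs.

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter covering a few of the mappings above."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            # <h2> becomes "## ", matching the heading level
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in {"strong", "b"}:
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n* ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.out.append("\n")
        elif tag in {"strong", "b"}:
            self.out.append("**")
        elif tag == "a":
            self.out.append(f"]({self._href})")

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

Feeding it `<h2>API</h2><p>See <a href='https://x.test'>docs</a></p>` yields `## API` followed by `See [docs](https://x.test)`, the same shape markitdown produces for real pages.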
PPTX
Each slide becomes a Markdown block preceded by <!-- Slide number: N -->:
<!-- Slide number: 1 -->
# AI Document Processing Pipeline
Step 1: Ingest documents
Step 2: Convert with markitdown
Step 3: Chunk for RAG
Step 4: Embed and store in vector DB
Speaker notes are not included by default. If your slides rely on notes for context, you lose that. Size reduction in our test: 29,233B → 289B (99% smaller).
XLSX
Each sheet becomes a section with the sheet name as an H2 heading, followed by a pipe table:
## Tool Comparison
| Tool | Stars | License |
| --- | --- | --- |
| markitdown | 117K+ | MIT |
Size: 5,098B → 386B (92% smaller).
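To illustrate the exact shape markitdown emits for a sheet, here is a minimal stdlib sketch; the rows_to_pipe_table helper is ours for illustration, not part of markitdown.

```python
def rows_to_pipe_table(rows: list[list[str]]) -> str:
    """Render a header row plus data rows as a GitHub-style pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        # Separator row: one "---" cell per column
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```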
DOCX
Heading paragraph styles map to Markdown heading levels. Bold runs become **text**. Tables have basic support. Complex layouts may not preserve order correctly, but for standard business documents the output is clean.
PDF
Text-layer PDFs extract cleanly. Complex multi-column layouts may lose reading order because pdfplumber emits text in positional order, not visual flow order. Scanned PDFs produce no useful output without OCR — for those, you need the Azure AI Document Intelligence integration.
JSON
JSON files are returned as-is — the same content, not a Markdown representation. This is a design decision: JSON is already structured text, so markitdown passes it through unchanged. If you need JSON fields converted to Markdown prose, you'll need to handle that yourself.
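If you do need that transformation, a small stdlib helper is enough for flat JSON. This sketch (json_to_markdown is our name, not a markitdown API) renders top-level fields as a Markdown bullet list:

```python
import json

def json_to_markdown(payload: str) -> str:
    """Render top-level JSON fields as Markdown bullets,
    the step markitdown deliberately leaves to you."""
    data = json.loads(payload)
    return "\n".join(f"* **{key}**: {value}" for key, value in data.items())
```

Nested objects and arrays would need recursion; this only covers the flat case.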
Python API
from markitdown import MarkItDown

md = MarkItDown()

# Convert a local file
result = md.convert("quarterly_report.pdf")
print(result.text_content)  # Clean Markdown
print(result.title)         # Document title if available

# Convert a URL
result = md.convert_url("https://example.com/docs/api-reference")
print(result.text_content)

# Convert a stream
with open("presentation.pptx", "rb") as f:
    result = md.convert_stream(f, file_extension=".pptx")
The convert() method accepts file paths, URLs (which it fetches internally), and file-like objects. Format detection uses the file extension and magika (a content-type classifier).
Building a RAG Preprocessing Pipeline
Here is a typical integration pattern where markitdown sits at the ingestion step:
from markitdown import MarkItDown
from pathlib import Path

def ingest_documents(doc_dir: str) -> list[dict]:
    md = MarkItDown()
    documents = []
    for path in Path(doc_dir).rglob("*"):
        if path.suffix in {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".csv"}:
            result = md.convert(str(path))
            documents.append({
                "source": str(path),
                "title": result.title or path.name,
                "content": result.text_content,
                "format": path.suffix,
            })
    return documents

# Then chunk, embed, and store
docs = ingest_documents("./knowledge-base/")
This pipeline handles heterogeneous document collections without format-specific branching logic. You get Markdown out regardless of whether the input was a PDF, a spreadsheet, or a presentation.
For chunking after conversion, heading boundaries (## Section) work well as natural split points — a pattern that works better than character-count chunking because it preserves semantic units. This is covered in more detail in our vector database guide.
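A minimal heading-based splitter needs only the stdlib. This sketch (chunk_by_headings is our helper, not part of markitdown) cuts converted Markdown right before each ## heading so every chunk stays a semantic unit:

```python
import re

def chunk_by_headings(markdown: str, level: int = 2) -> list[str]:
    """Split Markdown at heading boundaries of the given level,
    keeping each heading with the text that follows it."""
    marker = "#" * level + " "
    # Zero-width split: cut right before each line starting with the marker
    pattern = re.compile(rf"^(?={re.escape(marker)})", flags=re.MULTILINE)
    return [chunk.strip() for chunk in pattern.split(markdown) if chunk.strip()]
```

Anything before the first heading becomes its own chunk, which is usually what you want for document intros.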
MCP Server Mode
Alongside v0.1.5, the project added a Model Context Protocol server, shipped as the companion package markitdown-mcp, which means markitdown can now run as a tool that LLM clients like Claude Desktop discover and invoke directly:
pip install markitdown-mcp
markitdown-mcp
Once running, MCP clients can call the conversion tool without writing any Python code. The document conversion happens server-side, and the client receives clean Markdown. This pairs well with the broader MCP ecosystem — markitdown becomes one more tool in an agent's toolbelt.
markitdown vs the Field
| Library | Best For | PDF Quality | Install Size | License |
|---|---|---|---|---|
| markitdown | LLM preprocessing, simple docs | Text-layer only | ~251MB (all) | MIT |
| Docling (IBM) | Research, exact table preservation | Excellent (TableFormer) | ~1,032MB | MIT |
| Unstructured | Enterprise, mixed document types | 88%+ reliability | ~146MB | Apache 2.0 |
| PyPDF | Simple PDF text only | Basic | Small | BSD |
| Kreuzberg | High-volume, edge environments | Good | ~71MB | MIT |
Unstructured takes the enterprise slot because it holds up better across unusual document layouts. But markitdown wins on simplicity and speed for the common case: well-structured business documents, web pages, spreadsheets, and presentations.
Docling deserves mention for one specific case: if you're extracting data from academic papers, financial tables, or any document where table fidelity is critical, Docling's TableFormer model preserves structure that markitdown loses. The tradeoff is a 1GB+ model download and much slower processing.
Azure AI Document Intelligence Integration
For scanned PDFs or complex image-heavy documents, markitdown integrates with Azure AI Document Intelligence:
from markitdown import MarkItDown

# Requires the [az-doc-intel] extra; credentials resolve via
# azure-identity (DefaultAzureCredential) by default
md = MarkItDown(docintel_endpoint="https://your-resource.cognitiveservices.azure.com/")
result = md.convert("scanned_contract.pdf")
With the endpoint configured, that instance routes conversions through Azure's cloud OCR instead of the local pdfplumber path. Keep a second, plain MarkItDown() instance for documents that don't need OCR, so you only pay Azure API costs for the documents that do.
Common Mistakes
Expecting JSON transformation. JSON files are returned unchanged. markitdown is not a JSON-to-prose converter.
Assuming scanned PDFs work out of the box. They don't. Install the [az-doc-intel] extra and configure a client, or pre-process scans with an OCR tool before passing to markitdown.
Using [all] in production containers. The full install pulls in audio transcription and Azure SDKs you may not need, adding hundreds of megabytes to your container image. Use selective extras: pip install "markitdown[pdf,docx,pptx,xlsx]" for document-only pipelines.
Ignoring speaker notes. PPTX conversion skips speaker notes. If the notes contain important context (common in technical presentations), extract them separately with python-pptx before conversion.
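Notes live inside the .pptx zip under ppt/notesSlides/, so you can pull them with the stdlib alone if you'd rather not add python-pptx. This simplified sketch (extract_speaker_notes is our helper) collects every text run per notes slide; real decks nest runs inside shapes, so treat the ordering as approximate:

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace used for text runs (<a:t>) inside notes slides
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

def extract_speaker_notes(pptx) -> dict[int, str]:
    """Collect speaker-note text per slide from a .pptx (a zip of XML parts)."""
    notes = {}
    with zipfile.ZipFile(pptx) as z:
        for name in z.namelist():
            m = re.fullmatch(r"ppt/notesSlides/notesSlide(\d+)\.xml", name)
            if not m:
                continue
            root = ET.fromstring(z.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            notes[int(m.group(1))] = " ".join(runs)
    return notes
```

Merge the returned dict with markitdown's slide output by slide number to keep notes next to their slides.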
Passing large files directly to the LLM. Even after conversion, a 500-page PDF becomes a very long Markdown document. markitdown converts; chunking is your responsibility. Always chunk after conversion.
FAQ
Q: Does markitdown support streaming for large files?
The convert_stream() method accepts file-like objects, but processing is not incremental — markitdown reads the entire document, converts it, and returns the result. For very large files, this happens in memory. There is no chunk-streaming mode.
Q: Can markitdown handle password-protected PDFs?
No. Password-protected PDFs will raise an exception. You need to decrypt them first (tools like pikepdf can handle this) before passing to markitdown.
Q: Is markitdown thread-safe?
The MarkItDown instance is stateless after initialization, so you can share one instance across threads safely. Each convert() call is independent.
Q: Does the MCP server work with Claude Code?
Yes. Claude Code supports MCP servers. Once the markitdown-mcp server is running, you can add it to your Claude Code MCP config and invoke document conversion as a tool in your sessions.
Q: What happens with embedded images in DOCX or PPTX?
By default, embedded images are skipped entirely — only text content is extracted. With the Azure AI Document Intelligence integration, images in documents can be analyzed and described. Without it, you lose embedded chart content, diagrams, and image-only slides.
Key Takeaways
markitdown is the right tool when your goal is "get clean Markdown out of this document, fast." It ships with a minimal footprint for basic formats, handles the common enterprise document types with [all], and its v0.1.5 MCP server makes it natively accessible to any LLM client that speaks the Model Context Protocol.
The limits are real: JSON is a pass-through, scanned PDFs need Azure help, and complex multi-column layouts may shuffle reading order. For those edge cases, Docling or Unstructured are the right picks.
Effloow Lab ran a full sandbox PoC (see data/lab-runs/microsoft-markitdown-document-processing-llm-guide-2026.md) and confirmed clean installs and conversions for HTML, XLSX, PPTX, DOCX, CSV, and text-layer PDFs on Apple Silicon with Python 3.12.
For teams building RAG pipelines and agentic systems that handle diverse document types, markitdown's three-line API and broad format coverage make it a strong default choice for the ingestion layer. Pair it with a vector database (see our vector database comparison) and observability via Langfuse for a complete preprocessing stack.
markitdown is the fastest path from "document collection" to "LLM-ready Markdown" for standard business formats. Install with pip install "markitdown[all]", pass any file, get clean Markdown. Skip it when you need perfect table fidelity or scanned PDF OCR without Azure — those cases belong to Docling or Unstructured respectively.