Microsoft Markitdown Document Processing Llm Guide 2026
Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.
Lab Run: microsoft/markitdown
Date: 2026-04-27
Track: sandbox-poc
Slug: microsoft-markitdown-document-processing-llm-guide-2026
Environment: macOS Darwin 24.6.0, Python 3.12, Apple Silicon
Install
pip3 install markitdown
# => Successfully installed markitdown-0.1.5
# Dependencies: beautifulsoup4, charset-normalizer, defusedxml, magika, markdownify, requests
pip3 install "markitdown[all]"
# Adds: mammoth (DOCX), pdfplumber (PDF), python-pptx (PPTX),
# speechrecognition (audio), youtube-transcript-api (YouTube),
# azure-ai-documentintelligence (AI-enhanced OCR), xlrd, olefile
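To confirm which optional converters are actually importable after the install, a small check along these lines may help. The import names in `OPTIONAL_MODULES` are assumptions based on each package's usual module name (for example, python-pptx is normally imported as `pptx`); the lab run did not record them.

```python
from importlib.util import find_spec

# Assumed import names for the extras listed above; adjust if your versions differ.
OPTIONAL_MODULES = {
    "DOCX": "mammoth",
    "PDF": "pdfplumber",
    "PPTX": "pptx",
    "audio": "speech_recognition",
    "YouTube": "youtube_transcript_api",
}


def available_extras(modules=OPTIONAL_MODULES):
    """Map each label to True if its module can be located (nothing is imported)."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, ok in available_extras().items():
        print(f"{label}: {'installed' if ok else 'missing'}")
```

A "missing" entry only means the module was not found in the current environment; it says nothing about whether the matching converter would succeed on a given file.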
Format Tests
HTML → Markdown
python3 -m markitdown test.html
Result: SUCCESS
- `<h1>` → `#`, `<h2>` → `##`
- `<table>` → GitHub Flavored Markdown pipe table
- `<strong>` → `**bold**`
- `<a href>` → `[text](url)`, links fully preserved
- `<blockquote>` → `> quote`
- `<head><title>` extracted as `result.title` (not in output body)
- Size: 1,091B raw → 709B markdown (35% smaller)
CSV → Markdown
Result: SUCCESS
- Converts to pipe-table format automatically
- Size: 392B markdown from 5-column CSV
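For readers unfamiliar with the pipe-table shape, the sketch below builds one from CSV text using only the standard library. It illustrates the output format only; it is not markitdown's implementation, and the function name `csv_to_pipe_table` is hypothetical.

```python
import csv
import io


def _row(cells):
    # Escape pipes so cell text cannot break the table layout.
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def csv_to_pipe_table(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [_row(header), _row(["---"] * len(header))]
    for row in body:
        # Pad or trim each row to the header width.
        lines.append(_row((row + [""] * len(header))[: len(header)]))
    return "\n".join(lines)


print(csv_to_pipe_table("name,qty\nwidget,3\n"))
```

The separator and padding rows add characters, which is consistent with the Markdown output being larger than a small CSV input.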
XLSX → Markdown (requires markitdown[all])
Result: SUCCESS
- Sheet name becomes a `## Sheet Name` H2 header
- Data rows as pipe table
- Size: 5,098B → 386B (92% smaller)
PPTX → Markdown (requires markitdown[all])
Result: SUCCESS
- Each slide gets a `<!-- Slide number: N -->` HTML comment
- Slide title → `#` heading
- Bullet content as plain text
- Speaker notes: NOT included by default
- Size: 29,233B → 289B (99% smaller)
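Because each slide is introduced by a `<!-- Slide number: N -->` comment, the converted text can be split back into per-slide pieces. The following is a minimal sketch that assumes the marker appears exactly in that form; the helper name `split_slides` is hypothetical and not part of markitdown.

```python
import re

# Matches the slide marker comment observed in the PPTX output.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Return {slide_number: slide_markdown}, using the slide comments as boundaries."""
    # With one capture group, re.split yields:
    # [text_before_first_marker, "1", body_1, "2", body_2, ...]
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides


sample = (
    "<!-- Slide number: 1 -->\n# Intro\nWelcome\n\n"
    "<!-- Slide number: 2 -->\n# Agenda\nTopics\n"
)
print(split_slides(sample))
```

Keeping the slide number alongside each piece lets a later retrieval step cite the slide a passage came from.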
DOCX → Markdown (requires markitdown[all])
Result: SUCCESS
- Heading styles preserved (H1, H2, etc.)
- Bold runs → `**text**`
- Paragraphs separated by blank lines
- Complex tables: basic support
PDF → Markdown (requires markitdown[all])
Result: PARTIAL
- Text-layer PDFs: content extracted
- Complex multi-column PDFs: may lose reading order
- Scanned PDFs: requires Azure AI Document Intelligence for accuracy
- No header/title metadata extracted
JSON
Result: PASS-THROUGH (not transformed)
- JSON files returned as-is (same JSON string, pretty-printed)
- Not converted to Markdown structure
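If a Markdown rendering of JSON is needed, it has to be produced separately. One possible layout is a nested bullet list, sketched below with the standard library only; the function `json_to_markdown` and the chosen layout are illustrative assumptions, not markitdown behavior.

```python
import json


def json_to_markdown(value, indent=0):
    """Render parsed JSON as a list of Markdown bullet lines (one possible layout)."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}- **{key}**:")
                lines.extend(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- **{key}**: {json.dumps(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- {json.dumps(item)}")
    else:
        # Top-level scalar: emit it as a single line.
        lines.append(f"{pad}{json.dumps(value)}")
    return lines


doc = json.loads('{"name": "report", "tags": ["a", "b"]}')
print("\n".join(json_to_markdown(doc)))
```

Scalars are written with `json.dumps`, so strings keep their quotes and `null`, `true`, and `false` stay recognizable.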
Python API
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content) # Markdown string
print(result.title) # Document title if available (HTML only)
# URL conversion
result = md.convert_url("https://example.com/page")
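When converting many files, it can help to isolate per-file failures so that one unreadable document does not stop the run. The sketch below takes the conversion step as a callable, so it does not depend on markitdown directly; `convert_folder` and the `SUPPORTED` set are hypothetical names, with the suffixes taken from the formats tested in this note.

```python
from pathlib import Path
from typing import Callable

# Suffixes of the formats exercised in this lab run.
SUPPORTED = {".html", ".csv", ".xlsx", ".pptx", ".docx", ".pdf"}


def convert_folder(
    folder: str, convert: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert every supported file under `folder`.

    Returns (converted, failed): path -> Markdown text, and path -> error message.
    """
    converted, failed = {}, {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            converted[str(path)] = convert(str(path))
        except Exception as exc:  # one bad file should not stop the batch
            failed[str(path)] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```

With markitdown installed, the callable could be `lambda p: md.convert(p).text_content`, where `md = MarkItDown()` as in the snippet above.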
Comprehensive Format Test Results
| Format | Raw Size | Markdown Size | Reduction | Title Extracted |
|---|---|---|---|---|
| HTML | 1,091B | 709B | 35% | Yes |
| CSV | ~150B | 392B | — | No |
| XLSX | 5,098B | 386B | 92% | No |
| PPTX | 29,233B | 289B | 99% | No |
| DOCX | ~8,000B | 294B | ~96% | No |
| PDF | ~3,000B | 102B | ~97% | No |
| JSON | 276B | 276B | 0% | No |
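The Reduction column follows from the two size columns. As a quick check, the arithmetic can be reproduced as below, using the exact byte counts from the table; the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(raw_bytes: int, markdown_bytes: int) -> int:
    """Percentage by which the Markdown output is smaller than the raw file, rounded."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return round((1 - markdown_bytes / raw_bytes) * 100)


# (raw size, Markdown size) in bytes, taken from the rows with exact figures.
sizes = {"HTML": (1091, 709), "XLSX": (5098, 386), "PPTX": (29233, 289), "JSON": (276, 276)}
for fmt, (raw, out) in sizes.items():
    print(f"{fmt}: {reduction_percent(raw, out)}%")
# HTML: 35%
# XLSX: 92%
# PPTX: 99%
# JSON: 0%
```

For CSV the Markdown output is larger than the input, so the same formula would give a negative value, which is why that row shows no reduction.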
Limitations Found
- JSON not transformed: returns identical JSON text, not a Markdown representation
- PDF accuracy: text-layer PDFs work, but complex multi-column layouts lose reading order; scanned PDFs need the Azure AI Document Intelligence extra
- PPTX speaker notes: not included in default output
- Image OCR: requires `pytesseract` system install (not included even in `[all]`)
- Audio transcription: requires `speechrecognition` + `ffmpeg` system install
- URL conversion: fetches page as HTML, so dynamic JS-rendered content may be incomplete
- No table-of-contents extraction: DOCX/PDF section hierarchy not always preserved
What Worked Well
- Drop-in CLI (`python3 -m markitdown file.ext`): no config needed
- Python API is simple (3 lines to convert)
- HTML conversion is excellent: tables, links, blockquotes all clean
- XLSX → pipe table is perfect for LLM context
- PPTX slide-by-slide numbering helps LLMs reason about structure
- `convert_url()` handles fetching + conversion in one call
RAG Pipeline Integration Test
from markitdown import MarkItDown
def ingest_document(path: str) -> str:
    # Convert one local file to a Markdown string.
    md = MarkItDown()
    result = md.convert(path)
    return result.text_content
# Works for: .html, .pdf, .docx, .pptx, .xlsx, .csv
# Then: chunk -> embed -> store
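The "chunk" step was not part of the lab run. As a rough illustration of what it might look like, the sketch below splits the converted Markdown at heading lines and caps each piece at a character budget. The function `chunk_markdown` and the `max_chars` default are assumptions, and the heading detection is naive: a `#` comment line inside a code block would be treated as a heading.

```python
import re

# A heading line: 1-6 '#' characters followed by a space, at the start of a line.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at heading lines, then cap each piece at max_chars."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first heading
    bounds = starts + [len(text)]
    chunks = []
    for begin, end in zip(bounds, bounds[1:]):
        section = text[begin:end].strip()
        # Oversized sections are cut into fixed-size windows as a crude fallback.
        for i in range(0, len(section), max_chars):
            piece = section[i : i + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks


print(chunk_markdown("intro\n\n# A\ntext a\n\n## B\ntext b\n"))
```

Splitting at headings keeps each chunk tied to one section of the source document; the fixed-size fallback only applies when a single section exceeds the budget.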
Summary
markitdown 0.1.5 is production-ready for HTML, XLSX, CSV, PPTX, and DOCX conversion. PDF is reliable for text-layer documents. JSON is a pass-through. Effloow Lab confirmed install, conversion, and API integration in one sandbox session. No API keys required for local files.
This note supports the public article and records what was actually checked.