Lab Run: microsoft/markitdown

Date: 2026-04-27
Track: sandbox-poc
Slug: microsoft-markitdown-document-processing-llm-guide-2026
Environment: macOS Darwin 24.6.0, Python 3.12, Apple Silicon

Install

pip3 install markitdown
# => Successfully installed markitdown-0.1.5
# Dependencies: beautifulsoup4, charset-normalizer, defusedxml, magika, markdownify, requests

pip3 install "markitdown[all]"
# Adds: mammoth (DOCX), pdfplumber (PDF), python-pptx (PPTX),
#       speechrecognition (audio), youtube-transcript-api (YouTube),
#       azure-ai-documentintelligence (AI-enhanced OCR), xlrd, olefile

Format Tests

HTML → Markdown

python3 -m markitdown test.html

Result: SUCCESS

<h1> → #, <h2> → ##
<table> → GitHub Flavored Markdown pipe table
<strong> → **bold**
<a href> → [text](url) — links fully preserved
<blockquote> → > quote
<head><title> extracted as result.title (not in output body)
Size: 1,091B raw → 709B markdown (35% smaller)

CSV → Markdown

Result: SUCCESS

Converts to pipe-table format automatically
Size: 392B markdown from 5-column CSV
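
To make the output shape concrete, here is a minimal standard-library sketch of what a CSV to pipe-table conversion produces. It is an illustration only, not markitdown's implementation; `csv_to_pipe_table` is a hypothetical helper name. It also hints at why the Markdown can be larger than the source CSV: every row gains delimiters, and a separator row is added.

```python
import csv
import io


def csv_to_pipe_table(csv_text: str) -> str:
    """Render CSV text as a GitHub-style pipe table (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad or trim each row so every line has as many cells as the header.
        cells = (row + [""] * len(header))[: len(header)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_pipe_table("name,qty\nwidget,3\ngadget,5\n"))
# | name | qty |
# | --- | --- |
# | widget | 3 |
# | gadget | 5 |
```

The sketch does not escape pipe characters inside cells, which a real converter would have to handle.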

XLSX → Markdown (requires `markitdown[all]`)

Result: SUCCESS

Sheet name becomes an H2 header (## Sheet Name)
Data rows as pipe table
Size: 5,098B → 386B (92% smaller)

PPTX → Markdown (requires `markitdown[all]`)

Result: SUCCESS

Each slide gets a `<!-- Slide number: N -->` HTML comment
Slide title → # heading
Bullet content as plain text
Speaker notes: NOT included by default
Size: 29,233B → 289B (99% smaller)
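
Because each slide is preceded by an HTML comment, the converted text can be split back into per-slide sections. The sketch below assumes the marker has the form `<!-- Slide number: N -->`; if the installed version emits a different comment, adjust the pattern. `split_slides` is a hypothetical helper, not part of markitdown.

```python
import re

# Assumed marker format; change the pattern if your output differs.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map slide number -> Markdown text that follows that slide's marker."""
    # With one capturing group, re.split returns
    # [text_before_first_marker, "1", slide_1_text, "2", slide_2_text, ...].
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides


sample = (
    "<!-- Slide number: 1 -->\n# Intro\nHello\n\n"
    "<!-- Slide number: 2 -->\n# Next\nWorld\n"
)
for number, body in split_slides(sample).items():
    print(number, repr(body))
```

Keeping the slide number with each section lets a downstream chunker attach it as metadata.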

DOCX → Markdown (requires `markitdown[all]`)

Result: SUCCESS

Heading styles preserved (H1, H2, etc.)
Bold runs → **text**
Paragraphs separated by blank lines
Complex tables: basic support

PDF → Markdown (requires `markitdown[all]`)

Result: PARTIAL

Text-layer PDFs: content extracted
Complex multi-column PDFs: may lose reading order
Scanned PDFs: requires Azure AI Document Intelligence for accuracy
No header/title metadata extracted

JSON

Result: PASS-THROUGH (not transformed)

JSON files returned as-is (same JSON string, pretty-printed)
Not converted to Markdown structure
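
Since JSON comes back as JSON, a pipeline that wants Markdown structure has to add its own step. One possible approach, sketched with the standard library only, is to render the parsed value as a nested bullet list. `json_to_markdown` is a hypothetical helper and the bullet layout is an arbitrary choice, not something markitdown provides.

```python
import json


def json_to_markdown(value, indent: int = 0) -> str:
    """Render a parsed JSON value as a nested Markdown bullet list."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}- **{key}**:")
                lines.append(json_to_markdown(item, indent + 1))
            else:
                # Scalars and empty containers are shown as JSON literals.
                lines.append(f"{pad}- **{key}**: {json.dumps(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}-")
                lines.append(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- {json.dumps(item)}")
    else:
        lines.append(f"{pad}{json.dumps(value)}")
    return "\n".join(lines)


doc = json.loads('{"name": "demo", "tags": ["a", "b"]}')
print(json_to_markdown(doc))
# - **name**: "demo"
# - **tags**:
#   - "a"
#   - "b"
```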

Python API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)  # Markdown string
print(result.title)          # Document title if available (HTML only)

# URL conversion
result = md.convert_url("https://example.com/page")
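
When converting a batch, one unsupported or damaged file should not abort the whole run. The sketch below wraps the `convert()` call shown above and collects failures separately. `convert_many` and its `convert` parameter are hypothetical names introduced for illustration; the parameter exists so the helper can be tried with a stub before markitdown is installed.

```python
from typing import Callable, Iterable, Optional


def convert_many(
    paths: Iterable[str],
    convert: Optional[Callable] = None,
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each path; return (markdown_by_path, error_by_path)."""
    if convert is None:
        # Imported lazily so the helper can be exercised with a stub converter.
        from markitdown import MarkItDown

        convert = MarkItDown().convert
    converted, failed = {}, {}
    for path in paths:
        try:
            converted[path] = convert(path).text_content
        except Exception as exc:
            # Broad on purpose: record the failure and continue with the batch.
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed


# Usage with the real converter (requires markitdown to be installed):
# converted, failed = convert_many(["report.docx", "slides.pptx"])
```

The file names in the usage comment are placeholders.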

Comprehensive Format Test Results

Format	Raw Size	Markdown Size	Reduction	Title Extracted
HTML	1,091B	709B	35%	Yes
CSV	~150B	392B	—	No
XLSX	5,098B	386B	92%	No
PPTX	29,233B	289B	99%	No
DOCX	~8,000B	294B	~96%	No
PDF	~3,000B	102B	~97%	No
JSON	276B	276B	0%	No
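
The Reduction column is the share of bytes removed, computed as (1 - markdown size / raw size) × 100. A short check of the exact measurements from the table (rows with approximate sizes are left out):

```python
def reduction_pct(raw_bytes: int, markdown_bytes: int) -> float:
    """Percentage of bytes removed by conversion; negative if the output grew."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return (1 - markdown_bytes / raw_bytes) * 100


measurements = {
    "HTML": (1091, 709),
    "XLSX": (5098, 386),
    "PPTX": (29233, 289),
    "JSON": (276, 276),
}
for fmt, (raw, md) in measurements.items():
    print(f"{fmt}: {reduction_pct(raw, md):.0f}%")
# HTML: 35%
# XLSX: 92%
# PPTX: 99%
# JSON: 0%
```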

Limitations Found

JSON not transformed — returns identical JSON text, not a Markdown representation
PDF accuracy — text-layer PDFs work, but complex multi-column layouts lose reading order; scanned PDFs need Azure AI Document Intelligence extra
PPTX speaker notes — not included in default output
Image OCR — requires pytesseract system install (not included even in [all])
Audio transcription — requires speechrecognition + ffmpeg system install
URL conversion — fetches page as HTML, so dynamic JS-rendered content may be incomplete
No table-of-contents extraction — DOCX/PDF section hierarchy not always preserved
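
Several of these limitations come down to dependencies that live outside the pip install. A small pre-flight check, sketched with the standard library, can report what is missing before a batch run. The names checked here follow the list above; `speech_recognition` is the import name of the speechrecognition package, and `check_optional_deps` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_optional_deps() -> dict[str, bool]:
    """Report whether the optional tools named in the limitations are present."""
    return {
        # System binary looked up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Python modules looked up without importing them.
        "speech_recognition": importlib.util.find_spec("speech_recognition") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    }


for name, present in check_optional_deps().items():
    print(f"{name}: {'found' if present else 'missing'}")
```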

What Worked Well

Drop-in CLI (python3 -m markitdown file.ext) — no config needed
Python API is simple (3 lines to convert)
HTML conversion is excellent — tables, links, blockquotes all clean
XLSX → pipe table is perfect for LLM context
PPTX slide-by-slide numbering helps LLMs reason about structure
convert_url() handles fetching + conversion in one call

RAG Pipeline Integration Test

from markitdown import MarkItDown

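# Build the converter once and reuse it for every document.
converter = MarkItDown()

def ingest_document(path: str) -> str:
    """Convert one local file and return its Markdown text."""
    result = converter.convert(path)
    return result.text_content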

# Works for: .html, .pdf, .docx, .pptx, .xlsx, .csv
# Then: chunk -> embed -> store
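
The chunk step is not specified in this run. One possible approach, sketched below, splits the Markdown at heading lines and packs consecutive sections into chunks up to a size limit. `chunk_markdown` and the `max_chars` default are assumptions for illustration; embedding and storage are left out.

```python
import re

# Lines that start with 1-6 '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split at headings, then pack sections into chunks of up to max_chars."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    ends = starts[1:] + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(starts, ends)]

    chunks, current = [], ""
    for section in filter(None, sections):
        if current and len(current) + len(section) + 2 > max_chars:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n\n{section}" if current else section
    if current:
        chunks.append(current)
    return chunks


sample = "# A\nalpha\n\n## B\nbeta\n\n## C\ngamma\n"
print(chunk_markdown(sample, max_chars=20))
# ['# A\nalpha\n\n## B\nbeta', '## C\ngamma']
```

A section longer than `max_chars` is kept whole rather than cut mid-section, and `#` lines inside code blocks are treated as headings, so a production chunker would need more care.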

Summary

markitdown 0.1.5 is production-ready for HTML, XLSX, CSV, PPTX, and DOCX conversion. PDF is reliable for text-layer documents with simple layouts; multi-column and scanned PDFs remain partial. JSON is a pass-through. Effloow Lab confirmed install, conversion, and API integration in one sandbox session. No API keys are required for local files.
