Microsoft MarkItDown Document Processing LLM Guide 2026

Evidence notes document the bounded local or source-based checks behind an Effloow article. They are not product endorsements, legal advice, or benchmark claims.

Lab Run: microsoft/markitdown

Date: 2026-04-27
Track: sandbox-poc
Slug: microsoft-markitdown-document-processing-llm-guide-2026
Environment: macOS Darwin 24.6.0, Python 3.12, Apple Silicon

Install

pip3 install markitdown
# => Successfully installed markitdown-0.1.5
# Dependencies: beautifulsoup4, charset-normalizer, defusedxml, magika, markdownify, requests

pip3 install "markitdown[all]"
# Adds: mammoth (DOCX), pdfplumber (PDF), python-pptx (PPTX),
#       speechrecognition (audio), youtube-transcript-api (YouTube),
#       azure-ai-documentintelligence (AI-enhanced OCR), xlrd, olefile

Format Tests

HTML → Markdown

python3 -m markitdown test.html

Result: SUCCESS

  • <h1> → #, <h2> → ##
  • <table> → GitHub Flavored Markdown pipe table
  • <strong> → **bold**
  • <a href> → [text](url), links fully preserved
  • <blockquote> → > quote
  • <head><title> extracted as result.title (not in output body)
  • Size: 1,091B raw → 709B markdown (35% smaller)
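
Because the <title> text lands in result.title rather than in the body, a pipeline that wants the title in the LLM context has to add it back. A minimal sketch, assuming a hypothetical helper named with_title (not part of markitdown) that accepts any object exposing title and text_content:

```python
def with_title(result) -> str:
    """Prepend result.title as an H1 unless it is missing or already the first line.

    Hypothetical helper, not part of markitdown. `result` is any object
    with `title` and `text_content` attributes.
    """
    title = (getattr(result, "title", None) or "").strip()
    body = result.text_content.lstrip()
    if not title or body.startswith(f"# {title}"):
        return body
    return f"# {title}\n\n{body}"
```

For formats where no title was extracted in this run, the body comes back unchanged.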

CSV → Markdown

Result: SUCCESS

  • Converts to pipe-table format automatically
  • Size: 392B markdown from 5-column CSV

XLSX → Markdown (requires markitdown[all])

Result: SUCCESS

  • Sheet name becomes ## Sheet Name H2 header
  • Data rows as pipe table
  • Size: 5,098B → 386B (92% smaller)

PPTX → Markdown (requires markitdown[all])

Result: SUCCESS

  • Each slide gets <!-- Slide number: N --> HTML comment
  • Slide title → # heading
  • Bullet content as plain text
  • Speaker notes: NOT included by default
  • Size: 29,233B → 289B (99% smaller)
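
The per-slide comment markers make it possible to split the output back into slides after conversion. The sketch below assumes the marker text is exactly the form observed in this run (<!-- Slide number: N -->). The function name split_slides is hypothetical and not part of markitdown.

```python
import re

# Marker format as observed in this lab run's PPTX output (an assumption
# about other versions and decks).
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")

def split_slides(markdown: str) -> dict[int, str]:
    """Map each slide number to the Markdown text that follows its marker."""
    # re.split with one capture group yields
    # [preamble, number, text, number, text, ...].
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides
```

Text without any marker yields an empty dict, so callers can fall back to treating the whole output as one unit.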

DOCX → Markdown (requires markitdown[all])

Result: SUCCESS

  • Heading styles preserved (H1, H2, etc.)
  • Bold runs → **text**
  • Paragraphs separated by blank lines
  • Complex tables: basic support

PDF → Markdown (requires markitdown[all])

Result: PARTIAL

  • Text-layer PDFs: content extracted
  • Complex multi-column PDFs: may lose reading order
  • Scanned PDFs: requires Azure AI Document Intelligence for accuracy
  • No header/title metadata extracted

JSON

Result: PASS-THROUGH (not transformed)

  • JSON files returned as-is (same JSON string, pretty-printed)
  • Not converted to Markdown structure
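
If a Markdown view of JSON is wanted, it has to be produced outside markitdown. One possible workaround is sketched below, using only the standard library. The helper json_to_markdown is hypothetical, and the nested-bullet layout is just one of several reasonable choices.

```python
import json

def json_to_markdown(value, indent: int = 0) -> str:
    """Render parsed JSON as a nested Markdown bullet list (hypothetical helper)."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}- **{key}**:")
                lines.append(json_to_markdown(item, indent + 1))
            else:
                # Scalars and empty containers are shown in JSON notation.
                lines.append(f"{pad}- **{key}**: {json.dumps(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}-")
                lines.append(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- {json.dumps(item)}")
    else:
        lines.append(f"{pad}{json.dumps(value)}")
    return "\n".join(lines)
```

Usage would be json_to_markdown(json.loads(text)) on the string that the converter passes through.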

Python API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)  # Markdown string
print(result.title)          # Document title if available (HTML only)

# URL conversion
result = md.convert_url("https://example.com/page")
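
When converting many files, one failure should not abort the batch. The sketch below wraps the convert() call shown above in a hypothetical helper, convert_many, which collects errors per path. It catches Exception broadly because this note does not record which exception types the library raises. The convert parameter is an illustration-only hook so the helper can be tried with a stub.

```python
from typing import Callable, Iterable, Optional

def convert_many(
    paths: Iterable[str],
    convert: Optional[Callable[[str], str]] = None,
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each path; return (markdown_by_path, error_by_path)."""
    if convert is None:
        # Imported lazily so the helper also works with a stub converter.
        from markitdown import MarkItDown
        md = MarkItDown()
        convert = lambda p: md.convert(p).text_content
    converted, failed = {}, {}
    for path in paths:
        try:
            converted[path] = convert(path)
        except Exception as exc:  # broad on purpose: exception types are not assumed
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed
```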

Comprehensive Format Test Results

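| Format | Raw Size | Markdown Size | Reduction | Title Extracted |
|--------|----------|---------------|-----------|-----------------|
| HTML | 1,091B | 709B | 35% | Yes |
| CSV | ~150B | 392B | n/a (output larger) | No |
| XLSX | 5,098B | 386B | 92% | No |
| PPTX | 29,233B | 289B | 99% | No |
| DOCX | ~8,000B | 294B | ~96% | No |
| PDF | ~3,000B | 102B | ~97% | No |
| JSON | 276B | 276B | 0% | No |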
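
The reduction figures follow from the two size columns as 1 − markdown/raw. The short check below reproduces them for the rows with exact byte counts. The function name is illustrative only. The CSV row has no reduction figure because the pipe-table output is larger than the input, which the formula reports as a negative value.

```python
def reduction_percent(raw_bytes: int, markdown_bytes: int) -> float:
    """Percent size reduction; negative when the Markdown is larger than the input."""
    return (1 - markdown_bytes / raw_bytes) * 100

# Exact sizes reported in the table above (approximate rows omitted).
measured = {
    "HTML": (1091, 709),
    "XLSX": (5098, 386),
    "PPTX": (29233, 289),
    "JSON": (276, 276),
}
for fmt, (raw, md_size) in measured.items():
    print(f"{fmt}: {reduction_percent(raw, md_size):.0f}%")
```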

Limitations Found

  1. JSON not transformed — returns identical JSON text, not a Markdown representation
  2. PDF accuracy — text-layer PDFs work, but complex multi-column layouts lose reading order; scanned PDFs need Azure AI Document Intelligence extra
  3. PPTX speaker notes — not included in default output
  4. Image OCR — requires pytesseract system install (not included even in [all])
  5. Audio transcription — requires speechrecognition + ffmpeg system install
  6. URL conversion — fetches page as HTML, so dynamic JS-rendered content may be incomplete
  7. No table-of-contents extraction — DOCX/PDF section hierarchy not always preserved
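
One way to act on these findings is to tag each incoming file with the outcome recorded for its format, so that partial or pass-through formats get extra handling. The lookup below is a sketch: the table simply restates this run's results, and lab_status is a hypothetical name.

```python
from pathlib import Path

# Outcomes recorded in this lab run; other extensions were not tested here.
LAB_STATUS = {
    ".html": "success",
    ".csv": "success",
    ".xlsx": "success",
    ".pptx": "success",
    ".docx": "success",
    ".pdf": "partial",
    ".json": "pass-through",
}

def lab_status(path: str) -> str:
    """Return the lab outcome for a file's extension, or 'untested'."""
    return LAB_STATUS.get(Path(path).suffix.lower(), "untested")
```

A caller might route "partial" files to manual review and send "pass-through" files down a separate JSON path.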

What Worked Well

  • Drop-in CLI (python3 -m markitdown file.ext) — no config needed
  • Python API is simple (3 lines to convert)
  • HTML conversion is excellent — tables, links, blockquotes all clean
  • XLSX → pipe table is perfect for LLM context
  • PPTX slide-by-slide numbering helps LLMs reason about structure
  • convert_url() handles fetching + conversion in one call

RAG Pipeline Integration Test

from markitdown import MarkItDown

def ingest_document(path: str) -> str:
    md = MarkItDown()
    result = md.convert(path)
    return result.text_content

# Works for: .html, .pdf, .docx, .pptx, .xlsx, .csv
# Then: chunk -> embed -> store
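
The chunk step was not part of this lab run. The sketch below shows one simple, heading-aware approach. It assumes headings are ATX-style (# through ######) and treats max_chars as an arbitrary budget. It does not special-case headings inside code blocks or keep pipe tables intact when a section must be hard-split.

```python
import re

HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)

def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split at Markdown headings, then pack sections into chunks of at most max_chars.

    A single section longer than max_chars is hard-split on character boundaries.
    """
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    chunks, current = [], ""
    for section in filter(None, sections):
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split sections that cannot fit in one chunk.
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        current = section
    if current:
        chunks.append(current)
    return chunks
```

The output of ingest_document() can be passed straight in, for example chunk_markdown(ingest_document(path), max_chars=2000).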

Summary

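In this sandbox run, markitdown 0.1.5 converted HTML, XLSX, CSV, PPTX, and DOCX without issues. PDF conversion worked for text-layer documents but is rated PARTIAL: complex multi-column layouts may lose reading order, and scanned PDFs need Azure AI Document Intelligence. JSON is a pass-through. Effloow Lab confirmed install, conversion, and API integration in one sandbox session. No API keys were required for local files.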


This note supports the public article and records what was actually checked.
