
Use Microsoft MarkItDown for Local RAG: Convert Any Document to Markdown

Turn PDFs, DOCX, PPTX, images, and audio into clean Markdown for your local RAG pipeline with Microsoft's 133K-star MarkItDown tool.

Robson Pereira · May 31, 2026 · 8 min read

If you run a self-hosted RAG (Retrieval-Augmented Generation) pipeline, you have probably faced the same problem: your local LLM can answer questions from clean text, but your real-world documents live in PDFs, PowerPoint decks, Excel spreadsheets, and Word files. Converting those documents into a format your LLM can understand — without losing structure — is the missing link between your local model and your actual data.

**Microsoft MarkItDown** solves this. With **133,000+ GitHub stars** and an MIT license, it is a lightweight Python utility that converts virtually any file format into clean Markdown. Built by the Microsoft AutoGen team, it is designed specifically for LLM ingestion: the output preserves headings, lists, tables, and links while stripping unnecessary formatting that wastes tokens.

Why Markdown for RAG?

Modern LLMs handle Markdown well. Models like GPT-4o, Claude, Qwen, and DeepSeek have been trained on large amounts of Markdown-formatted text. Markdown is also token-efficient: a table rendered as Markdown typically uses far fewer tokens than the same table in HTML or a raw PDF stream.
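
To make the token-efficiency point concrete, the sketch below renders the same small table as Markdown and as HTML and compares their sizes. Character count is only a rough stand-in for token count (real counts depend on the tokenizer), and the sample rows and helper names are invented for illustration.

```python
# Illustrative comparison: the same table as Markdown vs. HTML.
# Character count is a rough proxy for tokens; exact token counts
# depend on the model's tokenizer.

ROWS = [
    ["Quarter", "Revenue", "Growth"],  # header row (sample data, invented)
    ["Q1", "1.2M", "4%"],
    ["Q2", "1.4M", "17%"],
]

def to_markdown(rows):
    """Render rows as a pipe table with a header separator."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join("---" for _ in header) + "|")
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def to_html(rows):
    """Render the same rows as a minimal HTML table."""
    header, *body = rows
    out = ["<table>", "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"]
    out += ["<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in body]
    out.append("</table>")
    return "\n".join(out)

md, html = to_markdown(ROWS), to_html(ROWS)
print(f"Markdown: {len(md)} chars, HTML: {len(html)} chars")
```

The gap widens with real-world HTML, which usually carries attributes, classes, and inline styles on top of the bare tags used here.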

For RAG pipelines, clean Markdown means:

  • **Better chunking** — hierarchical headings give you natural split points
  • **Higher retrieval quality** — embedding models work better on well-structured text
  • **Lower token costs** — no wasted tokens on invisible PDF layout markup
  • **Easier debugging** — you can read exactly what your LLM sees

What MarkItDown Can Convert

MarkItDown currently supports these input formats:

| Format | Extension | Feature |
|--------|-----------|---------|
| PDF | .pdf | Full text extraction with layout preservation |
| Word | .docx | Headings, lists, tables, links |
| PowerPoint | .pptx | Slide content, speaker notes |
| Excel | .xlsx | Worksheets, cell data (tables) |
| Images | .jpg, .png, etc. | EXIF metadata + OCR (via Tesseract) |
| Audio | .mp3, .wav, etc. | EXIF metadata + speech transcription |
| HTML | .html | Full structure to Markdown |
| CSV/JSON/XML | .csv, .json, .xml | Structured data rendered as tables |
| YouTube URLs | — | Transcript extraction |
| EPUB | .epub | Full ebook content |
| ZIP | .zip | Recursive processing of contents |

This breadth makes it one of the more comprehensive open-source document-to-Markdown converters for local AI workflows.

Installation

```bash
pip install 'markitdown[all]'
```

This installs all optional dependencies. If you only need specific formats:

```bash
pip install 'markitdown[pdf, docx, pptx]'
```

Python 3.10 or higher is required. Using a virtual environment is recommended — see Docker Setup for Local AI Tools for containerised approaches.

Basic Usage

Command-Line Interface

The simplest way to use MarkItDown is from the terminal:

```bash
# Convert a PDF
markitdown quarterly-report.pdf > report.md

# Specify output file
markitdown presentation.pptx -o slides.md

# Pipe content
cat meeting-notes.docx | markitdown
```

Python API

For integration into your RAG pipeline, use the Python API:

```python
import os

from markitdown import MarkItDown

md = MarkItDown()

# Convert a single file
result = md.convert("annual-report.pdf")
markdown_text = result.text_content

# Process multiple files
os.makedirs("./output", exist_ok=True)  # make sure the target folder exists
for filename in os.listdir("./documents"):
    if filename.endswith((".pdf", ".docx", ".pptx")):
        result = md.convert(f"./documents/{filename}")
        with open(f"./output/{filename}.md", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
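
For larger collections it can help to skip unsupported files and keep going when one document fails. The sketch below is one way to do that. `convert_tree` and its `convert` parameter are hypothetical names, not part of MarkItDown; the converter is passed in as a plain callable so the loop can be exercised without real documents.

```python
from pathlib import Path
from typing import Callable

def convert_tree(src_dir, out_dir, convert: Callable[[str], str],
                 suffixes=(".pdf", ".docx", ".pptx")):
    """Convert matching files in src_dir; return (written, failed) lists."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Keep the original extension in the name (report.pdf -> report.pdf.md)
        # so report.pdf and report.docx do not overwrite each other.
        target = out / (path.name + ".md")
        try:
            target.write_text(convert(str(path)), encoding="utf-8")
            written.append(target)
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path, exc))
    return written, failed
```

With MarkItDown installed, the callable could be `lambda p: md.convert(p).text_content`, reusing the `md` instance from the example above. The `failed` list gives you something to log or retry later.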

Building a RAG Pipeline with MarkItDown

Here is a practical workflow for feeding documents into a local RAG system:

Step 1: Convert Your Document Collection

```bash
# Convert an entire directory of mixed-format documents
mkdir -p markdown
for file in data/*.{pdf,docx,pptx,xlsx}; do
  [ -e "$file" ] || continue  # skip patterns that matched nothing
  markitdown "$file" -o "markdown/$(basename "$file").md"
done
```

Step 2: Chunk the Markdown Output

Clean Markdown makes chunking straightforward — split on `##` or `###` headings for semantic chunks:

```python
import re

def chunk_markdown(text, max_chars=1500):
    """Split Markdown into chunks at ## and ### headings."""
    chunks = []
    # The lookahead keeps each heading attached to the section it starts.
    sections = re.split(r'(?=^#{2,3} )', text, flags=re.MULTILINE)
    for section in sections:
        if len(section) > max_chars:
            # Further split long sections into paragraphs
            for para in section.split('\n\n'):
                if para.strip():
                    chunks.append(para.strip())
        elif section.strip():
            chunks.append(section.strip())
    return chunks
```
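
A paragraph lifted out of its section can be ambiguous, so it may help retrieval if each chunk also records the headings it sits under. The sketch below is a variant of the splitter above that keeps that heading trail. `chunk_with_headings` and the dict layout are illustrative choices, not something MarkItDown provides.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_with_headings(text):
    """Split Markdown at headings; tag each chunk with its heading trail.

    Note: lines starting with '#' inside fenced code blocks are also
    treated as headings in this simplified version.
    """
    chunks, trail, buf = [], [], []

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append({"headings": [t for _, t in trail], "text": body})
        buf.clear()

    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            flush()  # close the chunk that belonged to the previous heading
            level, title = len(m.group(1)), m.group(2).strip()
            # Drop headings at the same or a deeper level, then push this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, title))
        else:
            buf.append(line)
    flush()
    return chunks
```

Joining `headings` with a separator such as `" > "` and prepending it to `text` before embedding is one simple way to use the trail.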

Step 3: Embed and Store

Feed your chunks into a local embedding pipeline. For embedding model recommendations, see our Best Embedding Models for Local RAG Systems guide.
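
The storage side can be sketched without committing to a particular model. The example below uses a deliberately crude hashed bag-of-words vector as a placeholder for a real embedding model, and an in-memory list as a placeholder for a vector database. `toy_embed`, `VectorStore`, and the dimension of 256 are all assumptions made for illustration; in a real pipeline you would swap `toy_embed` for calls to your chosen embedding model.

```python
import hashlib
import math
import re

DIM = 256  # arbitrary vector size for the toy embedder

def toy_embed(text):
    """Placeholder embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.items = []  # list of (vector, chunk_text) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((self.embed(chunk), chunk))

    def search(self, query, k=3):
        """Return the k chunks with the highest cosine similarity to the query."""
        q = self.embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), text) for v, text in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

Because the embedder is injected through the constructor, the same `add` and `search` logic keeps working once a real model replaces the placeholder.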

Step 4: Query

When a user asks a question, retrieve relevant chunks and feed them to your local LLM alongside the prompt. Tools like Open WebUI and AnythingLLM can ingest Markdown files natively.
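
Feeding chunks to the model mostly comes down to string assembly under a size budget. Here is a minimal sketch, assuming the chunks arrive already ranked by relevance. `build_prompt`, the character budget, and the prompt wording are illustrative choices, and the call to the model itself is left out because it depends on your runtime.

```python
def build_prompt(question, ranked_chunks, max_context_chars=4000):
    """Pack ranked chunks into a prompt until the character budget is spent."""
    picked, used = [], 0
    for i, chunk in enumerate(ranked_chunks, start=1):
        block = f"[Source {i}]\n{chunk.strip()}"
        if used + len(block) > max_context_chars:
            break  # stop at the first chunk that would overflow the budget
        picked.append(block)
        used += len(block)
    context = "\n\n".join(picked) if picked else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources makes it easier to ask the model to cite which chunk an answer came from, which in turn helps when debugging retrieval.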

Docker Integration

For team environments, package MarkItDown as a container that batch-converts a mounted folder:

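```dockerfile
FROM python:3.12-slim

RUN pip install 'markitdown[all]'

# Documents are mounted at run time, so only the mount points are created here
RUN mkdir -p /input /output

# Inner quotes are escaped so the exec-form JSON array stays valid
CMD ["sh", "-c", "for f in /input/*; do markitdown \"$f\" -o \"/output/$(basename \"$f\").md\"; done"]
```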

```bash
docker build -t markitdown-service .

# Quote the mounts so paths containing spaces still work
docker run --rm -v "$(pwd)/docs:/input" -v "$(pwd)/md:/output" markitdown-service
```

For multi-service stacks, combine this with your local AI Docker Compose setup.

Security Considerations

MarkItDown performs I/O with the same privileges as the current process. In untrusted environments:

  • Use the narrowest conversion function needed (`convert_local()` or `convert_stream()`)
  • Run the conversion in an isolated container or sandbox
  • Validate inputs before processing
  • Avoid processing untrusted documents on critical infrastructure
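
The "validate inputs" point can be made concrete with a small gate in front of the converter. The sketch below checks that a path stays inside an allowed directory, has an allowlisted extension, and is under a size cap before anything is parsed. `validate_input`, the allowlist, and the 50 MB limit are assumptions to adapt, not MarkItDown features.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}  # example allowlist
MAX_BYTES = 50 * 1024 * 1024                            # example size cap

def validate_input(path, base_dir):
    """Return the resolved path if it passes all checks, else raise ValueError."""
    base = Path(base_dir).resolve()
    resolved = Path(path).resolve()
    # Reject paths that escape base_dir (for example via ../ or symlinks).
    if base not in resolved.parents:
        raise ValueError(f"{path} is outside {base_dir}")
    if not resolved.is_file():
        raise ValueError(f"{path} is not a regular file")
    if resolved.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {resolved.suffix}")
    if resolved.stat().st_size > MAX_BYTES:
        raise ValueError(f"{path} exceeds {MAX_BYTES} bytes")
    return resolved
```

The returned path can then be handed to `convert_local()`, the narrower entry point recommended in the list above. An extension check is not proof of file type, so this complements sandboxing and does not replace it.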

For production deployments, follow the How to Secure a Self-Hosted AI Server guidance.

Use Cases Beyond RAG

MarkItDown is useful beyond document retrieval:

  • **Email-to-Markdown** — convert HTML email archives for local LLM analysis
  • **Meeting notes pipeline** — transcribe audio (via Whisper), convert to Markdown, and index
  • **Website scraping** — download pages as HTML, then convert to clean Markdown
  • **Legacy document migration** — batch-convert old Office formats for modern tools
  • **Ebook analysis** — convert EPUB files for LLM-powered book summarisation

Conclusion

Microsoft MarkItDown fills a real gap in the self-hosted AI stack. By converting real-world documents (PDFs, Office files, images, and audio) into clean Markdown that LLMs handle well, it makes local RAG pipelines practical. With 133K GitHub stars, an MIT license, and active maintenance from the AutoGen team, it is a well-supported choice for local document processing, provided untrusted inputs are isolated as described in the security section.
