Use Microsoft MarkItDown for Local RAG: Convert Any Document to Markdown
Turn PDFs, DOCX, PPTX, images, and audio into clean Markdown for your local RAG pipeline with Microsoft's 133K-star MarkItDown tool.

If you run a self-hosted RAG (Retrieval-Augmented Generation) pipeline, you have probably faced the same problem: your local LLM can answer questions from clean text, but your real-world documents live in PDFs, PowerPoint decks, Excel spreadsheets, and Word files. Converting those documents into a format your LLM can understand — without losing structure — is the missing link between your local model and your actual data.
**Microsoft MarkItDown** addresses this problem. With **133,000+ GitHub stars** and an MIT license, it is a lightweight Python utility that converts a wide range of file formats into clean Markdown. Built by the Microsoft AutoGen team, it is designed specifically for LLM ingestion: the output preserves headings, lists, tables, and links while stripping formatting that would otherwise waste tokens.
Why Markdown for RAG?
Markdown is the native language of modern LLMs. Models like GPT-4o, Claude, Qwen, and DeepSeek have been trained on vast amounts of Markdown-formatted text and understand it well. Markdown is also highly token-efficient — a table rendered as Markdown uses far fewer tokens than the same table in HTML or a PDF stream.
For RAG pipelines, clean Markdown means:
- **Better chunking** — hierarchical headings give you natural split points
- **Higher retrieval quality** — embedding models work better on well-structured text
- **Lower token costs** — no wasted tokens on invisible PDF layout markup
- **Easier debugging** — you can read exactly what your LLM sees
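To make the token-cost point concrete, here is a rough sketch that renders the same small table as Markdown and as HTML and compares their lengths. Character count is used only as a crude stand-in for tokens (actual counts depend on the tokenizer), and the sample rows and helper names are made up for illustration.

```python
# Illustrative data; not taken from any real report.
rows = [("Quarter", "Revenue"), ("Q1", "120"), ("Q2", "135")]

def to_markdown(rows):
    """Render rows (first row is the header) as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def to_html(rows):
    """Render the same rows as a minimal HTML table."""
    header, *body = rows
    out = ["<table>"]
    out.append("<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>")
    out += ["<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in body]
    out.append("</table>")
    return "\n".join(out)

md_table = to_markdown(rows)
html_table = to_html(rows)

# Character length as a rough proxy for token count.
print(f"Markdown: {len(md_table)} chars, HTML: {len(html_table)} chars")
```

Even without attributes or styling, the HTML version is noticeably longer, and real-world HTML usually carries more markup than this.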
What MarkItDown Can Convert
MarkItDown currently supports these input formats:
| Format | Extension | Feature |
|--------|-----------|---------|
| PDF | .pdf | Full text extraction with layout preservation |
| Word | .docx | Headings, lists, tables, links |
| PowerPoint | .pptx | Slide content, speaker notes |
| Excel | .xlsx | Worksheets, cell data (tables) |
| Images | .jpg, .png, etc. | EXIF metadata + OCR (via Tesseract) |
| Audio | .mp3, .wav, etc. | EXIF metadata + speech transcription |
| HTML | .html | Full structure to Markdown |
| CSV/JSON/XML | .csv, .json, .xml | Structured data rendered as tables |
| YouTube URLs | — | Transcript extraction |
| EPUB | .epub | Full ebook content |
| ZIP | .zip | Recursive processing of contents |
This covers most of the document types a local AI workflow is likely to encounter, all through a single tool.
Installation
```bash
pip install 'markitdown[all]'
```
This installs all optional dependencies. If you only need specific formats:
```bash
pip install 'markitdown[pdf, docx, pptx]'
```
Python 3.10 or higher is required. Using a virtual environment is recommended — see Docker Setup for Local AI Tools for containerised approaches.
Basic Usage
Command-Line Interface
The simplest way to use MarkItDown is from the terminal:
```bash
# Convert a PDF
markitdown quarterly-report.pdf > report.md

# Specify output file
markitdown presentation.pptx -o slides.md

# Pipe content
cat meeting-notes.docx | markitdown
```
Python API
For integration into your RAG pipeline, use the Python API:
```python
import os

from markitdown import MarkItDown

md = MarkItDown()

# Convert a single file
result = md.convert("annual-report.pdf")
markdown_text = result.text_content

# Process multiple files
os.makedirs("./output", exist_ok=True)  # make sure the target directory exists
for filename in os.listdir("./documents"):
    if filename.lower().endswith((".pdf", ".docx", ".pptx")):
        result = md.convert(f"./documents/{filename}")
        with open(f"./output/{filename}.md", "w", encoding="utf-8") as f:
            f.write(result.text_content)
```
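In a real document collection, some files will be corrupt or password-protected, and one failure should not abort the whole batch. The following is a minimal sketch of a more defensive batch converter. `convert_tree` and `markitdown_converter` are hypothetical helper names, and the converter is passed in as a plain callable so the loop can be exercised without MarkItDown installed; only `MarkItDown().convert()` and `.text_content` come from the library.

```python
from pathlib import Path
from typing import Callable

SUPPORTED = {".pdf", ".docx", ".pptx"}  # extend as needed

def convert_tree(src_dir, out_dir, convert: Callable[[Path], str]):
    """Convert every supported file in src_dir; return (written, failed)."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            text = convert(path)
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((path, str(exc)))
            continue
        target = out / (path.name + ".md")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failed

def markitdown_converter():
    """Build a converter backed by MarkItDown (imported lazily)."""
    from markitdown import MarkItDown
    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content
```

A call such as `written, failed = convert_tree("./documents", "./output", markitdown_converter())` then returns the list of failures for logging or retrying.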
Building a RAG Pipeline with MarkItDown
Here is a practical workflow for feeding documents into a local RAG system:
Step 1: Convert Your Document Collection
```bash
# Convert an entire directory of mixed-format documents
mkdir -p markdown
for file in data/*.{pdf,docx,pptx,xlsx}; do
  [ -e "$file" ] || continue  # skip patterns that matched no files
  markitdown "$file" -o "markdown/$(basename "$file").md"
done
```
Step 2: Chunk the Markdown Output
Clean Markdown makes chunking straightforward — split on `##` or `###` headings for semantic chunks:
```python
import re

def chunk_markdown(text, max_chars=1500):
    chunks = []
    # Split just before every level-2 or level-3 heading
    sections = re.split(r'(?=^#{2,3} )', text, flags=re.MULTILINE)
    for section in sections:
        if len(section) > max_chars:
            # Further split long sections on paragraph breaks
            for para in section.split('\n\n'):
                if para.strip():
                    chunks.append(para.strip())
        elif section.strip():
            chunks.append(section.strip())
    return chunks
```
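Splitting on headings can leave a chunk without the context that gives it meaning, for example a paragraph under "EMEA" that no longer says it belongs to "Sales". One possible refinement, sketched below, tags each chunk with its full heading path. `chunk_with_headings` is an illustrative helper, not part of MarkItDown, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6}) +(.*)$")

def chunk_with_headings(text):
    """Split Markdown into sections, tagging each with its heading path."""
    chunks, path, buf = [], [], []

    def flush():
        # Emit the buffered body text under the current heading path.
        body = "\n".join(buf).strip()
        if body:
            chunks.append({
                "headings": " > ".join(title for _, title in path),
                "text": body,
            })
        buf.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buf.append(line)
    flush()
    return chunks
```

Prepending the `headings` value to the chunk text before embedding is one way to keep that context available at retrieval time.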
Step 3: Embed and Store
Feed your chunks into a local embedding pipeline. For embedding model recommendations, see our Best Embedding Models for Local RAG Systems guide.
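The shape of this step can be sketched with the standard library alone. The example below uses a toy hashed bag-of-words function as a placeholder for a real embedding model and keeps vectors in memory; `toy_embed` and `VectorStore` are illustrative names, and in practice you would pass your own embedding function and persist to a vector database.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size; real embedding models define their own

def toy_embed(text):
    """Placeholder embedder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory store of (vector, chunk) pairs with cosine search."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.items = []

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((self.embed(chunk), chunk))

    def search(self, query, k=3):
        """Return the k best (score, chunk) pairs, highest score first."""
        q = self.embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self.items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Because the embedder is injected through the constructor, swapping in a real model should only require a function that maps a string to a list of floats.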
Step 4: Query
When a user asks a question, retrieve relevant chunks and feed them to your local LLM alongside the prompt. Tools like Open WebUI and AnythingLLM can ingest Markdown files natively.
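If you assemble the prompt yourself rather than relying on one of these tools, the step might look like the sketch below. It assumes retrieval returns `(score, chunk)` pairs sorted best first; `build_prompt`, the character budget, and the instruction wording are all illustrative choices rather than a prescribed format.

```python
def build_prompt(question, retrieved, max_context_chars=4000):
    """Assemble a grounded prompt from (score, chunk) pairs, best first."""
    parts, used = [], 0
    for i, (_score, chunk) in enumerate(retrieved, start=1):
        if used + len(chunk) > max_context_chars:
            break  # stop before overflowing the context budget
        parts.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks makes it easier to ask the model to cite which passage an answer came from.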
Docker Integration
For team environments, package MarkItDown as a container that converts every file in a mounted input directory:
```dockerfile
FROM python:3.12-slim
RUN pip install 'markitdown[all]'
# Input and output directories are bind-mounted at run time
RUN mkdir -p /input /output
CMD ["sh", "-c", "for f in /input/*; do markitdown \"$f\" -o \"/output/$(basename \"$f\").md\"; done"]
```
```bash
docker build -t markitdown-service .
docker run --rm -v "$(pwd)/docs:/input" -v "$(pwd)/md:/output" markitdown-service
```
For multi-service stacks, combine this with your local AI Docker Compose setup.
Security Considerations
MarkItDown performs I/O with the same privileges as the current process. In untrusted environments:
- Use the narrowest conversion function needed (`convert_local()` or `convert_stream()`)
- Run the conversion in an isolated container or sandbox
- Validate inputs before processing
- Avoid processing untrusted documents on critical infrastructure
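The first and third points can be combined in a small wrapper. The sketch below validates a path before handing it to `convert_local()`, which only reads local files. The allowlist, size limit, and helper names (`validate_input`, `convert_untrusted`) are assumptions for illustration, and this is no substitute for sandboxing.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}  # illustrative allowlist
MAX_BYTES = 50 * 1024 * 1024  # illustrative 50 MiB limit

def validate_input(path, base_dir):
    """Return the resolved path if it passes basic checks, else raise ValueError."""
    base = Path(base_dir).resolve()
    resolved = Path(path).resolve()
    # Reject paths that escape the upload directory (e.g. via ../ or symlinks).
    if not resolved.is_relative_to(base):
        raise ValueError(f"{path} is outside {base_dir}")
    if not resolved.is_file():
        raise ValueError(f"{path} is not a regular file")
    if resolved.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"{resolved.suffix or 'no extension'} is not allowed")
    if resolved.stat().st_size > MAX_BYTES:
        raise ValueError(f"{path} exceeds {MAX_BYTES} bytes")
    return resolved

def convert_untrusted(path, base_dir):
    """Validate, then convert using the local-file-only entry point."""
    from markitdown import MarkItDown  # imported lazily
    safe_path = validate_input(path, base_dir)
    return MarkItDown().convert_local(str(safe_path)).text_content
```

Extension checks are easy to spoof, so treat this as a first filter and still run the conversion in an isolated container.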
For production deployments, follow the How to Secure a Self-Hosted AI Server guidance.
Use Cases Beyond RAG
MarkItDown is useful beyond document retrieval:
- **Email-to-Markdown** — convert HTML email archives for local LLM analysis
- **Meeting notes pipeline** — transcribe audio (via Whisper), convert to Markdown, and index
- **Website scraping** — download pages as HTML, then convert to clean Markdown
- **Legacy document migration** — batch-convert old Office formats for modern tools
- **Ebook analysis** — convert EPUB files for LLM-powered book summarisation
Conclusion
Microsoft MarkItDown fills a critical gap in the self-hosted AI stack. By converting real-world documents (PDFs, Office files, images, and audio) into clean Markdown that LLMs handle well, it makes local RAG pipelines practical. With 133K GitHub stars, an MIT license, and active maintenance from the AutoGen team, it is a well-supported choice for local document processing, provided untrusted inputs are handled with the precautions described above.