Building a Multi-Model RAG Pipeline with Open WebUI and Ollama
Use different models for embedding, retrieval, and generation in Open WebUI to build a RAG pipeline that balances quality, speed, and cost.

Most RAG setups use a single model for everything: generating embeddings, retrieving documents, and writing answers. That works, but it leaves performance on the table. A multi-model pipeline uses a specialised model for each stage: a fast embedding model for indexing, a lightweight reranker for relevance scoring, and a capable generation model for the final answer.
This guide walks through configuring such a pipeline in Open WebUI with Ollama as the model backend.
Why Multi-Model RAG Works Better
| Stage | Single-model approach | Multi-model approach |
|-------|----------------------|---------------------|
| Embedding | Heavy general model (slow indexing) | Lightweight dedicated embedding model (fast) |
| Retrieval | Same model used for everything | Optional reranker improves relevance |
| Generation | Same model must do all jobs | Capable model focused only on generating answers |
Separating concerns means each model does what it does best. A small dedicated embedding model such as Snowflake Arctic Embed or BGE-M3 indexes documents faster and at lower cost than a 7B chat model pressed into producing embeddings.
For an introduction to RAG fundamentals, see Build a Local RAG Pipeline That Actually Answers Questions.
Selecting Models for Each Stage
Embedding model
The embedding model converts documents into vector representations. The key criteria are output dimensions, maximum context length, and indexing speed:
| Model | Dimensions | Context (tokens) | Speed | Best for |
|-------|-----------|---------|-------|----------|
| snowflake-arctic-embed:xs | 384 | 512 | Very fast | General purpose, small docs |
| nomic-embed-text:v1.5 | 768 | 8192 | Fast | Long documents |
| bge-m3 | 1024 | 8192 | Moderate | Multilingual, high quality |
| mxbai-embed-large:v1 | 1024 | 512 | Fast | English, high quality |
Install via Ollama:
```bash
ollama pull snowflake-arctic-embed:xs
ollama pull nomic-embed-text:v1.5
```
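To make the role of these vectors concrete, here is a minimal sketch of the retrieval step that follows embedding: ranking stored chunk vectors by cosine similarity to the query vector. It is an illustration only. Open WebUI delegates this work to its vector database, and the function names and chunk IDs here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero magnitude."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vector: list[float],
    chunk_vectors: dict[str, list[float]],
    k: int,
) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vectors,
        key=lambda chunk_id: cosine_similarity(query_vector, chunk_vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:k]
```

The length check matters in practice: vectors produced by two different embedding models cannot be compared, which is why switching models requires re-indexing.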
Reranker (optional but recommended)
A reranker scores each retrieved chunk against the query and reorders the chunks before they reach the generation model, which improves result quality.
Open WebUI can use cross-encoder rerankers through its built-in reranking configuration.
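The reranking step itself is simple to sketch: score every candidate chunk against the query, sort by score, and keep the best few. In the example below, `overlap_score` is a toy stand-in for a real cross-encoder, assumed purely for illustration. It is not what Open WebUI runs, and the function names are hypothetical.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(
    query: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> list[str]:
    """Sort chunks by descending relevance to the query and keep the top_n."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]


candidates = [
    "Billing runs monthly.",
    "To reset the admin password open settings.",
    "The admin panel lists users.",
]
best = rerank("reset admin password", candidates, top_n=2)
```

A real cross-encoder replaces `overlap_score` with a model that reads the query and the chunk together, which is slower than vector similarity but usually more accurate. That is why it runs only on the short candidate list.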
Generation model
This is the model that writes answers based on retrieved context. Choose based on your hardware and quality needs:
```bash
ollama pull llama3.1:8b # Good balance for most setups
ollama pull qwen3:8b # Strong reasoning
ollama pull deepseek-v4-pro:q4_k_m # Best quality, needs more RAM
```
Configuring Open WebUI for Multi-Model RAG
Step 1: Set the embedding model
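In Open WebUI, navigate to **Admin Settings -> Documents -> Embedding Model**. If the embedding engine is not already set to Ollama, switch it first, then select the model you pulled in Ollama (for example `snowflake-arctic-embed:xs`).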
```bash
# Verify Ollama is serving the embedding model
curl http://localhost:11434/api/embeddings -d '{
"model": "snowflake-arctic-embed:xs",
"prompt": "test sentence"
}'
```
Step 2: Configure the reranker
Open WebUI supports reranking in its RAG pipeline. Go to **Admin Settings -> Documents -> Reranking** and enable it with your preferred cross-encoder model.
Step 3: Set the default chat model
In **Admin Settings -> General -> Default Model**, set your chosen generation model.
Step 4: Test the pipeline
Upload a document, ask a question about it, and inspect which chunks were retrieved.
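If answers look wrong, it can help to test the generation model in isolation. The sketch below builds a grounded prompt from a list of chunks and sends it to Ollama's `/api/generate` endpoint. The prompt wording, function names, and default model tag are assumptions for illustration, not Open WebUI's internal template.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each chunk and place the question after the context."""
    context = "\n\n".join(
        f"[{index}] {chunk}" for index, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def generate(
    question: str,
    chunks: list[str],
    model: str = "qwen3:8b",  # assumed default; use whichever model you pulled
    url: str = OLLAMA_GENERATE_URL,
) -> str:
    """Send the grounded prompt to Ollama and return the full (non-streamed) answer."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(question, chunks), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["response"]
```

Calling `generate("What is the refund window?", ["Refunds are accepted within 30 days."])` with Ollama running shows whether the model answers correctly when handed the right context. If it does, the problem is more likely in retrieval than in generation.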
Performance Tuning
Chunk size selection
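Keep each chunk within your embedding model's context limit; input beyond the limit is typically truncated, so its content never reaches the vector. A long context window does not require long chunks, and smaller chunks often retrieve more precisely: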
| Embedding model | Recommended chunk size | Overlap |
|-----------------|-----------------------|--------|
| snowflake-arctic-embed:xs | 384 tokens | 64 tokens |
| nomic-embed-text:v1.5 | 1024 tokens | 128 tokens |
| bge-m3 | 1024 tokens | 128 tokens |
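How chunk size and overlap interact is easiest to see in code. The following is a minimal sliding-window sketch over a list of tokens. It assumes the text has already been tokenised (splitting on whitespace is a rough stand-in for a model tokenizer), and the function name is hypothetical rather than part of Open WebUI.

```python
def chunk_tokens(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Split tokens into windows of chunk_size, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


words = "one two three four five six seven eight nine ten".split()
windows = chunk_tokens(words, chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, the ten words become three windows, and the last word of each window reappears as the first word of the next. Larger overlaps reduce the chance of splitting an answer across a boundary, at the cost of more vectors to store and search.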
For a deeper discussion of chunk parameters, see Tune Chunk Size and Overlap for Better Retrieval.
Monitoring Pipeline Quality
Track these metrics:
- **Retrieval precision:** What fraction of retrieved chunks are actually relevant?
- **Answer relevance:** Are the generated answers using the retrieved context effectively?
- **Latency per stage:** How long do embedding, retrieval, reranking, and generation each take?
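Two of these metrics are straightforward to compute yourself. The sketch below shows retrieval precision over a hand-labelled set of relevant chunk IDs, plus a small timer for per-stage latency. The helper names and the idea of labelling relevance manually are assumptions for illustration; Open WebUI does not expose these as built-in functions.

```python
import time
from contextlib import contextmanager


def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that appear in the set judged relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Accumulate wall-clock seconds spent inside the block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)


timings: dict[str, float] = {}
with timed("retrieval", timings):
    precision = retrieval_precision(["c1", "c2", "c3", "c4"], {"c1", "c3"})
```

Wrapping each stage of a test query in `timed(...)` gives a per-stage breakdown, which shows whether a slow response comes from reranking or from generation.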
Common Multi-Model Issues
Embedding dimension mismatch
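Vectors from different embedding models are not interchangeable, and their dimensions often differ (most current models output 384, 768, or 1024). If you change the embedding model after documents have been indexed, re-index those documents so that stored vectors and query vectors come from the same model.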
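A quick way to rule out a mismatch is to request one embedding and compare its length with the dimension you expect. This sketch uses the same Ollama `/api/embeddings` endpoint as Step 1; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_EMBEDDINGS_URL = "http://localhost:11434/api/embeddings"


def fetch_embedding(
    model: str, text: str, url: str = OLLAMA_EMBEDDINGS_URL
) -> list[float]:
    """Request a single embedding vector from a running Ollama instance."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embedding"]


def dimension_matches(vector: list[float], expected_dim: int) -> bool:
    """True when the vector has exactly the expected number of dimensions."""
    return len(vector) == expected_dim
```

For example, `dimension_matches(fetch_embedding("snowflake-arctic-embed:xs", "test sentence"), 384)` should return `True` if the dimensions listed in the model table apply to your installed version.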
Reranker not improving results
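If adding a reranker doesn't improve answer quality, the reranker model may be too small or poorly suited to your document domain. Try a larger or domain-matched reranker, and make sure enough candidates are retrieved before reranking so that it has something to choose from.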
Slow generation after retrieval
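Every retrieved chunk is added to the prompt, so generation slows as the context grows. Limit the number of chunks (the top-k value) passed to the generation model in Open WebUI's RAG settings.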
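The effect of such a limit can be sketched as a cap on both the chunk count and the total context length. The function below is a hypothetical illustration that uses a character budget as a crude proxy for tokens; it is not how Open WebUI implements its setting.

```python
def select_context(chunks: list[str], top_k: int, max_chars: int) -> list[str]:
    """Keep at most top_k chunks, stopping early once the character budget would be exceeded."""
    selected: list[str] = []
    used = 0
    for chunk in chunks[:top_k]:
        if used + len(chunk) > max_chars:
            break  # adding this chunk would overflow the budget
        selected.append(chunk)
        used += len(chunk)
    return selected


ranked_chunks = ["aaaa", "bbbb", "cccc", "dddd"]
context = select_context(ranked_chunks, top_k=3, max_chars=9)
```

Because the chunks are assumed to arrive in relevance order, cutting from the end drops the least relevant material first.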
Conclusion
A multi-model RAG pipeline separates concerns across embedding, reranking, and generation, giving you better retrieval quality and faster indexing than a one-model-fits-all approach. Open WebUI makes this practical by letting you configure each stage independently.
FAQ
Can I use different Ollama hosts for different stages?
Yes. Open WebUI can connect to different Ollama instances for embedding and generation if you configure them separately.
Does multi-model RAG use more memory?
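Possibly. Ollama keeps a model in memory for a while after use (five minutes by default) and can hold several models at once when memory allows, so the embedding and generation models may be resident together. Embedding models are small compared with the generation model, so peak memory is usually dominated by your largest model.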
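To see what is actually resident, you can query Ollama's `/api/ps` endpoint, which lists the currently loaded models and their sizes. The sketch below sums those sizes; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def fetch_loaded_models(url: str = OLLAMA_PS_URL) -> list[dict]:
    """Return the list of models a running Ollama instance currently has loaded."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response).get("models", [])


def total_loaded_gib(models: list[dict]) -> float:
    """Sum the reported size of each loaded model, converted from bytes to GiB."""
    return sum(model.get("size", 0) for model in models) / 1024**3
```

Running `total_loaded_gib(fetch_loaded_models())` right after a RAG query shows whether the embedding and generation models are loaded at the same time on your machine.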
Which embedding model is best for code documentation?
nomic-embed-text:v1.5 handles longer contexts well, making it suitable for code documentation.