Building a Multi-Model RAG Pipeline with Open WebUI and Ollama

Use different models for embedding, retrieval, and generation in Open WebUI to build a RAG pipeline that balances quality, speed, and cost.

Robson Pereira · May 31, 2026 · 12 min read
Diagram showing multi-model RAG pipeline with separate embedding, retrieval, and generation stages.

Most RAG setups use a single model for everything: generating embeddings, retrieving documents, and writing answers. That works, but it leaves performance on the table. A multi-model pipeline uses specialised models for each stage — a fast embedding model for indexing, a lightweight reranker for relevance scoring, and a capable generation model for the final answer.

This guide walks through configuring such a pipeline in Open WebUI with Ollama as the model backend.

Why Multi-Model RAG Works Better

| Stage | Single-model approach | Multi-model approach |
|-------|----------------------|---------------------|
| Embedding | Heavy general model (slow indexing) | Lightweight dedicated embedding model (fast) |
| Retrieval | Same model used for everything | Optional reranker improves relevance |
| Generation | Same model must do all jobs | Capable model focused only on generating answers |

Separating concerns means each model does what it does best. A small embedding model like Snowflake Arctic Embed or BGE-M3 is faster and cheaper than using a 7B chat model for embeddings.

For an introduction to RAG fundamentals, see Build a Local RAG Pipeline That Actually Answers Questions.

Selecting Models for Each Stage

Embedding model

The embedding model converts documents into vector representations. Key criteria:

| Model | Dimensions | Context (tokens) | Speed | Best for |
|-------|-----------|---------|-------|----------|
| snowflake-arctic-embed:xs | 384 | 512 | Very fast | General purpose, small docs |
| nomic-embed-text:v1.5 | 768 | 8192 | Fast | Long documents |
| bge-m3 | 1024 | 8192 | Moderate | Multilingual, high quality |
| mxbai-embed-large:v1 | 1024 | 512 | Fast | English, high quality |

Install via Ollama:

```bash
ollama pull snowflake-arctic-embed:xs
ollama pull nomic-embed-text:v1.5
```

Reranker

A reranker scores each retrieved chunk against the query and reorders the results, which improves the quality of the context passed to the generation model. Open WebUI can use cross-encoder rerankers through its built-in reranking configuration.
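
To make the reranking step concrete, here is a minimal sketch of what happens after scoring: chunks are sorted by relevance score and only the best few are kept. The function name `rerank_top_n` and the sample scores are hypothetical; a real cross-encoder would supply the scores, and Open WebUI does this internally.

```bash
# rerank_top_n N: read "score<TAB>chunk" lines on stdin and print the
# N highest-scoring chunks, best first. The scores are assumed to come
# from a cross-encoder; this function only sorts and truncates.
rerank_top_n() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1nr | head -n "$1" | cut -f2-
}

# Hypothetical scores for three retrieved chunks.
printf '0.12\tOllama release notes\n0.91\tHow to set the embedding model\n0.47\tChunk overlap explained\n' \
  | rerank_top_n 2
```

With the sample input, the two chunks printed are "How to set the embedding model" followed by "Chunk overlap explained"; the low-scoring release notes chunk is dropped.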

Generation model

This is the model that writes answers based on retrieved context. Choose based on your hardware and quality needs:

```bash
# Tags are examples; confirm each one exists in the Ollama library before pulling.
ollama pull llama3.1:8b            # Good balance for most setups
ollama pull qwen3:8b               # Strong reasoning
ollama pull deepseek-v4-pro:q4_k_m # Best quality, needs more RAM
```

Configuring Open WebUI for Multi-Model RAG

Step 1: Set the embedding model

In Open WebUI, navigate to **Admin Settings -> Documents -> Embedding Model**. Select the model you pulled in Ollama.

```bash
# Verify Ollama is serving the embedding model
curl http://localhost:11434/api/embeddings -d '{
  "model": "snowflake-arctic-embed:xs",
  "prompt": "test sentence"
}'
```

Step 2: Configure the reranker

Open WebUI supports reranking in its RAG pipeline. Go to **Admin Settings -> Documents -> Reranking** and enable it with your preferred cross-encoder model.

Step 3: Set the default chat model

In **Admin Settings -> General -> Default Model**, set your chosen generation model.

Step 4: Test the pipeline

Upload a document, ask a question about it, and inspect which chunks were retrieved.

Performance Tuning

Chunk size selection

Match your chunk size to your embedding model's context limit:

| Embedding model | Recommended chunk size | Overlap |
|-----------------|-----------------------|--------|
| snowflake-arctic-embed:xs | 384 tokens | 64 tokens |
| nomic-embed-text:v1.5 | 1024 tokens | 128 tokens |
| bge-m3 | 1024 tokens | 128 tokens |
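
To see how chunk size and overlap interact, the sketch below prints the token ranges a simple sliding window would produce. It is an illustration only: `chunk_windows` is a hypothetical helper, and Open WebUI's own text splitter may cut at different boundaries.

```bash
# chunk_windows TOTAL SIZE OVERLAP
# Print "start end" token offsets (end exclusive) for a sliding window
# that advances by SIZE - OVERLAP tokens each step.
chunk_windows() {
  total=$1
  size=$2
  overlap=$3
  step=$((size - overlap))
  if [ "$step" -le 0 ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + size))
    if [ "$end" -gt "$total" ]; then
      end=$total
    fi
    echo "$start $end"
    if [ "$end" -eq "$total" ]; then
      break
    fi
    start=$((start + step))
  done
}

# A 1000-token document with the snowflake-arctic-embed:xs settings (384 / 64).
chunk_windows 1000 384 64
```

For this input the windows are `0 384`, `320 704`, and `640 1000`: each chunk repeats the last 64 tokens of the previous one, and the final chunk is shortened to fit the document.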

For a deeper discussion of chunk parameters, see Tune Chunk Size and Overlap for Better Retrieval.

Monitoring Pipeline Quality

Track these metrics:

  • **Retrieval precision:** What fraction of retrieved chunks are actually relevant?
  • **Answer relevance:** Are the generated answers using the retrieved context effectively?
  • **Latency per stage:** How long do embedding, retrieval, reranking, and generation each take?
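
Retrieval precision is the easiest of these to compute by hand. The sketch below assumes you have labelled which chunk IDs are relevant for a test question; the helper name `retrieval_precision` and the IDs are hypothetical.

```bash
# retrieval_precision "RETRIEVED_IDS" "RELEVANT_IDS"
# Both arguments are space-separated chunk IDs. Prints the fraction of
# retrieved chunks that are in the relevant set, to two decimals.
retrieval_precision() {
  printf '%s\n%s\n' "$1" "$2" | LC_ALL=C awk '
    NR == 1 { n = split($0, retrieved, " ") }
    NR == 2 { m = split($0, ids, " "); for (i = 1; i <= m; i++) relevant[ids[i]] = 1 }
    END {
      if (n == 0) { print "0.00"; exit }
      hits = 0
      for (i = 1; i <= n; i++) if (retrieved[i] in relevant) hits++
      printf "%.2f\n", hits / n
    }'
}

# Four chunks were retrieved; two of them (c1, c7) were labelled relevant.
retrieval_precision "c1 c4 c7 c9" "c1 c7 c8"
```

This prints `0.50`. Tracking the value across a fixed set of test questions shows whether a change to the embedding model, chunk size, or reranker actually helped.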

Common Multi-Model Issues

Embedding dimension mismatch

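Vectors produced by different embedding models are not interchangeable. If you switch embedding models after indexing, the stored vectors no longer match new query vectors, either because the dimensions differ (384, 768, and 1024 are the common sizes) or because the models place text in different vector spaces. Re-index your documents after any change of embedding model.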
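
One way to check a model's output size before indexing is to count the values in the embedding that Ollama returns. The sketch below uses a hypothetical helper, `embedding_dims`, built from plain text tools; it is a rough match on the `"embedding"` field and not a full JSON parser.

```bash
# embedding_dims: read an /api/embeddings JSON response on stdin and
# print the number of values in its "embedding" array.
embedding_dims() {
  tr -d ' \n' \
    | sed -n 's/.*"embedding":\[\([^]]*\)\].*/\1/p' \
    | awk -F, '{ print NF }'
}

# Offline check with a made-up three-value response.
echo '{"embedding": [0.1, -0.2, 0.3]}' | embedding_dims

# Against a live Ollama instance (same request as the curl call in Step 1):
# curl -s http://localhost:11434/api/embeddings \
#   -d '{"model": "snowflake-arctic-embed:xs", "prompt": "test sentence"}' \
#   | embedding_dims
```

The offline check prints `3`. Run against a real model, the number should match the Dimensions column in the model table.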

Reranker not improving results

If adding a reranker doesn't improve answer quality, the reranker model may be too small or poorly suited to your document domain.

Slow generation after retrieval

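Every retrieved chunk is added to the prompt, so a high chunk count inflates the context the generation model has to process. Limit the number of chunks passed to the generation model (the Top K value) in Open WebUI's RAG settings.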
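
A quick back-of-the-envelope calculation shows why the chunk count matters: the retrieved context grows with the number of chunks times the chunk size. The helper names and the 4096-token budget below are hypothetical; substitute the context window you actually run your generation model with.

```bash
# context_tokens TOP_K CHUNK_TOKENS: worst-case tokens of retrieved context.
context_tokens() {
  echo $(( $1 * $2 ))
}

# max_top_k BUDGET CHUNK_TOKENS: largest chunk count that fits the budget.
max_top_k() {
  echo $(( $1 / $2 ))
}

context_tokens 10 1024   # 10 chunks of 1024 tokens each
max_top_k 4096 1024      # chunks that fit in a 4096-token budget
```

The two calls print `10240` and `4`. Leave headroom below the computed maximum for the question, the system prompt, and the answer itself.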

Conclusion

A multi-model RAG pipeline separates concerns across embedding, reranking, and generation, giving you better retrieval quality and faster indexing than a one-model-fits-all approach. Open WebUI makes this practical by letting you configure each stage independently.

FAQ

Can I use different Ollama hosts for different stages?

Yes. Open WebUI can connect to different Ollama instances for embedding and generation if you configure them separately.
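
As a rough sketch, a split setup might be expressed as environment variables for the Open WebUI container. The variable names below (`OLLAMA_BASE_URL`, `RAG_EMBEDDING_ENGINE`, `RAG_OLLAMA_BASE_URL`, `RAG_EMBEDDING_MODEL`) are assumptions to verify against Open WebUI's environment variable reference for your version, and the host names are hypothetical.

```bash
# rag_env_lines GEN_URL EMBED_URL EMBED_MODEL
# Print env-file lines that point chat and embedding traffic at different
# Ollama hosts. Variable names are assumptions; verify them before use.
rag_env_lines() {
  printf 'OLLAMA_BASE_URL=%s\n' "$1"
  printf 'RAG_EMBEDDING_ENGINE=ollama\n'
  printf 'RAG_OLLAMA_BASE_URL=%s\n' "$2"
  printf 'RAG_EMBEDDING_MODEL=%s\n' "$3"
}

# Hypothetical hosts: a GPU box for generation, a smaller box for embeddings.
rag_env_lines http://gpu-box:11434 http://embed-box:11434 snowflake-arctic-embed:xs
```

Redirect the output to a file and pass it to your container runtime as an env file, or set the same values through the admin settings described in the steps above.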

Does multi-model RAG use more memory?

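It depends on how Ollama is configured. Ollama loads models on demand, unloads them after an idle period, and can keep several models resident at once when memory allows. If only one model is loaded at a time, peak memory is determined by your largest model; if the embedding and generation models stay loaded together, budget for their combined size.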

Which embedding model is best for code documentation?

nomic-embed-text:v1.5 handles longer contexts well, making it suitable for code documentation.
