Optimising Embedding Models for Domain-Specific Document Retrieval

Match embedding models to your document domain — code, medical, legal, or technical — for significantly better local RAG retrieval quality.

Robson Pereira · May 31, 2026 · 10 min read
Visual comparison of embedding vector spaces for different document domains.

Embedding models differ in what they represent well. A general-purpose model trained mostly on web text often struggles with legal contracts, medical literature, or source code. Choosing a model trained on text from your domain can improve retrieval quality substantially.

This guide covers how to evaluate, select, and optimise embedding models for specific document domains.

Why Domain Matters for Embeddings

A model trained primarily on Wikipedia learns patterns from general English. When you feed it a Python function or a clinical trial report, it maps those documents into a vector space that does not capture the meaningful relationships in that domain.

For a general introduction, see Choosing the Best Embedding Model for Local Search.

Embedding Model Categories

General purpose

| Model | Dimensions | Context (tokens) | Ollama available |
|-------|------------|------------------|------------------|
| nomic-embed-text:v1.5 | 768 | 8192 | Yes |
| snowflake-arctic-embed:xs | 384 | 512 | Yes |
| bge-base-en-v1.5 | 768 | 512 | Via GGUF |
| mxbai-embed-large:v1 | 1024 | 512 | Yes |
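For the models marked as available in Ollama, embeddings can be requested from a local Ollama server over HTTP. The following is a minimal sketch, assuming Ollama is running on its default port (11434), the model has already been pulled, and your Ollama version exposes the `/api/embed` endpoint. The helper names `build_embed_request` and `embed` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default address.
OLLAMA_URL = "http://localhost:11434/api/embed"


def build_embed_request(text, model="nomic-embed-text:v1.5", url=OLLAMA_URL):
    """Build the HTTP request for one embedding call (no network access here)."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="nomic-embed-text:v1.5"):
    """Return the embedding vector for `text` as a list of floats."""
    with urllib.request.urlopen(build_embed_request(text, model)) as response:
        body = json.load(response)
    # /api/embed responds with {"embeddings": [[...]]}; one input gives one vector.
    return body["embeddings"][0]
```

With the server running, the length of the vector returned by `embed("hello")` should match the Dimensions column for the chosen model.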

Code-aware

| Model | Dimensions | Context (tokens) | Best for |
|-------|------------|------------------|----------|
| voyage-code-2 | 1536 | 16000 | Multi-language code search (hosted API) |
| codebert | 768 | 512 | Python and Java |

Medical / biomedical

| Model | Dimensions | Best for |
|-------|------------|----------|
| PubMedBERT | 768 | Biomedical literature search |
| BioBERT | 768 | Clinical text mining |
| MedCPT | 768 | Clinical trial matching |

Legal

| Model | Dimensions | Best for |
|-------|------------|----------|
| Legal-BERT | 768 | Contract analysis, case law search |
| CaseLaw-BERT | 768 | Precedent retrieval |

Evaluating Embedding Quality

Test retrieval recall on your actual queries:

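```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recall_at_k(queries, documents, relevant_pairs, model, k=5):
    # `embed(text, model)` is assumed to be defined elsewhere and to return
    # the embedding vector for `text`.
    # Embed each document once rather than once per query.
    doc_vecs = [embed(doc, model) for doc in documents]
    correct = 0
    # Each pair is (query text, index of its relevant document in `documents`).
    for query, relevant_doc_id in relevant_pairs:
        query_vec = embed(query, model)
        scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if relevant_doc_id in top_k:
            correct += 1
    # `queries` is kept for call compatibility; recall is averaged over the labelled pairs.
    return correct / len(relevant_pairs)
```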
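On larger collections it is usually cheaper to embed everything once and score all queries with matrix operations. A possible NumPy sketch, assuming the embeddings are already computed, contain no all-zero vectors, and that each query has exactly one relevant document identified by its row index in `doc_vecs` (the function name `recall_at_k_from_vectors` is illustrative):

```python
import numpy as np


def recall_at_k_from_vectors(query_vecs, doc_vecs, relevant_ids, k=5):
    """Fraction of queries whose relevant document ranks in the top k
    by cosine similarity."""
    q = np.asarray(query_vecs, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = q @ d.T  # shape: (n_queries, n_documents)
    # Indices of the k highest-scoring documents for each query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(relevant_ids)[:, None]).any(axis=1)
    return float(hits.mean())
```

Because the document matrix is computed once, the same `doc_vecs` can be reused to compare several values of `k` without re-embedding anything.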

Target metrics

| Domain | Minimum recall@5 | Target recall@5 |
|--------|------------------|-----------------|
| General web content | 0.75 | 0.85 |
| Code documentation | 0.70 | 0.82 |
| Medical literature | 0.65 | 0.78 |
| Legal contracts | 0.60 | 0.75 |
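These thresholds can be encoded as a simple check in an evaluation script. A minimal sketch follows; the dictionary keys and the function name `grade_recall` are illustrative, and the numbers are copied from the table.

```python
# (minimum, target) recall@5 per domain, copied from the table above.
RECALL_THRESHOLDS = {
    "general": (0.75, 0.85),
    "code": (0.70, 0.82),
    "medical": (0.65, 0.78),
    "legal": (0.60, 0.75),
}


def grade_recall(domain, recall_at_5):
    """Classify a measured recall@5 against the thresholds for `domain`."""
    minimum, target = RECALL_THRESHOLDS[domain]
    if recall_at_5 >= target:
        return "meets target"
    if recall_at_5 >= minimum:
        return "acceptable"
    return "below minimum"
```

For example, a measured recall@5 of 0.75 grades as "meets target" for legal contracts but only "acceptable" for code documentation.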

Chunking Strategy by Domain

| Domain | Recommended chunk size | Overlap |
|--------|------------------------|---------|
| Code | 100-300 tokens | 25 tokens |
| Medical | 200-500 tokens | 50 tokens |
| Legal | 300-800 tokens | 100 tokens |
| General | 300-500 tokens | 50 tokens |

For more on chunking, see Tune Chunk Size and Overlap for Better Retrieval.
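To illustrate how chunk size and overlap interact, here is a minimal sliding-window chunker. It is a sketch that treats whitespace-separated words as a rough stand-in for tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name `chunk_text` is illustrative.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into chunks of up to `chunk_size` words, each sharing
    `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With this sketch, the upper end of the legal row corresponds to `chunk_text(contract, chunk_size=800, overlap=100)`, where `contract` is a placeholder for your document text.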

When to Use General vs Domain-Specific

| Document type | Recommendation |
|---------------|----------------|
| Mixed | General model like nomic-embed-text:v1.5 |
| Mostly source code | Code-aware model |
| Medical papers | PubMedBERT or BioBERT |
| Legal contracts | Legal-BERT |

Conclusion

Start with a general model to establish baseline performance, test a domain-specific alternative, and switch if retrieval quality improves measurably. For most mixed-document setups, nomic-embed-text:v1.5 provides a strong default.
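One way to make "improves measurably" concrete is to require a minimum absolute gain in recall@5, measured on the same query set, before switching. In the sketch below, the 0.03 margin and the function name `should_switch` are assumptions for illustration, not values from this guide.

```python
def should_switch(baseline_recall, candidate_recall, min_gain=0.03):
    """Return True when the candidate beats the baseline by at least `min_gain`.

    `min_gain` is a hypothetical margin; choose one that exceeds the noise
    you see when re-running the evaluation on your own query set.
    """
    return (candidate_recall - baseline_recall) >= min_gain
```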

FAQ

Can I use multiple embedding models in one Open WebUI instance?

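Open WebUI uses one embedding model at a time. Re-index your documents after switching: vectors produced by different models are not comparable, so an index built with the old model will not match queries embedded with the new one.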

Do domain-specific embedding models need a GPU?

Usually not. Most embedding models are small by LLM standards and run acceptably on CPU, although a GPU speeds up indexing of large collections.

How do I know if a model is domain-specific?

Check the model card on Hugging Face for training data information.
