Stock embeddings (BGE-large, E5, Jina) are trained on broad web data. For niche domains — legal, medical, technical, code, scientific — fine-tuning embeddings on domain-specific (query, positive-passage) pairs measurably improves retrieval quality. Cost: ~£10-30 of GPU time + a few hours of curation.
Use sentence-transformers MultipleNegativesRankingLoss on (query, relevant-passage) pairs from your domain. ~5K-50K pairs is enough for noticeable lift. Train on a 4090: ~2-6 hours. Quality lift: typically +5-15% recall@10 on domain-specific eval. Worth it for: legal, medical, code, niche-technical corpora.
When to fine-tune
- Niche domain vocabulary: stock embeddings haven't seen enough of your terminology
- Stock retrieval quality < 70% recall@10 on representative eval queries
- You have domain query-passage pairs: from search logs, expert-curated, or synthetic
- Domain stable enough to invest: vocabulary doesn't change weekly
For most general-purpose RAG, stock BGE-large is fine. Fine-tuning is the right move when you've measured a quality gap.
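Measuring that gap is cheap. A minimal recall@k harness over precomputed embeddings, as a sketch (the function name and toy data are illustrative; in practice the vectors come from `model.encode(..., normalize_embeddings=True)` on your eval queries and corpus):

```python
import numpy as np

def recall_at_k(query_embs, passage_embs, relevant_idx, k=10):
    """Fraction of queries whose labelled relevant passage appears in the
    top-k results. Embeddings are assumed L2-normalised, so the dot
    product equals cosine similarity."""
    sims = query_embs @ passage_embs.T            # (n_queries, n_passages)
    topk = np.argsort(-sims, axis=1)[:, :k]       # top-k passage indices per query
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return sum(hits) / len(hits)

# toy check: orthonormal passages, queries identical to their positives
passages = np.eye(4)
queries = passages[:2]
print(recall_at_k(queries, passages, relevant_idx=[0, 1], k=1))  # 1.0
```

Run this once with the stock model on ~100-200 representative queries; if it's already above your quality bar, skip fine-tuning.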
Data
Three sources of training data:
- Search logs: queries + clicked passages (positive pairs). Easiest if you have a search system.
- Expert-curated: SMEs label representative queries with correct passages. Highest quality.
- Synthetic: LLM generates queries from your passages. Cheaper but lower quality than expert-curated.
Aim for 5K-50K (query, positive-passage) pairs. Hard negatives improve quality further: passages that look relevant but aren't.
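Mining hard negatives can be as simple as taking the stock model's top-ranked non-positive passages for each query; a hedged sketch over precomputed, normalised embeddings (function name and parameters are illustrative). MultipleNegativesRankingLoss accepts a mined negative as a third text in each InputExample.

```python
import numpy as np

def mine_hard_negatives(query_embs, passage_embs, positive_idx, n_neg=3, skip_top=1):
    """For each query, take highly-ranked passages that are NOT the
    labelled positive as hard negatives. skip_top skips the very top
    ranks, which are often unlabelled true positives (false negatives)."""
    sims = query_embs @ passage_embs.T
    order = np.argsort(-sims, axis=1)             # passages ranked per query
    negatives = []
    for qi, pos in enumerate(positive_idx):
        negs = [p for p in order[qi][skip_top:] if p != pos][:n_neg]
        negatives.append(negs)
    return negatives
```

The `skip_top` guard matters: the highest-ranked "negatives" are frequently relevant passages your labels simply missed, and training against them hurts.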
Recipe
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# pairs: list of (query, positive_passage) tuples from your domain data
train_examples = [InputExample(texts=[query, positive]) for query, positive in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: every other positive in the batch serves as a
# negative for each query, so larger batches give a harder training signal.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./bge-large-domain-tuned",
)
```
Re-embed your corpus with the fine-tuned model, then verify quality on your eval harness before promoting it.
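The promotion check can be an explicit gate: run the same held-out eval with both models and only swap in the tuned embeddings if the lift clears a threshold. A minimal sketch (the function name and the 2-point min_lift default are illustrative choices, not from the recipe above):

```python
def should_promote(baseline_recall, tuned_recall, min_lift=0.02):
    """Promote the tuned embeddings only if recall@10 on the held-out
    eval beats the stock baseline by at least min_lift (absolute)."""
    return tuned_recall - baseline_recall >= min_lift

# e.g. stock BGE-large at 0.64 recall@10, tuned at 0.73
print(should_promote(0.64, 0.73))  # True
```

Gating on the same eval set you used to measure the original gap keeps the before/after comparison honest.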
Verdict
Domain-specific embedding fine-tuning is one of the highest-ROI quality investments for niche-domain RAG. ~£10-30 of GPU time + a few hours of curation typically buys +5-15% retrieval recall. For most general workloads, stock BGE-large is fine; fine-tune when you've measured a gap.
Bottom line
Fine-tune embeddings for niche domains. See embedding GPU sizing.