Most production embedders are 100-500M parameters. E5-Mistral-7B-instruct is a 7B LLM repurposed as an embedder, trading throughput and cost for retrieval quality. On our dedicated GPU hosting it fits a single 24 GB card at FP16 and delivers meaningfully better retrieval on hard queries.
VRAM
~14 GB at FP16 for the weights alone. Add per-batch activation memory and you need roughly 18-22 GB in practice. That fits comfortably on a 24 GB RTX 3090 or a 32 GB RTX 5090.
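The weight figure is simple arithmetic (a rough sketch; ~7.2B is Mistral-7B's approximate parameter count):

```python
# FP16 stores 2 bytes per parameter, so the weights alone need ~14 GB.
params = 7.2e9  # approx. parameter count of Mistral-7B
weights_gb = params * 2 / 1e9
print(f"{weights_gb:.1f} GB")  # ~14.4 GB
# Activations grow with batch size x sequence length, hence the 18-22 GB budget.
```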
Deployment
```
# Note: TEI publishes per-GPU-architecture image tags (e.g. :86-1.5 for
# RTX 3090-class cards); the bare :1.5 tag targets A100-class GPUs.
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id intfloat/e5-mistral-7b-instruct \
  --dtype float16 \
  --max-client-batch-size 32
```
Keep batch sizes lower than you would with a small embedder: each sample consumes far more compute and activation memory.
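Once the container is up, a minimal client sketch (assuming the port mapping above; TEI's /embed route accepts a JSON body with an "inputs" field holding a string or list of strings):

```python
import requests

# Embed a small batch through the TEI container started above.
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["first passage", "second passage"]},
    timeout=60,
)
resp.raise_for_status()
embeddings = resp.json()  # one vector per input
print(len(embeddings), len(embeddings[0]))  # 2 4096 (Mistral's hidden size)
```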
The Trade
| Model | Params | MTEB avg | Throughput (docs/s, approx.) |
|---|---|---|---|
| Nomic Embed v1.5 | 137M | ~62 | 12,000 |
| BGE-M3 | 568M | ~65 | 6,000 |
| mxbai-embed-large | 335M | ~64 | 8,000 |
| E5-Mistral-7B | 7B | ~67 | 600 |
E5-Mistral is 10-20x slower than small embedders for a ~2-5 point MTEB lift. That is worth it for hard retrieval tasks with low query volume, and wasteful for bulk-indexing 100M documents, as the arithmetic below shows.
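The bulk-indexing point is easy to make concrete with the table's approximate throughput figures:

```python
# Time to embed 100M documents at the approximate rates in the table above.
docs = 100_000_000
for name, docs_per_sec in [("Nomic Embed v1.5", 12_000), ("E5-Mistral-7B", 600)]:
    hours = docs / docs_per_sec / 3600
    print(f"{name}: {hours:,.1f} h")
# Nomic Embed v1.5: 2.3 h
# E5-Mistral-7B: 46.3 h -- nearly two days on the same hardware
```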
Instructions
E5-Mistral uses instruction-tuned queries. Format:
query = "Instruct: Given a claim, find documents that refute it.\nQuery: " + user_query
Tailoring the instruction prefix to the task improves retrieval. The document side takes no prefix: only queries are instructed, as sketched below.
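A minimal sketch of that asymmetry (format_query is a hypothetical helper; the instruction string is whatever fits your task):

```python
def format_query(task: str, user_query: str) -> str:
    # Queries get the instruction prefix; documents are embedded as-is.
    return f"Instruct: {task}\nQuery: {user_query}"

task = "Given a claim, find documents that refute it."
queries = [format_query(task, "vaccines cause autism")]
documents = ["Large cohort studies have found no link between ..."]  # no prefix

# Both sides then go through the same /embed endpoint; only the query text changes.
```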
High-Quality LLM-Based Embedder
Run E5-Mistral or similar 7B embedders on UK dedicated GPU hosting.
Browse GPU Servers