Embedding retrieval alone gets you roughly 70% of achievable RAG quality; adding a reranker captures most of the remaining 30%. BGE-reranker-v2 is the standard choice.
Run BGE-reranker-v2-m3 via Text Embeddings Inference (TEI). On an RTX 5060 Ti it scores ~22K query-doc pairs/sec. Insert it between embedding retrieval and the LLM in your RAG pipeline.
Why a reranker
Embedding similarity returns docs that are only roughly relevant. A cross-encoder reranker scores each query-document pair jointly in a single forward pass, giving a meaningfully better top-N selection.
Standard pipeline: embedding top-50 → reranker top-5 → LLM.
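A minimal sketch of that rerank step, assuming the TEI container from the setup section below is serving BAAI/bge-reranker-v2-m3 on localhost:8002 and that TEI's /rerank endpoint accepts {"query", "texts"} and returns one {"index", "score"} entry per text. The candidate passages are made up; in a real pipeline they would be the top-50 hits from embedding retrieval, and the returned top-5 would go into the LLM prompt.

```python
import requests

RERANK_URL = "http://localhost:8002/rerank"  # TEI instance from the setup section below

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, doc) pair with the cross-encoder and keep the best top_n."""
    resp = requests.post(RERANK_URL, json={"query": query, "texts": docs}, timeout=10)
    resp.raise_for_status()
    ranked = sorted(resp.json(), key=lambda r: r["score"], reverse=True)  # one result per doc
    return [docs[r["index"]] for r in ranked[:top_n]]

# Stand-in candidates; normally these come from the embedding top-50.
candidates = [
    "Cross-encoders score a query and a passage together in one forward pass.",
    "Bi-encoders embed queries and passages separately and compare vectors.",
    "Unrelated passage about cooking pasta.",
]
print(rerank("How does a reranker score documents?", candidates, top_n=2))
```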
Setup with TEI
docker run -d --gpus all -p 8002:80 \
-v /data/rerank-cache:/data \
ghcr.io/huggingface/text-embeddings-inference:latest \
--model-id BAAI/bge-reranker-v2-m3
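Once the container is up, a quick smoke test against the /rerank endpoint confirms the model is serving. The example query and passages are made up; the endpoint path and JSON fields reflect TEI's rerank API as I understand it.

```python
import requests

payload = {
    "query": "What does a reranker do?",
    "texts": [
        "A cross-encoder scores a query and a document together.",
        "The weather in Berlin is mild in spring.",
    ],
}
resp = requests.post("http://localhost:8002/rerank", json=payload, timeout=10)
resp.raise_for_status()
# Expect one {"index": ..., "score": ...} per text, with the on-topic passage scoring higher.
print(resp.json())
```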
Performance
| GPU | BGE-reranker-large pairs/sec | BGE-reranker-v2-m3 pairs/sec |
|---|---|---|
| RTX 3060 12 GB | ~22K | ~16K |
| RTX 5060 Ti 16 GB | ~28K | ~22K |
| RTX 5090 32 GB | ~95K | ~75K |
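To sanity-check these numbers on your own hardware, a rough throughput probe against the local TEI instance can look like the sketch below. The batch size and run count are arbitrary assumptions, the batch stays within TEI's default client batch limit, and a single sequential connection like this will understate peak batched throughput; real numbers also depend heavily on passage length.

```python
import time
import requests

URL = "http://localhost:8002/rerank"
BATCH = 32   # pairs per request; kept within TEI's default client batch limit
RUNS = 50
payload = {
    "query": "What is retrieval-augmented generation?",
    "texts": ["Retrieval-augmented generation pairs a retriever with an LLM."] * BATCH,
}

requests.post(URL, json=payload, timeout=60).raise_for_status()  # warm-up request

start = time.perf_counter()
for _ in range(RUNS):
    requests.post(URL, json=payload, timeout=60).raise_for_status()
elapsed = time.perf_counter() - start
print(f"{BATCH * RUNS / elapsed:,.0f} pairs/sec")
```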
Verdict
BGE-reranker is essential for production RAG. Scoring the top-50 candidates adds ~50 ms per query, and it's worth every millisecond.
Bottom line
Always include a reranker in production RAG. For RTX 5060 Ti throughput, see the Performance table above.