Rerankers re-score retrieved passages against the query, lifting RAG answer quality. TEI (Text Embeddings Inference) serves them nicely alongside your embedding server on the RTX 5060 Ti 16GB at our hosting.
Deploy with TEI
docker run --gpus all -p 8081:80 \
-v $PWD/tei-rerank:/data \
ghcr.io/huggingface/text-embeddings-inference:cuda-1.5 \
--model-id BAAI/bge-reranker-base \
--max-batch-tokens 32768
API
curl http://localhost:8081/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "When did we launch the product?",
"texts": [
"We launched in June 2024.",
"Our office is in London.",
"Product design started Jan 2024."
]
}'
The response is a JSON array of candidates, each with its original index and a relevance score, sorted by score from highest to lowest.
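The curl call above can be wrapped in a few lines of Python. A minimal stdlib-only sketch, assuming the server from the deploy step and TEI's `{"index", "score"}` response shape; `top_texts` is a hypothetical helper that maps scores back to the original passages:

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint, matching the docker run above; adjust host/port as needed.
TEI_RERANK_URL = "http://localhost:8081/rerank"

def rerank(query, texts, url=TEI_RERANK_URL):
    """POST query + candidate texts to TEI's /rerank endpoint.

    TEI returns a JSON array of {"index": int, "score": float} objects.
    """
    payload = json.dumps({"query": query, "texts": texts}).encode()
    req = Request(url, data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

def top_texts(texts, results, k=4):
    """Map rerank results back to the original passages, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [texts[r["index"]] for r in ranked[:k]]
```

Calling `rerank("When did we launch the product?", texts)` against the server and passing the result through `top_texts(texts, results, k=1)` should return the June 2024 launch passage first.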
Integrate into RAG
- Embed query, retrieve top-100 from vector DB
- POST query + top-100 to rerank endpoint
- Take top-4 reranked candidates
- Pass those to the LLM as context
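The four steps above can be sketched end to end. This is a hedged outline, not a drop-in implementation: the embedding-server port (8080) is an assumption, and `retrieve_candidates` is a placeholder for whatever vector DB you use:

```python
import json
from urllib.request import Request, urlopen

# Assumed local endpoints: TEI embedding server and the reranker deployed above.
EMBED_URL = "http://localhost:8080/embed"
RERANK_URL = "http://localhost:8081/rerank"

def post_json(url, payload):
    """Small helper: POST JSON, return the parsed JSON response."""
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

def retrieve_candidates(query_vector, k=100):
    """Placeholder for your vector DB search (pgvector, Qdrant, etc.).
    Must return the k nearest passages as a list of strings."""
    raise NotImplementedError

def build_context(passages):
    """Join the top reranked passages into a numbered LLM context block."""
    return "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))

def rag_context(query, top_k=4):
    # 1. Embed the query (TEI /embed returns a list of vectors).
    vec = post_json(EMBED_URL, {"inputs": query})[0]
    # 2. Retrieve top-100 candidates from the vector DB.
    candidates = retrieve_candidates(vec, k=100)
    # 3. Rerank query + candidates.
    results = post_json(RERANK_URL, {"query": query, "texts": candidates})
    # 4. Keep the top reranked passages and format them as context.
    best = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    return build_context([candidates[r["index"]] for r in best])
```

Feeding `rag_context(question)` to the LLM prompt replaces the raw top-100 retrieval with four high-precision passages.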
Latency: roughly 30-60 ms to rerank 100 candidates with BGE-reranker-base. Worth every millisecond: NDCG@10 typically improves by 10-15%.
Model Picks
| Model | Quality | Speed |
|---|---|---|
| BAAI/bge-reranker-base | Good | 3,200 pairs/s |
| BAAI/bge-reranker-large | Better | 1,850 pairs/s |
| jinaai/jina-reranker-v2-base-multilingual | Multilingual | 1,700 pairs/s |
| mixedbread-ai/mxbai-rerank-large-v1 | Strong | 2,400 pairs/s |
Default: BGE-reranker-base for English RAG. Upgrade to BGE-reranker-large for quality-critical workloads.
Reranker Server on Blackwell 16GB
3,200 pairs/s on BGE-base. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: throughput numbers, embedding server, RAG stack, SaaS RAG.