Rerankers re-score candidate documents against a query, lifting final RAG answer quality. Measured throughput on the RTX 5060 Ti 16GB via our hosting:
Setup
- Text Embeddings Inference (TEI) 1.5 with rerank endpoint
- Input: query ~20 tokens + doc ~256 tokens
- Metric: pairs/s
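TEI's rerank endpoint takes the query plus all candidate texts in one request and returns a score per candidate. A minimal client sketch, assuming a TEI instance on `localhost:8080` (the default port is an assumption; adjust the URL to your deployment):

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080/rerank"  # assumption: TEI's default port

def build_rerank_payload(query: str, docs: list[str]) -> bytes:
    # TEI's /rerank endpoint expects {"query": ..., "texts": [...]}
    return json.dumps({"query": query, "texts": docs}).encode()

def parse_rerank_response(body: bytes, docs: list[str]) -> list[tuple[str, float]]:
    # The response is a list of {"index": i, "score": s} entries;
    # map the indices back to the original documents.
    return [(docs[r["index"]], r["score"]) for r in json.loads(body)]

def rerank(query: str, docs: list[str], url: str = TEI_URL):
    req = urllib.request.Request(
        url,
        data=build_rerank_payload(query, docs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_rerank_response(resp.read(), docs)
```

Keeping payload building and response parsing as pure functions makes the client easy to test without a running server.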
Models Compared
| Model | Params | Context | FP16 VRAM |
|---|---|---|---|
| BGE-reranker-base | 278M | 512 | 1.1 GB |
| BGE-reranker-large | 560M | 512 | 2.2 GB |
| Jina-reranker-v2 | 568M | 1024 | 2.3 GB |
| Mixedbread-rerank-v1 | 335M | 512 | 1.3 GB |
Pairs per Second (Batch 32)
| Model | pairs/s |
|---|---|
| BGE-reranker-base | 3,200 |
| BGE-reranker-large | 1,850 |
| Jina-reranker-v2 | 1,700 |
| Mixedbread-rerank-v1 | 2,400 |
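The pairs/s figures above can be reproduced with a simple timing loop around whatever batch-scoring call your stack exposes. A sketch; `score_batch` here is a hypothetical stand-in for the real call (an HTTP request to TEI, or a local cross-encoder forward pass):

```python
import time

def measure_pairs_per_s(score_batch, pairs, batch_size=32):
    """Time batched scoring end to end and report sustained pairs/s.

    score_batch is a stand-in for the real model call; pairs is a
    list of (query, document) tuples.
    """
    start = time.perf_counter()
    for i in range(0, len(pairs), batch_size):
        score_batch(pairs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(pairs) / elapsed
```

Run a few warm-up batches before timing, since the first requests pay kernel-compilation and cache-warming costs.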
Per-query latency for 1 query × 100 candidates: ~31 ms with BGE-reranker-base, ~55 ms with BGE-reranker-large. Reranking is cheap enough to include in every RAG query.
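Those latencies follow directly from the throughput table: 100 candidate pairs divided by sustained pairs/s. A quick sanity check of the arithmetic:

```python
def rerank_latency_ms(candidates: int, pairs_per_s: float) -> float:
    # Lower-bound per-query latency: every candidate pair is scored once.
    return candidates / pairs_per_s * 1000

print(f"{rerank_latency_ms(100, 3200):.0f} ms")  # BGE-reranker-base, ~31 ms
print(f"{rerank_latency_ms(100, 1850):.0f} ms")  # BGE-reranker-large, ~54 ms
```

This is a lower bound; request overhead accounts for the extra millisecond seen on BGE-large.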
End-to-End RAG Latency
- Embed query: 3 ms
- Vector search top-100: 20 ms (vector DB, not GPU)
- Rerank top-100: 31 ms (BGE-reranker-base)
- LLM generation: 2,000 ms
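Summing the budget above (numbers from this section) shows how small the rerank share of end-to-end latency is:

```python
budget_ms = {
    "embed_query": 3,
    "vector_search_top100": 20,
    "rerank_top100": 31,   # BGE-reranker-base
    "llm_generation": 2000,
}
total = sum(budget_ms.values())
rerank_share = budget_ms["rerank_top100"] / total
print(total, f"{rerank_share:.1%}")  # 2054 ms total; rerank is ~1.5%
```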
Reranking adds only ~30 ms to the RAG pipeline while the typical NDCG@10 uplift is 10-15%, so it is almost always worth enabling.
Reranking on Blackwell 16GB
3,200 pairs/s on BGE-base. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: reranker server setup, embedding throughput, RAG install, SaaS RAG.