A reranker is the quality multiplier on every serious RAG stack: it turns a bag of 50 dense-retrieval candidates into the three or five your LLM should actually read. Self-hosting it on the RTX 5060 Ti 16GB via UK dedicated GPU hosting delivers 3,200 query-document pairs per second on BGE-reranker-base and removes the per-query Cohere Rerank bill entirely.
Contents
- Deploying TEI
- Capacity and latency
- Client integration snippet
- Cost vs Cohere
- Pairing with an embedder
Deploying TEI
Hugging Face Text Embeddings Inference (TEI) ships a production-grade reranker binary with batching, dynamic padding and CUDA graphs. BGE-reranker-base loads in under 900 MB of VRAM; the larger BGE-reranker-v2-m3 fits in 2.3 GB. Both co-reside with a BGE-base embedder on the same 5060 Ti. See our reranker server setup.
Capacity and latency
| Model | VRAM | Pairs/sec | p50 latency | p99 latency |
|---|---|---|---|---|
| BGE-reranker-base | 0.9 GB | 3,200 | 8 ms | 28 ms |
| BGE-reranker-large | 1.8 GB | 1,400 | 14 ms | 42 ms |
| BGE-reranker-v2-m3 | 2.3 GB | 1,800 | 12 ms | 38 ms |
| mxbai-rerank-large-v1 | 1.9 GB | 1,500 | 14 ms | 41 ms |
At 3,200 pairs/second and 50% utilisation, one 5060 Ti reranks 138M pairs/day – roughly 2.8M user queries each paired against 50 candidates. See reranker throughput.
Client integration snippet
import httpx
def rerank(query: str, docs: list[str], top_k: int = 5):
r = httpx.post(
"https://rerank.example.com/rerank",
json={"query": query, "texts": docs, "top_k": top_k},
timeout=10.0,
)
r.raise_for_status()
return r.json() # [{index, score}, ...]
# Drop-in for cohere.rerank() by mapping field names.
Cost vs Cohere
| Volume | Cohere Rerank 3 | Self-hosted 5060 Ti |
|---|---|---|
| 100k queries / 50 cand | $100 (£79) | Fixed monthly |
| 1M queries / 50 cand | $1,000 (£790) | Fixed monthly |
| 10M queries / 50 cand | $10,000 (£7,900) | Fixed monthly |
| 50M queries / 50 cand | $50,000 (£39,400) | Fixed monthly |
Break-even vs Cohere Rerank 3 at $2 per 1,000 queries lands around 150k queries/month; above 1M/month self-hosting is decisively cheaper and also unlocks GDPR-clean UK data residency, which matters for regulated industries.
Pairing with an embedder
A proper RAG tier co-locates embeddings and rerank on one card: BGE-base at 10,200 texts/sec plus BGE-reranker-base at 3,200 pairs/sec fit comfortably in 3.2 GB combined. See embedding server and SaaS RAG for the joint deployment pattern.
Reranker API on Blackwell 16GB
Cohere Rerank alternative at 3,200 pairs/sec. UK dedicated hosting.
Order the RTX 5060 Ti 16GBSee also: embedding throughput, embedding API, vLLM setup, classification.