RTX 5060 Ti 16GB Reranker Throughput

Cross-encoder reranker throughput on Blackwell 16GB: BGE, Jina, and Mixedbread numbers for query-document scoring.

Rerankers re-score candidate documents against a query to improve final RAG quality. Below is the throughput we measured on the RTX 5060 Ti 16GB on our hosting.

Setup

  • Text Embeddings Inference (TEI) 1.5 with rerank endpoint
  • Input: query ~20 tokens + doc ~256 tokens
  • Metric: query-document pairs scored per second (pairs/s)
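
For readers who want to reproduce the setup, here is a minimal sketch of a client for the TEI rerank route. It assumes a TEI instance is already running with a reranker model loaded; the base URL `http://localhost:8080` and the helper names (`build_rerank_payload`, `order_by_score`, `rerank`) are illustrative assumptions, not part of our benchmark harness.

```python
import json
import urllib.request

# Assumed address of a locally running TEI instance (hypothetical).
TEI_URL = "http://localhost:8080"


def build_rerank_payload(query, docs):
    """Build the JSON body for TEI's /rerank route: one query, many texts."""
    return {"query": query, "texts": list(docs)}


def order_by_score(results, docs):
    """Pair each document with its score and sort best-first.

    `results` is the list returned by /rerank: one item per input text,
    each carrying the text's original `index` and its relevance `score`.
    """
    ranked = [(docs[r["index"]], r["score"]) for r in results]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def rerank(query, docs, base_url=TEI_URL, timeout=10.0):
    """POST one query plus candidate docs to /rerank and return docs best-first."""
    body = json.dumps(build_rerank_payload(query, docs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        results = json.load(resp)
    return order_by_score(results, docs)
```

Calling `rerank("what is a reranker?", candidate_docs)` against a live server would return `(document, score)` tuples with the most relevant document first. Each text in the request counts as one query-document pair for the pairs/s metric.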

Models Compared

Model                  Params  Context (tokens)  FP16 VRAM
BGE-reranker-base      278M    512               1.1 GB
BGE-reranker-large     560M    512               2.2 GB
Jina-reranker-v2       568M    1024              2.3 GB
Mixedbread-rerank-v1   335M    512               1.3 GB

Pairs per Second (Batch 32)

Model                  pairs/s
BGE-reranker-base      3,200
BGE-reranker-large     1,850
Jina-reranker-v2       1,700
Mixedbread-rerank-v1   2,400

Per-query latency when reranking 100 candidates for a single query: 31 ms on BGE-reranker-base, 55 ms on BGE-reranker-large. At that cost, reranking is cheap enough to include in every RAG query.
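
The per-query figures follow roughly from the throughput numbers: latency is about candidates divided by pairs/s. The sketch below is a steady-state estimate only. It assumes throughput at batch 32 holds for a single 100-candidate request and ignores per-request overhead, so it lands within a millisecond or so of the quoted figures rather than matching them exactly. The function names are illustrative.

```python
import math

# Throughput figures from the table above (pairs/s at batch 32).
PAIRS_PER_S = {
    "BGE-reranker-base": 3200,
    "BGE-reranker-large": 1850,
}


def rerank_latency_ms(candidates, pairs_per_s):
    """Estimate time to score `candidates` pairs at a given steady-state throughput."""
    return candidates / pairs_per_s * 1000.0


def batches_needed(candidates, batch_size=32):
    """Number of forward passes needed when candidates are scored in fixed-size batches."""
    return math.ceil(candidates / batch_size)


for model, pps in PAIRS_PER_S.items():
    est = rerank_latency_ms(100, pps)
    print(f"{model}: ~{est:.0f} ms for 100 candidates "
          f"({batches_needed(100)} batches of up to 32)")
```

The same function can be used to size a different candidate count, for example checking whether reranking the top 200 instead of the top 100 still fits a latency budget.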

End-to-End RAG Latency

  • Embed query: 3 ms
  • Vector search top-100: 20 ms (vector DB, not GPU)
  • Rerank top-100: 31 ms (BGE-reranker-base)
  • LLM generation: 2,000 ms

Rerank adds ~31 ms to a pipeline of roughly 2,054 ms, about 1.5% of end-to-end latency. That is usually worth it: the typical NDCG@10 uplift is 10-15%.
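
To make the budget concrete, the stage timings listed above can be summed and expressed as shares of the total. This is plain arithmetic on the article's own figures; the stage keys and the `stage_shares` helper are illustrative names.

```python
# Stage timings from the list above, in milliseconds.
STAGES_MS = {
    "embed_query": 3,
    "vector_search_top100": 20,
    "rerank_top100": 31,
    "llm_generation": 2000,
}


def stage_shares(stages):
    """Return total latency and each stage's fraction of that total."""
    total = sum(stages.values())
    return total, {name: ms / total for name, ms in stages.items()}


total_ms, shares = stage_shares(STAGES_MS)
for name, ms in STAGES_MS.items():
    print(f"{name:22s} {ms:5d} ms  {shares[name]:6.1%}")
print(f"{'total':22s} {total_ms:5d} ms")
```

LLM generation dominates the budget, so the rerank stage stays a small fraction of the total even if the candidate count or model size grows moderately.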

See also: reranker server setup, embedding throughput, RAG install, SaaS RAG.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
