Tutorials

Self-Hosted BGE Reranker Deployment Guide

BGE-reranker is the leading open-weight reranker for RAG quality. Here is the deployment recipe — TEI, throughput, and where it sits in your pipeline.

Embedding retrieval gets you to roughly 70% of achievable RAG answer quality; adding a reranker recovers most of the rest. BGE-reranker-v2 is the standard open-weight choice.

TL;DR

Run BGE-reranker-v2-m3 via Text Embeddings Inference (TEI). On a 5060 Ti: ~22K query-doc pairs/sec. Insert between embedding retrieval and LLM in your RAG pipeline.

Why a reranker

Embedding similarity returns docs that are roughly relevant. A cross-encoder reranker jointly encodes each query-document pair and scores it directly, giving meaningfully better top-N selection than vector similarity alone.

Standard pipeline: embedding top-50 → reranker top-5 → LLM.
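In code, that pipeline has a simple shape. A minimal sketch, where `embed_search` and `rerank` are stand-ins for your vector store and reranker calls (both hypothetical names, not part of any specific library):

```python
def rerank_pipeline(query, embed_search, rerank, retrieve_k=50, final_k=5):
    """Two-stage retrieval: broad embedding search, then precise reranking."""
    # Stage 1: cheap vector similarity pulls a wide candidate set.
    candidates = embed_search(query, k=retrieve_k)
    # Stage 2: the cross-encoder scores every (query, doc) pair.
    scores = rerank(query, candidates)
    # Keep only the best-scoring docs for the LLM context.
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

# Toy usage with stand-in functions and dummy scores:
docs = [f"doc-{i}" for i in range(50)]
top5 = rerank_pipeline(
    "what is a reranker?",
    embed_search=lambda q, k: docs[:k],
    rerank=lambda q, ds: [len(d) % 7 for d in ds],
)
```

The only structural decision here is `retrieve_k` vs `final_k`: retrieve wide enough that the right doc is in the candidate set, then let the reranker do the precise cut.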

Setup with TEI

docker run -d --gpus all -p 8002:80 \
  -v /data/rerank-cache:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-reranker-v2-m3
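Once the container is up, TEI exposes a `/rerank` endpoint that takes a query plus candidate texts and returns a score per document. A minimal stdlib client sketch, assuming the port mapping from the docker command above and TEI's documented request/response shape (`{"query", "texts"}` in, a list of `{"index", "score"}` out):

```python
import json
import urllib.request

def rerank(query, texts, url="http://localhost:8002/rerank"):
    """POST query + candidate texts to TEI; return texts sorted by score."""
    payload = json.dumps({"query": query, "texts": texts}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)  # list of {"index": int, "score": float}
    return order_by_score(texts, results)

def order_by_score(texts, results):
    """Map TEI's (index, score) results back onto the original texts."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(texts[r["index"]], r["score"]) for r in ranked]
```

Note that TEI returns indices into the `texts` list you sent, so the mapping back to documents happens client-side.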

Performance

GPU                  BGE-reranker-large    BGE-reranker-v2-m3
RTX 3060 12 GB       ~22K pairs/sec        ~16K pairs/sec
RTX 5060 Ti 16 GB    ~28K pairs/sec        ~22K pairs/sec
RTX 5090 32 GB       ~95K pairs/sec        ~75K pairs/sec

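Pair throughput converts directly into a per-query ceiling once you fix the candidate count. A back-of-envelope sketch using the 5060 Ti v2-m3 figure from the table (scoring time only; it ignores batching, tokenization, and network overhead):

```python
pairs_per_sec = 22_000  # BGE-reranker-v2-m3 on RTX 5060 Ti (table above)
candidates = 50         # docs scored per query

queries_per_sec = pairs_per_sec / candidates           # throughput ceiling
gpu_ms_per_query = 1000 * candidates / pairs_per_sec   # pure scoring time

print(queries_per_sec)              # 440.0
print(round(gpu_ms_per_query, 2))   # 2.27
```

So the GPU scoring itself is only a couple of milliseconds per query; the rest of the observed per-query latency is tokenization and request overhead.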
Verdict

BGE-reranker is essential for production RAG. Scoring the top 50 candidates adds roughly 50 ms of end-to-end latency per query. Worth every millisecond.

Bottom line

Always include a reranker in production RAG: the quality gain far outweighs the added latency.


gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
