For RAG reranker architecture, cross-encoders score higher on standard reranking benchmarks but are slower than bi-encoders; the right choice depends on top-K size and latency budget. A cross-encoder takes (query, candidate) pairs and scores relevance with cross-attention over both texts: roughly 5-10× slower per pair, but more accurate. A bi-encoder encodes query and candidate separately and scores with cosine similarity: faster, but less accurate. For top-K=10 reranking, a cross-encoder (BGE-reranker-v2-m3 is the standard production choice) is the default; reserve bi-encoders for very-high-throughput requirements.
Comparison
- Cross-encoder: the model takes (query, candidate) jointly; cross-attention attends across both texts and produces a relevance score. Higher accuracy; roughly 5-10× slower per scored pair.
- Bi-encoder: separate encoders for query and candidate produce dense vectors; cosine similarity gives the score. Faster, since candidate vectors can be precomputed and indexed; less accurate on relevance ranking.
For typical RAG (top-K=10-20 reranking after retrieval), the cross-encoder's accuracy advantage matters more than raw throughput, and ~80-150ms of total rerank latency is acceptable for production.
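The architectural difference above can be sketched in a few lines. This is a toy illustration only: the character-hash "encoder" and the token-overlap "cross-encoder" below are hypothetical stand-ins for real transformer models, chosen so the shapes of the two scoring paths are visible without loading any weights.

```python
import numpy as np

def bi_encode(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in encoder: hash characters into a fixed-size unit vector.
    # A real bi-encoder would be a transformer producing a dense embedding.
    v = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        v[(ord(ch) + i) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def bi_encoder_score(query: str, doc: str) -> float:
    # Query and candidate are encoded INDEPENDENTLY; the score is cosine
    # similarity. Doc vectors can be precomputed offline, hence the speed.
    return float(bi_encode(query) @ bi_encode(doc))

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for joint processing: the scorer sees BOTH texts at once
    # (here, token overlap). A real cross-encoder runs cross-attention
    # over the concatenated pair, which is why it must score each pair
    # from scratch at query time.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

query = "cross encoders for reranking"
for doc in ["cross encoders score pairs", "a cooking recipe"]:
    print(doc, bi_encoder_score(query, doc), cross_encoder_score(query, doc))
```

The point of the sketch is structural: the bi-encoder path factors into two independent encodes plus a dot product, while the cross-encoder path cannot be precomputed per document.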
When to use each
- Cross-encoder (BGE-reranker-v2-m3): production default; top-K=10-50 reranking; quality-anchored
- Bi-encoder: very high throughput (1000+ candidates/query); first-stage retrieval; not the production rerank step
- Hybrid: bi-encoder for initial retrieval; cross-encoder rerank on top-K
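The hybrid pattern above can be sketched end to end: score the whole corpus cheaply with bi-encoder vectors, then spend cross-encoder compute only on the top-K. Both scorers here are toy stand-ins (character-hash embeddings, token overlap), and the function name `retrieve_then_rerank` is illustrative; in production the two stages would be an embedding model plus BGE-reranker-v2-m3.

```python
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Toy bi-encoder stand-in: hash characters into a unit vector.
    v = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        v[(ord(ch) * 31 + i) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def rerank_score(query: str, doc: str) -> float:
    # Toy cross-encoder stand-in: joint access to both texts (token overlap).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve_then_rerank(query: str, corpus: list[str], k: int = 10) -> list[str]:
    # Stage 1: cheap scores against corpus vectors (precomputed/indexed offline).
    corpus_vecs = np.stack([embed(doc) for doc in corpus])
    sims = corpus_vecs @ embed(query)
    top_k = np.argsort(-sims)[:k]
    # Stage 2: expensive joint scoring on only K pairs, not the whole corpus.
    reranked = sorted(top_k, key=lambda i: rerank_score(query, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked]

corpus = ["rerankers improve rag quality", "bi encoders are fast",
          "cross encoders score query document pairs", "unrelated cooking recipe"]
print(retrieve_then_rerank("cross encoders for rag reranking", corpus, k=3))
```

The design point is the cost split: stage 1 is O(corpus) dot products against precomputed vectors, stage 2 is O(K) expensive pair scorings, which is what keeps total rerank latency in the tens-of-milliseconds range for K=10-50.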
Verdict
For RAG reranker architecture in 2026, a cross-encoder (BGE-reranker-v2-m3) is the production default. The accuracy advantage on top-K reranking is real, and the latency cost is manageable. Bi-encoders are right for first-stage retrieval, where they are already standard via embedding models. Don't use a bi-encoder as the rerank stage; quality is meaningfully worse.
Bottom line
Cross-encoder for rerank; bi-encoder for retrieval. See reranker API.