Pure dense retrieval misses exact-keyword queries. Pure BM25 misses semantic matches. Combining them (“hybrid search”) routinely adds 10-15% recall on mixed workloads. On dedicated GPU hosting the implementation is straightforward.
Stack
- Dense: a GPU-hosted embedder (BGE-M3, Nomic, mxbai) feeding a vector DB (Qdrant, Milvus)
- Lexical: Elasticsearch or OpenSearch with BM25, or in-process rank_bm25 for simpler setups
- Fusion layer: reciprocal rank fusion or weighted score combination
Fusion
Reciprocal Rank Fusion (RRF): score = sum over retrievers of 1/(k + rank). k typically 60. Robust because score magnitudes from BM25 and cosine differ wildly – RRF only uses ranks.
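As a concrete check of the formula, a minimal sketch (the helper name `rrf` is ours, not from any library): a document ranked 3rd by the dense retriever and 7th by BM25 scores 1/63 + 1/67 with k = 60.

```python
def rrf(ranks, k=60):
    # RRF score for one document given its 1-based rank in each result list.
    return sum(1 / (k + r) for r in ranks)

rrf([3, 7])   # 1/63 + 1/67 ≈ 0.0308
rrf([1, 50])  # 1/61 + 1/110 ≈ 0.0255
```

Note the second case: a document ranked 1st by one retriever but 50th by the other scores below one ranked 3rd and 7th, so RRF rewards agreement between retrievers rather than a single strong hit.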
Weighted score: normalise scores per retriever (e.g. min-max), then take a weighted average. Requires careful calibration but can outperform RRF if your workloads are stable.
Default to RRF. Move to weighted scoring only if you have a held-out eval set and time to tune weights.
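If you do move to weighted scoring, the core is per-retriever normalisation before averaging. A minimal sketch, assuming score dicts keyed by document ID; the 0.6/0.4 weights are illustrative placeholders, not tuned values:

```python
def minmax(scores):
    # Rescale one retriever's scores into [0, 1] so magnitudes are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(dense, lexical, w_dense=0.6, w_lex=0.4):
    dense, lexical = minmax(dense), minmax(lexical)
    # Union of candidates; a doc missing from one list contributes 0 there.
    return {doc: w_dense * dense.get(doc, 0.0) + w_lex * lexical.get(doc, 0.0)
            for doc in set(dense) | set(lexical)}
```

The weights (and even the choice of normaliser) are exactly what needs a held-out eval set to tune, which is why RRF remains the safer default.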
Implementation
def hybrid_search(query, k=10):
    # Over-retrieve from each side so fusion has candidates to work with.
    dense_hits = vector_db.search(embedder.encode(query), limit=50)
    lexical_hits = bm25.search(query, limit=50)

    # Reciprocal rank fusion with k = 60; ranks are 1-based by convention.
    # The shared dict also deduplicates: a doc returned by both retrievers
    # accumulates both contributions under one ID.
    scores = {}
    for hits in (dense_hits, lexical_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.id] = scores.get(hit.id, 0.0) + 1 / (60 + rank)

    return sorted(scores.items(), key=lambda x: -x[1])[:k]
Pitfalls
- Failing to deduplicate by document ID – both retrievers can return the same doc
- Returning nothing when one retriever comes back empty – degrade to the other's results instead
- Retrieving too few candidates (50 per retriever is a reasonable minimum)
- Applying hybrid search to workloads that do not need it (purely semantic or purely keyword) – it adds latency without a quality gain
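The first two pitfalls are cheap to guard against in the fusion step itself. A defensive sketch, assuming each retriever hands back a list of `(doc_id, score)` pairs (a hypothetical shape, adapt to your client):

```python
def fuse_rrf(result_lists, k=60, top_k=10):
    # Drop empty lists: if one retriever fails or finds nothing,
    # fusion degrades to a plain ranking over the remaining lists.
    non_empty = [results for results in result_lists if results]
    if not non_empty:
        return []

    scores = {}
    for results in non_empty:
        for rank, (doc_id, _) in enumerate(results, start=1):
            # Keying by doc_id deduplicates across retrievers.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```

With one empty list this reduces to RRF over a single retriever, which is just its original ranking with rescaled scores, exactly the degradation you want.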
Production Hybrid Retrieval Stack
Pre-built hybrid search on UK dedicated GPUs with embedder and BM25 working together.
Browse GPU Servers. See late interaction retrieval and BGE reranker.