Hybrid Search – BM25 Plus Embeddings on a GPU Server

Hybrid search combines classical lexical matching with dense vector retrieval. This guide covers the implementation pattern that works in production.

Pure dense retrieval misses exact-keyword queries. Pure BM25 misses semantic matches. Combining them (“hybrid search”) routinely adds 10-15% recall on mixed workloads. On dedicated GPU hosting the implementation is straightforward.

Stack

  • Dense: a GPU-hosted embedder (BGE-M3, Nomic, mxbai) feeding a vector DB (Qdrant, Milvus)
  • Lexical: Elasticsearch or OpenSearch with BM25, or the simpler rank_bm25 library running in-process
  • Fusion layer: reciprocal rank fusion or weighted score combination
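To make the lexical side concrete, here is a minimal pure-Python sketch of an in-process BM25 index. It is an illustration only: TinyBM25 and tokenize are hypothetical names rather than the API of rank_bm25 or Elasticsearch, the whitespace tokenizer is a stand-in for a real analyzer, and k1=1.5, b=0.75 are assumed common defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; real engines use analyzers).
    return text.lower().split()


class TinyBM25:
    """Illustrative in-memory BM25 index (hypothetical helper, not a library API)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in doc_tokens]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for toks in doc_tokens:
            df.update(set(toks))
        n = len(docs)
        # Lucene-style idf, which stays non-negative for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def search(self, query, limit=50):
        terms = tokenize(query)
        hits = []
        for doc_id, tf in enumerate(self.tf):
            # Length normalisation: longer-than-average docs are penalised.
            norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl)
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f:
                    score += self.idf[term] * f * (self.k1 + 1) / (f + norm)
            if score > 0:
                hits.append((doc_id, score))
        hits.sort(key=lambda h: h[1], reverse=True)
        return hits[:limit]


docs = [
    "error code E1234 on gpu driver install",
    "how to install nvidia drivers on ubuntu",
    "vector databases for semantic search",
]
index = TinyBM25(docs)
hits = index.search("E1234 driver")  # only the first document matches both terms
```

Note that "driver" does not match "drivers" here because there is no stemming, which is exactly the kind of gap the dense retriever is meant to cover.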

Fusion

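Reciprocal Rank Fusion (RRF): a document's score is the sum, over retrievers, of 1/(k + rank), where rank is the document's 1-based position in that retriever's result list and k is a smoothing constant, typically 60. It is robust because score magnitudes from BM25 and cosine similarity differ wildly, and RRF only uses ranks.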
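A small worked example may help. The sketch below assumes each retriever returns an ordered list of document IDs and that ranks are counted from 1; the function name rrf and the sample IDs are illustrative.

```python
from collections import defaultdict


def rrf(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dense = ["d3", "d1", "d7"]    # illustrative dense ranking
lexical = ["d1", "d9", "d3"]  # illustrative BM25 ranking
fused = rrf([dense, lexical])
# d1 scores 1/62 + 1/61 and d3 scores 1/61 + 1/63, so d1 edges ahead;
# documents seen by only one retriever (d9, d7) follow.
```

Documents that appear in both lists rise to the top even though neither retriever's raw scores were ever compared.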

Weighted score: normalise scores per retriever (for example with min-max scaling), then take a weighted average. This requires careful calibration but can outperform RRF if your workloads are stable.
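As a rough sketch of weighted score fusion, the example below assumes min-max normalisation per retriever and a single dense_weight parameter. Both are illustrative choices; other normalisations (z-score, for instance) are equally plausible, and the sample scores are made up.

```python
def min_max(scores):
    """Scale a {doc_id: score} dict to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: treat every hit as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, lexical_scores, dense_weight=0.5):
    """Weighted average of normalised scores; a missing doc counts as 0."""
    dense_norm = min_max(dense_scores)
    lexical_norm = min_max(lexical_scores)
    fused = {}
    for doc_id in set(dense_norm) | set(lexical_norm):
        fused[doc_id] = (
            dense_weight * dense_norm.get(doc_id, 0.0)
            + (1.0 - dense_weight) * lexical_norm.get(doc_id, 0.0)
        )
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


dense_scores = {"d1": 0.82, "d2": 0.79, "d3": 0.40}   # cosine-like scale
lexical_scores = {"d2": 12.0, "d4": 7.0, "d3": 2.0}   # BM25-like scale
fused = weighted_fusion(dense_scores, lexical_scores, dense_weight=0.5)
```

Min-max scaling is sensitive to outliers in each result list, which is one reason this approach needs calibration that RRF does not.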

Default to RRF. Move to weighted scoring only if you have a held-out eval set and time to tune weights.
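If you do go down the tuning route, a held-out set can be scored with something like recall@k. The sketch below assumes qrels maps each query ID to its set of relevant document IDs and run maps each query ID to a ranked list of retrieved IDs; these names and the sample data are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall(run, qrels, k=10):
    """Average recall@k over all queries in the eval set."""
    total = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)


qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}             # hypothetical judgements
run = {"q1": ["d1", "d9", "d2"], "q2": ["d4", "d6"]}   # hypothetical results

recall_2 = mean_recall(run, qrels, k=2)
recall_3 = mean_recall(run, qrels, k=3)
```

Computing this once for RRF and once per candidate weight gives a direct comparison before committing to weighted scoring.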

Implementation

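def hybrid_search(query, k=10, rrf_k=60, candidates=50):
    # Assumes embedder, vector_db and bm25 are initialised elsewhere.
    dense_hits = vector_db.search(embedder.encode(query), limit=candidates)
    lexical_hits = bm25.search(query, limit=candidates)

    # Reciprocal rank fusion: ranks are 1-based, so a top hit adds 1/(rrf_k + 1).
    scores = {}
    for hits in (dense_hits, lexical_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.id] = scores.get(hit.id, 0.0) + 1.0 / (rrf_k + rank)

    # Return the top-k (doc_id, fused_score) pairs, best first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]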

Pitfalls

  • Failing to deduplicate by document ID – both retrievers can return the same doc, so merge on ID before ranking
  • Not handling the case where one retriever returns zero results – skip fusion and degrade to the other retriever's ranking
  • Retrieving too few candidates (50 is a reasonable minimum per retriever)
  • Using hybrid search on workloads that do not need it (pure semantic or pure keyword) – adds latency without quality gain
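The first two pitfalls can be guarded against in the fusion step itself. The following is a minimal sketch, assuming each retriever hands back an ordered list of document IDs; fuse_candidates and its parameters are illustrative names, not part of any library.

```python
def fuse_candidates(dense_ids, lexical_ids, top_k=10, rrf_k=60):
    """RRF over two ID lists, with dedup and fallback for an empty retriever."""

    def unique(ids):
        # Keep the first (best-ranked) occurrence of each doc ID.
        seen, out = set(), []
        for doc_id in ids:
            if doc_id not in seen:
                seen.add(doc_id)
                out.append(doc_id)
        return out

    dense_ids, lexical_ids = unique(dense_ids), unique(lexical_ids)

    # If one retriever returned nothing, degrade to the other's ranking.
    if not dense_ids:
        return lexical_ids[:top_k]
    if not lexical_ids:
        return dense_ids[:top_k]

    scores = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            # Keying on doc ID merges hits that both retrievers returned.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked][:top_k]


only_lexical = fuse_candidates([], ["a", "b"])
only_dense = fuse_candidates(["a", "a", "b"], [])
both = fuse_candidates(["a", "b"], ["b", "c"])
```

RRF over a single non-empty list would give the same ordering anyway; the explicit branches mainly make the fallback visible and give you a place to log that one retriever came back empty.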

See late interaction retrieval and BGE reranker.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
