Pure dense retrieval misses exact-keyword queries. Pure BM25 misses semantic matches. Combining them (“hybrid search”) routinely adds 10-15% recall on mixed workloads. On dedicated GPU hosting the implementation is straightforward.
Stack
- Dense: a GPU-hosted embedder (BGE-M3, Nomic, mxbai) feeding a vector DB (Qdrant, Milvus)
- Lexical: Elasticsearch or OpenSearch with BM25, or in-process rank_bm25 for simpler setups
- Fusion layer: reciprocal rank fusion or weighted score combination
Fusion
Reciprocal Rank Fusion (RRF): score = sum over retrievers of 1/(k + rank). k typically 60. Robust because score magnitudes from BM25 and cosine differ wildly – RRF only uses ranks.
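As a concrete check of the formula, a minimal sketch (the helper name `rrf` is ours, not from any library): a document ranked 3rd by the dense retriever and 7th by BM25 scores 1/63 + 1/67 with k = 60.

```python
def rrf(ranks, k=60):
    # RRF score for one document given its 1-based rank in each result list.
    return sum(1 / (k + r) for r in ranks)

rrf([3, 7])   # 1/63 + 1/67 ≈ 0.0308
rrf([1, 50])  # 1/61 + 1/110 ≈ 0.0255
```

Note the second case: a document ranked 1st by one retriever but 50th by the other scores below one ranked 3rd and 7th, so RRF rewards agreement between retrievers rather than a single strong hit.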
Weighted score: normalise scores per retriever (e.g. min-max), then take a weighted average. Requires careful calibration but can outperform RRF if your workloads are stable.
Default to RRF. Move to weighted scoring only if you have a held-out eval set and time to tune weights.
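If you do move to weighted scoring, the core is per-retriever normalisation before averaging. A minimal sketch, assuming score dicts keyed by document ID; the 0.6/0.4 weights are illustrative placeholders, not tuned values:

```python
def minmax(scores):
    # Rescale one retriever's scores into [0, 1] so magnitudes are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(dense, lexical, w_dense=0.6, w_lex=0.4):
    dense, lexical = minmax(dense), minmax(lexical)
    # Union of candidates; a doc missing from one list contributes 0 there.
    return {doc: w_dense * dense.get(doc, 0.0) + w_lex * lexical.get(doc, 0.0)
            for doc in set(dense) | set(lexical)}
```

The weights (and even the choice of normaliser) are exactly what needs a held-out eval set to tune, which is why RRF remains the safer default.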
Implementation
def hybrid_search(query, k=10):
    # Over-retrieve from each side so fusion has candidates to work with.
    dense_hits = vector_db.search(embedder.encode(query), limit=50)
    lexical_hits = bm25.search(query, limit=50)

    # Reciprocal rank fusion with k = 60; ranks are 1-based by convention.
    # The shared dict also deduplicates: a doc returned by both retrievers
    # accumulates both contributions under one ID.
    scores = {}
    for hits in (dense_hits, lexical_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.id] = scores.get(hit.id, 0.0) + 1 / (60 + rank)

    return sorted(scores.items(), key=lambda x: -x[1])[:k]
Pitfalls
- Failing to deduplicate by document ID – both retrievers can return the same doc
- Returning nothing when one retriever comes back empty – degrade to the other's results instead
- Retrieving too few candidates (50 per retriever is a reasonable minimum)
- Applying hybrid search to workloads that do not need it (purely semantic or purely keyword) – it adds latency without a quality gain
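The first two pitfalls are cheap to guard against in the fusion step itself. A defensive sketch, assuming each retriever hands back a list of `(doc_id, score)` pairs (a hypothetical shape, adapt to your client):

```python
def fuse_rrf(result_lists, k=60, top_k=10):
    # Drop empty lists: if one retriever fails or finds nothing,
    # fusion degrades to a plain ranking over the remaining lists.
    non_empty = [results for results in result_lists if results]
    if not non_empty:
        return []

    scores = {}
    for results in non_empty:
        for rank, (doc_id, _) in enumerate(results, start=1):
            # Keying by doc_id deduplicates across retrievers.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```

With one empty list this reduces to RRF over a single retriever, which is just its original ranking with rescaled scores, exactly the degradation you want.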
Production Hybrid Retrieval Stack
Pre-built hybrid search on UK dedicated GPUs with embedder and BM25 working together.
Browse GPU Servers. See late interaction retrieval and BGE reranker.