Multi-query RAG uses an LLM to generate multiple rephrasings of the user’s question, runs retrieval for each, then fuses the results. On dedicated GPU hosting the cost is one extra LLM pass per query, and the payoff is typically 5-15% higher recall on hard queries.
Pattern
- Receive user query
- LLM rewrites into 3-5 alternative phrasings
- Retrieve top k for each phrasing
- Fuse with Reciprocal Rank Fusion
- Pass fused results to the answering LLM
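The steps above can be sketched end to end. This is a minimal sketch, not a reference implementation: `rewrite`, `retrieve`, and `answer` are hypothetical callables standing in for your rewrite LLM, vector store, and answering LLM.

```python
from collections import defaultdict

def multi_query_rag(query, rewrite, retrieve, answer, k=20, top_n=10):
    """Sketch of the five-step pipeline. `rewrite`, `retrieve`, and
    `answer` are placeholders for your own LLM and vector-store calls."""
    variants = [query] + rewrite(query)  # keep the original query too
    scores = defaultdict(float)
    docs = {}
    for q in variants:
        for rank, hit in enumerate(retrieve(q, k)):
            scores[hit["id"]] += 1 / (60 + rank)  # Reciprocal Rank Fusion
            docs[hit["id"]] = hit
    fused = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return answer(query, [docs[i] for i in fused])
```

Keeping the original query in the variant list is a cheap safeguard: if the rewrite model produces weak paraphrases, the original still contributes to the fused ranking.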
Prompt
prompt = f"""
The user asked: {query}
Generate 4 alternative phrasings of this question that would retrieve relevant documents. Include different perspectives: as a keyword query, as a technical reformulation, as a related concept query, and as an inverse phrasing.
Return as a JSON array of strings.
"""
A small model (Llama 3 8B INT8) handles this well. Response time: ~300-500 ms on a 5080.
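Small models do not always return clean JSON: they sometimes wrap the array in a markdown fence or prepend prose. A defensive parse with a fallback to the original query keeps the pipeline from failing on a bad rewrite. The function below is a sketch; the fence-stripping heuristic is an assumption about typical model output, not a guarantee.

```python
import json

def parse_variants(raw: str, fallback: str) -> list[str]:
    """Parse the model's JSON array of phrasings; fall back to the
    original query if the output is malformed."""
    cleaned = raw.strip()
    # Strip a markdown fence the model may wrap around the JSON
    cleaned = cleaned.removeprefix("```json").removeprefix("```")
    cleaned = cleaned.removesuffix("```").strip()
    try:
        variants = json.loads(cleaned)
        if isinstance(variants, list) and all(isinstance(v, str) for v in variants):
            return variants
    except json.JSONDecodeError:
        pass
    return [fallback]
```

With the fallback in place, a garbled rewrite degrades to plain single-query retrieval instead of an error.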
Fusion
Same RRF pattern as hybrid search. Score each document by sum of 1/(60 + rank) across all query variants where it appears.
from collections import defaultdict

scores = defaultdict(float)
for q in query_variants:
    hits = retriever.search(q, limit=20)
    for rank, hit in enumerate(hits):
        scores[hit.id] += 1 / (60 + rank)

top = sorted(scores.items(), key=lambda x: -x[1])[:10]
Latency
Typical overhead per query:
- Query rewrite LLM: ~400 ms
- N parallel retrievals: ~50 ms total (batched)
- Fusion: negligible
- Total added: ~500 ms
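The "~50 ms total" line assumes the per-variant retrievals run concurrently rather than sequentially. One way to get that, assuming a thread-safe `retriever.search` (an assumption about your vector-store client), is a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_all(retriever, variants, limit=20):
    # Vector-store calls are I/O-bound, so running them concurrently
    # keeps wall time close to one retrieval rather than N of them.
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        return list(pool.map(lambda q: retriever.search(q, limit=limit),
                             variants))
```

Results come back in variant order, so they can be fed straight into the fusion step.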
For chat UIs this is usually acceptable. For strict sub-second budgets, skip multi-query or use a tiny rewrite model (0.5B-1.5B).
Multi-Query RAG Hosting
UK dedicated GPUs with query-rewrite LLM and embedder running side by side.
Browse GPU Servers
See hybrid search and contextual retrieval.