
Multi-Query RAG on a Dedicated GPU

Generate multiple rewrites of the user's question, retrieve for each, fuse results. A cheap 5-15% recall lift over vanilla single-query RAG.

Multi-query RAG uses an LLM to generate multiple rephrasings of the user's question, runs retrieval for each, then fuses the results. On a dedicated GPU the cost is one extra LLM pass per query; the payoff is typically 5-15% higher recall on hard queries.


Pattern

  1. Receive user query
  2. LLM rewrites into 3-5 alternative phrasings
  3. Retrieve top k for each phrasing
  4. Fuse with Reciprocal Rank Fusion
  5. Pass fused results to the answering LLM
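The five steps above can be sketched end to end. This is a minimal outline, not a full implementation: `rewrite_fn` and `retriever` are hypothetical stand-ins for your rewrite-LLM call and your search backend.

```python
from collections import defaultdict

def multi_query_rag(query, rewrite_fn, retriever, k=20, top_n=10):
    """Steps 1-5: rewrite the query, retrieve per variant, RRF-fuse.

    rewrite_fn(query) -> list of alternative phrasings.
    retriever(query, limit) -> ranked list of document ids.
    """
    variants = [query] + rewrite_fn(query)  # keep the original query too
    scores = defaultdict(float)
    for q in variants:
        for rank, doc_id in enumerate(retriever(q, limit=k)):
            scores[doc_id] += 1 / (60 + rank)  # Reciprocal Rank Fusion
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents that show up across several variants accumulate score from each list, which is what pushes broadly relevant results to the top.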

Prompt

prompt = f"""
The user asked: {query}

Generate 4 alternative phrasings of this question that would retrieve relevant documents. Include different perspectives: as a keyword query, as a technical reformulation, as a related concept query, and as an inverse phrasing.

Return as a JSON array of strings.
"""

A small model (e.g. Llama 3 8B at INT8) handles this well. Response time: ~300-500 ms on an RTX 5080.
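Small models don't always return clean JSON: they sometimes wrap the array in markdown fences or produce malformed output. A defensive parse with a fallback to the original query keeps the pipeline from breaking. This is a sketch; `parse_variants` is a hypothetical helper, not part of any library.

```python
import json

def parse_variants(raw, original_query, n=4):
    """Parse the rewrite model's reply into a list of query strings."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip markdown fences and an optional language tag like "json"
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        variants = json.loads(text)
        if isinstance(variants, list) and all(isinstance(v, str) for v in variants):
            return variants[:n]
    except json.JSONDecodeError:
        pass
    return [original_query]  # fall back to plain single-query retrieval
```

On a parse failure the system degrades gracefully to vanilla RAG rather than erroring out.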

Fusion

This is the same Reciprocal Rank Fusion (RRF) pattern as in hybrid search: score each document by the sum of 1/(60 + rank) across all query variants in which it appears (60 is the conventional RRF smoothing constant).

from collections import defaultdict

scores = defaultdict(float)
for q in query_variants:
    hits = retriever.search(q, limit=20)
    for rank, hit in enumerate(hits):
        scores[hit.id] += 1 / (60 + rank)  # higher-ranked hits contribute more

# Top 10 documents by fused score
top = sorted(scores.items(), key=lambda x: -x[1])[:10]

Latency

Typical overhead per query:

  • Query rewrite LLM: ~400 ms
  • N parallel retrievals: ~50 ms total (batched)
  • Fusion: negligible
  • Total added: ~500 ms

For chat UIs this is usually acceptable. For strict sub-second budgets, skip multi-query or use a tiny rewrite model (0.5B-1.5B).
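The "~50 ms total" for N retrievals assumes the variant searches overlap rather than run back to back. One way to get that overlap, assuming your retriever's search call is thread-safe, is a thread pool; `search_fn` here is a hypothetical stand-in for that call.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_all(query_variants, search_fn, limit=20):
    """Run the N retrievals concurrently so their latency overlaps.

    With 4-5 variants, wall-clock cost is roughly one retrieval, not N.
    Results come back in the same order as query_variants.
    """
    with ThreadPoolExecutor(max_workers=len(query_variants)) as pool:
        futures = [pool.submit(search_fn, q, limit) for q in query_variants]
        return [f.result() for f in futures]
```

If your retriever exposes a native batch-search API, prefer that over threading.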

Multi-Query RAG Hosting

UK dedicated GPUs with query-rewrite LLM and embedder running side by side.

Browse GPU Servers

See hybrid search and contextual retrieval.
