
RAG Benchmark Update: April 2026

Updated April 2026 RAG pipeline benchmarks measuring end-to-end retrieval and generation performance across GPUs. Covers embedding speed, retrieval latency, and total pipeline throughput.

RAG Benchmark Overview

Retrieval-augmented generation pipelines involve three latency-sensitive stages: embedding the query, retrieving relevant documents, and generating the response. This April 2026 benchmark update measures each stage independently and as a combined pipeline on GigaGPU dedicated servers.

We used BGE-large-en for embeddings, Qdrant for vector retrieval over 1 million documents, and LLaMA 3.1 70B (Q4) via vLLM for generation. Results reflect a realistic production RAG configuration. For the interactive tool, see the tokens per second benchmark.
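Because the three stages run sequentially, per-stage timing is simple to instrument. A minimal sketch (the stage functions here are hypothetical stand-ins; in the benchmarked stack they would wrap BGE-large-en, a Qdrant client, and a vLLM endpoint):

```python
import time
from typing import Callable, List, Tuple

def run_rag_pipeline(
    query: str,
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Tuple[str, dict]:
    """Run embed -> retrieve -> generate, recording each stage's latency in ms."""
    timings = {}
    t0 = time.perf_counter()
    vector = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    docs = retrieve(vector)
    timings["retrieve_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    answer = generate(query, docs)
    timings["generate_ms"] = (time.perf_counter() - t0) * 1000
    return answer, timings

# Stub stages so the sketch runs without a GPU:
answer, timings = run_rag_pipeline(
    "What is retrieval-augmented generation?",
    embed=lambda q: [0.1, 0.2, 0.3],                     # real BGE vectors are much wider
    retrieve=lambda v: ["doc-1", "doc-2"],               # stand-in for a Qdrant top-k search
    generate=lambda q, docs: f"Answer based on {docs}",  # stand-in for a vLLM completion
)
print(sorted(timings))  # -> ['embed_ms', 'generate_ms', 'retrieve_ms']
```

Swapping the stubs for real model calls gives you the same per-stage breakdown reported in the tables below.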

Embedding Generation Speed

Query embedding latency (single query, BGE-large-en, 1024 dimensions):

| GPU | Single Query Embed | Batch of 32 Queries | Throughput (queries/sec) |
|---|---|---|---|
| RTX 5090 | 3.2 ms | 18 ms | 1,780 |
| RTX 5090 | 2.1 ms | 12 ms | 2,670 |
| RTX 3090 | 4.8 ms | 28 ms | 1,140 |
| RTX 6000 Pro | 3.8 ms | 22 ms | 1,455 |
| CPU (8-core) | 45 ms | 320 ms | 100 |

GPU embedding is roughly 10-25x faster than CPU embedding, depending on the card and batch size. For detailed GPU vs CPU numbers, see the embedding speed benchmark.
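The throughput column is simply the batch measurement inverted: a batch of 32 queries finishing in 12 ms sustains about 2,667 queries/sec, in line with the table's 2,670 (small differences come from measurement rounding). A quick check:

```python
def throughput_qps(batch_size: int, batch_latency_ms: float) -> float:
    """Sustained queries/sec implied by one batched measurement."""
    return batch_size / (batch_latency_ms / 1000.0)

print(round(throughput_qps(32, 12)))   # 12 ms batch -> 2667 queries/sec
print(round(throughput_qps(32, 320)))  # CPU, 320 ms -> 100 queries/sec
```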

Retrieval Latency Results

Vector search latency in Qdrant over 1 million documents, top-10 retrieval with metadata filtering:

| Concurrent Queries | P50 Latency | P99 Latency | QPS |
|---|---|---|---|
| 1 | 1.2 ms | 3.5 ms | 830 |
| 10 | 2.1 ms | 8.5 ms | 4,200 |
| 50 | 4.8 ms | 18.2 ms | 10,400 |

Retrieval is the fastest stage in the pipeline. Even at 50 concurrent queries, P99 stays under 20ms. The LLM generation stage dominates total latency.
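Qdrant reaches these latencies with an approximate HNSW index; conceptually, though, filtered top-k retrieval reduces to the brute-force sketch below (illustrative only: a linear scan over 1 million documents would not hit sub-2 ms):

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(
    query_vec: Vector,
    index: List[Tuple[str, Vector, dict]],   # (doc_id, vector, metadata)
    k: int = 10,
    metadata_filter: Optional[Callable[[dict], bool]] = None,
) -> List[Tuple[str, float]]:
    """Brute-force stand-in for a filtered vector search: filter, score, sort."""
    scored = [
        (doc_id, cosine(query_vec, vec))
        for doc_id, vec, meta in index
        if metadata_filter is None or metadata_filter(meta)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = [
    ("doc-a", [1.0, 0.0], {"lang": "en"}),
    ("doc-b", [0.0, 1.0], {"lang": "de"}),
    ("doc-c", [1.0, 1.0], {"lang": "en"}),
]
hits = top_k([1.0, 0.0], index, k=2, metadata_filter=lambda m: m["lang"] == "en")
print([doc_id for doc_id, _ in hits])  # -> ['doc-a', 'doc-c']
```

The metadata filter runs before scoring, which mirrors how payload filtering in the benchmark constrains the candidate set rather than post-filtering the results.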

End-to-End Pipeline Performance

Total time from user query to complete response (embed + retrieve + generate 256 tokens):

| GPU | Model | P50 Total Latency | P99 Total Latency | First Token |
|---|---|---|---|---|
| RTX 5090 | LLaMA 70B Q4 | 4.5 s | 6.8 s | 165 ms |
| RTX 5090 | LLaMA 70B Q4 | 3.2 s | 4.9 s | 125 ms |
| 2x RTX 5090 | LLaMA 70B FP16 | 3.3 s | 5.1 s | 110 ms |
| RTX 5090 | Gemma 2 27B FP16 | 2.9 s | 4.2 s | 95 ms |

Generation is 95%+ of total pipeline latency. Optimise the LLM stage first. For more pipeline configurations, see the RAG pipeline latency by GPU benchmark.
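That share is easy to verify from the tables above: subtracting a ~2 ms embed and a ~2 ms retrieve from a 3.2 s P50 total leaves generation with well over 99% of the budget. A worked check (figures taken from the fastest single-GPU 70B row):

```python
def generation_share(embed_ms: float, retrieve_ms: float, total_ms: float) -> float:
    """Fraction of end-to-end latency spent in the LLM generation stage."""
    return (total_ms - embed_ms - retrieve_ms) / total_ms

share = generation_share(embed_ms=2.1, retrieve_ms=2.1, total_ms=3200)
print(f"{share:.1%}")  # -> 99.9%
```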

GPU Recommendations for RAG

For sub-5-second RAG responses with a 70B model, an RTX 5090 meets the target for single-user workloads. For concurrent users, dual RTX 5090s keep latency manageable. For budget RAG with a 27B model, a single RTX 5090 delivers sub-3-second responses.

Co-locate all three components (embeddings, vector DB, LLM) on the same dedicated server to eliminate network hops. Use the vector database guide for database selection and the RAG frameworks guide for framework comparison.

Build Your RAG Pipeline on Dedicated Hardware

Embed, retrieve, and generate on one server. No external dependencies, no per-query fees, complete data privacy.

View GPU Servers

Optimisation Tips

Cache frequently requested embeddings to skip the embedding stage entirely. Use vLLM’s prefix caching for common system prompts. Keep your vector index in RAM rather than on disk. Pre-compute embeddings for your document corpus during off-peak hours. These optimisations can reduce effective pipeline latency by 30-40% for repeat queries.
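The embedding cache is the simplest of these to add. A sketch using functools.lru_cache keyed on the normalised query string (the _model_embed placeholder stands in for the real GPU embedding call):

```python
from functools import lru_cache

EMBED_CALLS = 0  # counts how often the model is actually invoked

def _model_embed(query: str) -> list:
    # Placeholder for the real embedding call (e.g. BGE-large-en on GPU).
    return [float(len(query))] * 4

@lru_cache(maxsize=10_000)
def _cached_embed(normalised_query: str) -> tuple:
    global EMBED_CALLS
    EMBED_CALLS += 1
    # Tuples are hashable, so cached results can themselves be cache keys later.
    return tuple(_model_embed(normalised_query))

def embed(query: str) -> tuple:
    """Normalise, then embed; repeat queries skip the model entirely."""
    return _cached_embed(query.strip().lower())

embed("What is RAG?")
embed("  what is RAG?")  # cache hit after normalisation
print(EMBED_CALLS)  # -> 1
```

Normalising before the cache lookup is what makes near-duplicate queries collapse onto one entry; how aggressively to normalise (case, whitespace, punctuation) is a product decision.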

For cost planning, see the RAG pipeline total cost breakdown. Compare with budget GPU options if cost is the primary constraint. The open-source LLM hosting section covers deployment guides for each model in the pipeline.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps networking, UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
