RAG Benchmark Overview
Retrieval-augmented generation pipelines involve three latency-sensitive stages: embedding the query, retrieving relevant documents, and generating the response. This April 2026 benchmark update measures each stage independently and as a combined pipeline on GigaGPU dedicated servers.
We used BGE-large-en for embeddings, Qdrant for vector retrieval over 1 million documents, and LLaMA 3.1 70B (Q4) via vLLM for generation. Results reflect a realistic production RAG configuration. For the interactive tool, see the tokens per second benchmark.
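Per-stage timing like this is straightforward to reproduce with a thin wrapper around the three stage calls. A minimal sketch using only the standard library — the `embed`, `retrieve`, and `generate` callables are placeholders you would back with your actual BGE, Qdrant, and vLLM clients, not real library APIs:

```python
import time
from typing import Callable, Dict, List


def time_pipeline(
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    generate: Callable[[str, List[str]], str],
    query: str,
) -> Dict[str, float]:
    """Run one RAG query and report per-stage wall-clock time in ms."""
    timings: Dict[str, float] = {}

    t0 = time.perf_counter()
    vector = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    docs = retrieve(vector)
    timings["retrieve_ms"] = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    generate(query, docs)
    timings["generate_ms"] = (time.perf_counter() - t2) * 1000

    timings["total_ms"] = sum(timings.values())
    return timings
```

Averaging these timings over a few hundred representative queries gives the same per-stage breakdown reported in the tables below.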
Embedding Generation Speed
Query embedding latency (single query, BGE-large-en, 768 dimensions):
| GPU | Single Query Embed | Batch of 32 Queries | Throughput (queries/sec) |
|---|---|---|---|
| RTX 4090 | 3.2 ms | 18 ms | 1,780 |
| RTX 5090 | 2.1 ms | 12 ms | 2,670 |
| RTX 3090 | 4.8 ms | 28 ms | 1,140 |
| RTX 6000 Pro | 3.8 ms | 22 ms | 1,455 |
| CPU (8-core) | 45 ms | 320 ms | 100 |
GPU embedding is roughly 10-20x faster than CPU, depending on the GPU. For detailed GPU vs CPU numbers, see the embedding speed benchmark.
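The throughput column follows directly from the batch latency: queries per second is batch size divided by batch latency. A one-line check (table values are rounded to the nearest ten):

```python
def throughput_qps(batch_size: int, batch_latency_ms: float) -> int:
    """Queries per second achievable when embedding in fixed-size batches."""
    return round(batch_size / (batch_latency_ms / 1000.0))


# Reproduce the table's throughput column from its batch-latency column:
# 32 queries in 18 ms -> ~1,778 QPS (the table rounds this to 1,780)
# 32 queries in 320 ms -> 100 QPS (the CPU row)
```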
Retrieval Latency Results
Vector search latency in Qdrant over 1 million documents, top-10 retrieval with metadata filtering:
| Concurrent Queries | P50 Latency | P99 Latency | QPS |
|---|---|---|---|
| 1 | 1.2 ms | 3.5 ms | 830 |
| 10 | 2.1 ms | 8.5 ms | 4,200 |
| 50 | 4.8 ms | 18.2 ms | 10,400 |
Retrieval is the fastest stage in the pipeline. Even at 50 concurrent queries, P99 stays under 20ms. The LLM generation stage dominates total latency.
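P50/P99 figures like these are easy to check against your own deployment by recording per-query latencies and computing percentiles. A minimal sketch using only the standard library:

```python
import statistics
from typing import List, Tuple


def latency_percentiles(samples_ms: List[float]) -> Tuple[float, float]:
    """Return (P50, P99) from a list of per-query latency samples, in ms."""
    # quantiles(n=100) returns the 99 percentile cut points;
    # index 49 is the 50th percentile, index 98 is the 99th.
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return qs[49], qs[98]
```

P99 is the number to watch under load: median latency can look healthy while the slowest 1% of queries blow the latency budget.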
End-to-End Pipeline Performance
Total time from user query to complete response (embed + retrieve + generate 256 tokens):
| GPU | Model | P50 Total Latency | P99 Total Latency | First Token |
|---|---|---|---|---|
| RTX 4090 | LLaMA 70B Q4 | 4.5 s | 6.8 s | 165 ms |
| RTX 5090 | LLaMA 70B Q4 | 3.2 s | 4.9 s | 125 ms |
| 2x RTX 5090 | LLaMA 70B FP16 | 3.3 s | 5.1 s | 110 ms |
| RTX 5090 | Gemma 2 27B FP16 | 2.9 s | 4.2 s | 95 ms |
Generation is 95%+ of total pipeline latency. Optimise the LLM stage first. For more pipeline configurations, see the RAG pipeline latency by GPU benchmark.
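The 95%+ claim follows directly from the stage numbers above: a few milliseconds of embedding and retrieval against seconds of generation. As arithmetic, using representative P50 figures from the tables (a sketch, not measured code):

```python
from typing import Dict


def stage_shares(embed_ms: float, retrieve_ms: float, generate_ms: float) -> Dict[str, float]:
    """Fraction of end-to-end latency spent in each pipeline stage."""
    total = embed_ms + retrieve_ms + generate_ms
    return {
        "embed": embed_ms / total,
        "retrieve": retrieve_ms / total,
        "generate": generate_ms / total,
    }


# ~2 ms embed + ~2 ms retrieve + ~3.2 s of generation at P50:
shares = stage_shares(2.1, 2.1, 3200.0)
```

With these inputs the generation share works out to over 99%, so even halving embedding and retrieval latency would be invisible to end users.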
GPU Recommendations for RAG
For sub-5-second RAG responses with a 70B model, an RTX 5090 meets the target for single-user workloads. For concurrent users, dual RTX 5090s keep latency manageable. For budget RAG with a 27B model, a single RTX 5090 delivers sub-3-second responses.
Co-locate all three components (embeddings, vector DB, LLM) on the same dedicated server to eliminate network hops. Use the vector database guide for database selection and the RAG frameworks guide for framework comparison.
Optimisation Tips
Cache frequently requested embeddings to skip the embedding stage entirely. Use vLLM’s prefix caching for common system prompts. Keep your vector index in RAM rather than on disk. Pre-compute embeddings for your document corpus during off-peak hours. These optimisations can reduce effective pipeline latency by 30-40% for repeat queries.
For cost planning, see the RAG pipeline total cost breakdown. Compare with budget GPU options if cost is the primary constraint. The open-source LLM hosting section covers deployment guides for each model in the pipeline.