Benchmark Overview
A RAG pipeline involves four stages: query embedding, vector retrieval, optional reranking, and LLM generation. Most latency discussions focus only on generation speed, but the full pipeline includes overhead from every stage. We benchmarked end-to-end RAG latency on dedicated GPU hosting across an RTX 5090 and three RTX 6000 Pro configurations (see the table below) to identify where time is actually spent.
Test Configuration
Pipeline: BGE-Large embedding (GPU), Qdrant vector search (CPU, 1M documents), BGE-Reranker-Large (GPU), Llama 3 70B INT4 generation via vLLM. Query returns top-10 chunks, reranks to top-3, generates a 256-token response. All components co-located on one server. See RAG hosting for deployment patterns.
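The four-stage flow above can be sketched as a timing harness. The stage functions below are stubs standing in for the real components (BGE-Large, Qdrant, BGE-Reranker-Large, vLLM); only the orchestration and per-stage timing logic are the point, not the model calls.

```python
import time

# Stub stage functions -- placeholders for the real components
# (BGE-Large embedding, Qdrant search, BGE-Reranker-Large, vLLM).
def embed_query(query: str) -> list[float]:
    return [0.0] * 1024  # BGE-Large produces 1024-dim vectors

def retrieve(vector: list[float], k: int = 10) -> list[str]:
    return [f"chunk-{i}" for i in range(k)]  # top-k over the collection

def rerank(query: str, chunks: list[str], top: int = 3) -> list[str]:
    return chunks[:top]  # a cross-encoder would reorder before slicing

def generate(query: str, context: list[str], max_tokens: int = 256) -> str:
    return "generated response"

def rag_pipeline(query: str) -> tuple[str, dict[str, float]]:
    """Run all four stages, recording wall-clock time (ms) per stage."""
    timings: dict[str, float] = {}

    def timed(name, fn, *args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        timings[name] = (time.perf_counter() - t0) * 1000
        return result

    vec = timed("embed", embed_query, query)
    chunks = timed("retrieve", retrieve, vec, k=10)
    top3 = timed("rerank", rerank, query, chunks, top=3)
    answer = timed("generate", generate, query, top3, max_tokens=256)
    return answer, timings
```

The same per-stage timing dictionary is how the breakdown in the next section was measured.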
End-to-End Latency Breakdown (Single Request)
| Stage | RTX 5090 | RTX 6000 Pro | RTX 6000 Pro 96 GB | RTX 6000 Pro |
|---|---|---|---|---|
| Query Embedding | 8ms | 10ms | 6ms | 4ms |
| Vector Retrieval (Qdrant) | 12ms | 12ms | 12ms | 12ms |
| Reranking (top-10 to top-3) | 35ms | 42ms | 28ms | 18ms |
| LLM Generation (256 tokens) | 4,200ms | 5,100ms | 3,800ms | 2,400ms |
| Total End-to-End | 4,255ms | 5,164ms | 3,846ms | 2,434ms |
| First Token Latency | 180ms | 220ms | 145ms | 95ms |
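As a sanity check, the per-stage shares can be recomputed from the table (values copied from the fastest column; stage names are ours):

```python
# Per-stage latencies in ms, from the fastest column of the table above.
stages = {"embed": 4, "retrieve": 12, "rerank": 18, "generate": 2400}

total = sum(stages.values())
shares = {name: 100 * ms / total for name, ms in stages.items()}

print(total)                         # 2434, matching the table's total row
print(round(shares["generate"], 1))  # 98.6 -- generation dominates
```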
Stage-by-Stage Analysis
LLM generation dominates at roughly 99% of total latency (98.6-98.8% across the four GPUs). Embedding and retrieval combined account for under 25ms regardless of GPU. Reranking adds 18-42ms depending on GPU speed. This means optimising generation speed has the highest impact on perceived RAG performance: a faster GPU saves seconds, while a faster vector store saves milliseconds. Check token speed benchmarks for generation throughput data.
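A back-of-envelope decode speed also falls out of the table: tokens after the first, divided by generation time minus first-token latency. This assumes the first-token latency is included in the 256-token generation figure.

```python
# Decode speed implied by the table (assumes TTFT is part of the
# 256-token generation time, so 255 tokens are decoded after it).
def decode_tokens_per_sec(gen_ms: float, ttft_ms: float, tokens: int = 256) -> float:
    return (tokens - 1) / ((gen_ms - ttft_ms) / 1000)

print(round(decode_tokens_per_sec(4200, 180)))  # RTX 5090 column: ~63 tok/s
print(round(decode_tokens_per_sec(2400, 95)))   # fastest column: ~111 tok/s
```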
However, at high concurrency (50+ users), vector retrieval can become the bottleneck if the database is undersized. Use ChromaDB for smaller collections and Qdrant for production-scale datasets. See the GPU selection guide for sizing recommendations.
Concurrency Impact
At 10 concurrent RAG requests, per-request latency increases 30-50% due to GPU contention during generation. At 50 concurrent requests, latency doubles. The non-GPU stages (retrieval) remain constant. Deploying embedding and generation on separate GPUs eliminates contention and keeps latency near single-request levels. See benchmarks for multi-GPU RAG scaling data.
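The measured contention can be captured in a rough model: GPU-bound stages scale with concurrency (~1.4x at 10 requests, ~2x at 50, per the figures above), while CPU-side retrieval stays flat. The interpolation shape between those points is assumed for illustration.

```python
# Illustrative contention model: GPU stages inflate with concurrency,
# retrieval stays constant. Anchor points (~1.4x at 10, ~2x at 50) are
# from the measurements above; the linear interpolation is assumed.
def latency_under_load(gpu_ms: float, retrieval_ms: float, concurrency: int) -> float:
    if concurrency <= 1:
        factor = 1.0
    elif concurrency <= 10:
        factor = 1.0 + 0.4 * (concurrency - 1) / 9            # up to ~1.4x at 10
    else:
        factor = 1.4 + 0.6 * (min(concurrency, 50) - 10) / 40  # ~2x at 50
    return retrieval_ms + gpu_ms * factor

# RTX 5090 column: 4,243ms of GPU stages (embed + rerank + generate)
# plus 12ms of CPU-side retrieval.
print(round(latency_under_load(4243, 12, 1)))   # 4255 -- single request
print(round(latency_under_load(4243, 12, 50)))  # 8498 -- roughly doubled
```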
Recommendations
Invest in GPU speed for the generation stage. An RTX 6000 Pro cuts total RAG latency by 43% compared to an RTX 5090. For cost-sensitive deployments, the RTX 6000 Pro offers the best latency-to-cost ratio. Deploy your RAG pipeline on GigaGPU dedicated servers with private AI hosting for data security. Explore LLM hosting for backend optimisation guides.