
RAG Benchmark Update: April 2026

Updated April 2026 RAG pipeline benchmarks measuring end-to-end retrieval and generation performance across GPUs. Covers embedding speed, retrieval latency, and total pipeline throughput.

RAG Benchmark Overview

Retrieval-augmented generation pipelines involve three latency-sensitive stages: embedding the query, retrieving relevant documents, and generating the response. This April 2026 benchmark update measures each stage independently and as a combined pipeline on GigaGPU dedicated servers.

We used BGE-large-en for embeddings, Qdrant for vector retrieval over 1 million documents, and LLaMA 3.1 70B (Q4) via vLLM for generation. Results reflect a realistic production RAG configuration. For the interactive tool, see the tokens per second benchmark.
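Because the three stages run sequentially, per-stage timing is simple to instrument. A minimal sketch (the stage functions here are hypothetical stand-ins; in the benchmarked stack they would wrap BGE-large-en, a Qdrant client, and a vLLM endpoint):

```python
import time
from typing import Callable, List, Tuple

def run_rag_pipeline(
    query: str,
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Tuple[str, dict]:
    """Run embed -> retrieve -> generate, recording each stage's latency in ms."""
    timings = {}
    t0 = time.perf_counter()
    vector = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    docs = retrieve(vector)
    timings["retrieve_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    answer = generate(query, docs)
    timings["generate_ms"] = (time.perf_counter() - t0) * 1000
    return answer, timings

# Stub stages so the sketch runs without a GPU:
answer, timings = run_rag_pipeline(
    "What is retrieval-augmented generation?",
    embed=lambda q: [0.1, 0.2, 0.3],                     # real BGE vectors are much wider
    retrieve=lambda v: ["doc-1", "doc-2"],               # stand-in for a Qdrant top-k search
    generate=lambda q, docs: f"Answer based on {docs}",  # stand-in for a vLLM completion
)
print(sorted(timings))  # -> ['embed_ms', 'generate_ms', 'retrieve_ms']
```

Swapping the stubs for real model calls gives you the same per-stage breakdown reported in the tables below.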

Embedding Generation Speed

Query embedding latency (single query, BGE-large-en, 1024 dimensions):

| GPU | Single Query Embed | Batch of 32 Queries | Throughput (queries/sec) |
|---|---|---|---|
| RTX 5090 | 3.2 ms | 18 ms | 1,780 |
| RTX 5090 | 2.1 ms | 12 ms | 2,670 |
| RTX 3090 | 4.8 ms | 28 ms | 1,140 |
| RTX 6000 Pro | 3.8 ms | 22 ms | 1,455 |
| CPU (8-core) | 45 ms | 320 ms | 100 |

GPU embedding is roughly 10-25x faster than CPU embedding, depending on the card and batch size. For detailed GPU vs CPU numbers, see the embedding speed benchmark.
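The throughput column is simply the batch measurement inverted: a batch of 32 queries finishing in 12 ms sustains about 2,667 queries/sec, in line with the table's 2,670 (small differences come from measurement rounding). A quick check:

```python
def throughput_qps(batch_size: int, batch_latency_ms: float) -> float:
    """Sustained queries/sec implied by one batched measurement."""
    return batch_size / (batch_latency_ms / 1000.0)

print(round(throughput_qps(32, 12)))   # 12 ms batch -> 2667 queries/sec
print(round(throughput_qps(32, 320)))  # CPU, 320 ms -> 100 queries/sec
```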

Retrieval Latency Results

Vector search latency in Qdrant over 1 million documents, top-10 retrieval with metadata filtering:

| Concurrent Queries | P50 Latency | P99 Latency | QPS |
|---|---|---|---|
| 1 | 1.2 ms | 3.5 ms | 830 |
| 10 | 2.1 ms | 8.5 ms | 4,200 |
| 50 | 4.8 ms | 18.2 ms | 10,400 |

Retrieval is the fastest stage in the pipeline. Even at 50 concurrent queries, P99 stays under 20ms. The LLM generation stage dominates total latency.
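Qdrant reaches these latencies with an approximate HNSW index; conceptually, though, filtered top-k retrieval reduces to the brute-force sketch below (illustrative only: a linear scan over 1 million documents would not hit sub-2 ms):

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(
    query_vec: Vector,
    index: List[Tuple[str, Vector, dict]],   # (doc_id, vector, metadata)
    k: int = 10,
    metadata_filter: Optional[Callable[[dict], bool]] = None,
) -> List[Tuple[str, float]]:
    """Brute-force stand-in for a filtered vector search: filter, score, sort."""
    scored = [
        (doc_id, cosine(query_vec, vec))
        for doc_id, vec, meta in index
        if metadata_filter is None or metadata_filter(meta)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = [
    ("doc-a", [1.0, 0.0], {"lang": "en"}),
    ("doc-b", [0.0, 1.0], {"lang": "de"}),
    ("doc-c", [1.0, 1.0], {"lang": "en"}),
]
hits = top_k([1.0, 0.0], index, k=2, metadata_filter=lambda m: m["lang"] == "en")
print([doc_id for doc_id, _ in hits])  # -> ['doc-a', 'doc-c']
```

The metadata filter runs before scoring, which mirrors how payload filtering in the benchmark constrains the candidate set rather than post-filtering the results.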

End-to-End Pipeline Performance

Total time from user query to complete response (embed + retrieve + generate 256 tokens):

| GPU | Model | P50 Total Latency | P99 Total Latency | First Token |
|---|---|---|---|---|
| RTX 5090 | LLaMA 70B Q4 | 4.5 s | 6.8 s | 165 ms |
| RTX 5090 | LLaMA 70B Q4 | 3.2 s | 4.9 s | 125 ms |
| 2x RTX 5090 | LLaMA 70B FP16 | 3.3 s | 5.1 s | 110 ms |
| RTX 5090 | Gemma 2 27B FP16 | 2.9 s | 4.2 s | 95 ms |

Generation is 95%+ of total pipeline latency. Optimise the LLM stage first. For more pipeline configurations, see the RAG pipeline latency by GPU benchmark.
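That share is easy to verify from the tables above: subtracting a ~2 ms embed and a ~2 ms retrieve from a 3.2 s P50 total leaves generation with well over 99% of the budget. A worked check (figures taken from the fastest single-GPU 70B row):

```python
def generation_share(embed_ms: float, retrieve_ms: float, total_ms: float) -> float:
    """Fraction of end-to-end latency spent in the LLM generation stage."""
    return (total_ms - embed_ms - retrieve_ms) / total_ms

share = generation_share(embed_ms=2.1, retrieve_ms=2.1, total_ms=3200)
print(f"{share:.1%}")  # -> 99.9%
```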

GPU Recommendations for RAG

For sub-5-second RAG responses with a 70B model, an RTX 5090 meets the target for single-user workloads. For concurrent users, dual RTX 5090s keep latency manageable. For budget RAG with a 27B model, a single RTX 5090 delivers sub-3-second responses.

Co-locate all three components (embeddings, vector DB, LLM) on the same dedicated server to eliminate network hops. Use the vector database guide for database selection and the RAG frameworks guide for framework comparison.

Build Your RAG Pipeline on Dedicated Hardware

Embed, retrieve, and generate on one server. No external dependencies, no per-query fees, complete data privacy.

View GPU Servers

Optimisation Tips

Cache frequently requested embeddings to skip the embedding stage entirely. Use vLLM’s prefix caching for common system prompts. Keep your vector index in RAM rather than on disk. Pre-compute embeddings for your document corpus during off-peak hours. These optimisations can reduce effective pipeline latency by 30-40% for repeat queries.
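The embedding cache is the simplest of these to add. A sketch using functools.lru_cache keyed on the normalised query string (the _model_embed placeholder stands in for the real GPU embedding call):

```python
from functools import lru_cache

EMBED_CALLS = 0  # counts how often the model is actually invoked

def _model_embed(query: str) -> list:
    # Placeholder for the real embedding call (e.g. BGE-large-en on GPU).
    return [float(len(query))] * 4

@lru_cache(maxsize=10_000)
def _cached_embed(normalised_query: str) -> tuple:
    global EMBED_CALLS
    EMBED_CALLS += 1
    # Tuples are hashable, so cached results can themselves be cache keys later.
    return tuple(_model_embed(normalised_query))

def embed(query: str) -> tuple:
    """Normalise, then embed; repeat queries skip the model entirely."""
    return _cached_embed(query.strip().lower())

embed("What is RAG?")
embed("  what is RAG?")  # cache hit after normalisation
print(EMBED_CALLS)  # -> 1
```

Normalising before the cache lookup is what makes near-duplicate queries collapse onto one entry; how aggressively to normalise (case, whitespace, punctuation) is a product decision.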

For cost planning, see the RAG pipeline total cost breakdown. Compare with budget GPU options if cost is the primary constraint. The open-source LLM hosting section covers deployment guides for each model in the pipeline.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps networking, UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
