A retrieval-augmented generation (RAG) pipeline is usually split across several services: an embedding model, a vector database, a reranker, and a generation LLM. On older hardware that meant two or three GPUs. The Blackwell RTX 5060 Ti 16GB compresses the full stack onto a single 180 W card, with enough VRAM headroom to keep embeddings, reranker and an FP8 8B LLM resident simultaneously. This guide walks through the components, the per-stage numbers, and the end-to-end latency you can expect from a UK-hosted Gigagpu node.
Contents
- Pipeline stack
- VRAM and service layout
- Per-stage throughput
- End-to-end latency
- Capacity and scaling
- Deployment recipe
Pipeline stack
The reference configuration uses BGE-base-en-v1.5 for dense embeddings, BGE-reranker-base for cross-encoder reranking, Qdrant for vector storage, and Llama 3 8B Instruct quantised to FP8 as the generator. Every component runs on the same physical host, eliminating network hops between retrieval and generation. Qdrant lives on NVMe rather than in VRAM, leaving the GPU entirely to the neural workloads.
| Component | Model | Precision | Role |
|---|---|---|---|
| Embedder | BGE-base-en-v1.5 (110M) | FP16 | Query + corpus vectors |
| Vector store | Qdrant 1.12 | CPU/NVMe | ANN search |
| Reranker | BGE-reranker-base (278M) | FP16 | Top-100 -> top-8 |
| Generator | Llama 3 8B Instruct | FP8 (W8A8) | Grounded answer |
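The four stages wire together with three HTTP calls per turn. A minimal orchestration sketch, assuming the services expose Infinity/vLLM-style JSON endpoints on the ports used later in the deployment recipe; the URLs, response field names, and the injected `search_fn` Qdrant wrapper are illustrative assumptions, not a fixed API:

```python
import json
from urllib.request import Request, urlopen

# Assumed local endpoints matching the deployment recipe's port mappings.
EMBED_URL  = "http://localhost:8001/embeddings"
RERANK_URL = "http://localhost:8002/rerank"
LLM_URL    = "http://localhost:8000/v1/chat/completions"

def post_json(url, payload):
    """Minimal JSON POST helper (no third-party dependencies)."""
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    return json.loads(urlopen(req).read())

def assemble_context(chunks, max_chars=20_000):
    """Concatenate reranked chunks into a context block, capped by size."""
    out, used = [], 0
    for c in chunks:
        if used + len(c) > max_chars:
            break
        out.append(c)
        used += len(c)
    return "\n\n".join(out)

def rag_turn(query, search_fn, top_k=8):
    """One RAG turn: embed -> ANN search -> rerank -> generate."""
    vec = post_json(EMBED_URL, {"model": "BAAI/bge-base-en-v1.5",
                                "input": [query]})["data"][0]["embedding"]
    candidates = search_fn(vec, limit=100)            # Qdrant top-100
    ranked = post_json(RERANK_URL, {"query": query,
                                    "documents": candidates})["results"]
    chunks = [candidates[r["index"]] for r in ranked[:top_k]]
    prompt = (f"Answer using only this context:\n{assemble_context(chunks)}"
              f"\n\nQ: {query}")
    resp = post_json(LLM_URL, {"model": "neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
                               "messages": [{"role": "user", "content": prompt}],
                               "max_tokens": 300})
    return resp["choices"][0]["message"]["content"]
```

Because everything is on one host, all three calls hit loopback, so the inter-stage network cost is effectively zero.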
VRAM and service layout
All three neural components fit comfortably in 16 GB, with room for a 16k-context KV cache on the generator. Peak steady-state utilisation sits at around 13.7 GB, leaving a couple of gigabytes of headroom for bursty concurrency.
| Service | Weights | Activations | KV / batch | VRAM |
|---|---|---|---|---|
| BGE-base embedder | 0.22 GB | 0.4 GB | – | 0.7 GB |
| BGE-reranker-base | 0.56 GB | 0.6 GB | – | 1.2 GB |
| Llama 3 8B FP8 | 8.1 GB | 0.6 GB | 2.6 GB (16k) | 11.3 GB |
| Driver + CUDA ctx | – | – | – | ~0.5 GB |
| Total | 8.9 GB | 1.6 GB | 2.6 GB | 13.7 GB |
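The 2.6 GB KV figure can be sanity-checked from Llama 3 8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128). A back-of-envelope calculation, assuming FP16 keys and values; the gap between the raw ~2 GiB and the budgeted 2.6 GB plausibly covers vLLM's paged-allocator overhead and a safety margin:

```python
layers, kv_heads, head_dim = 32, 8, 128  # Llama 3 8B (grouped-query attention)
bytes_per_value = 2                      # FP16 keys/values
ctx = 16_384

# K and V tensors across all layers, per token of context.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gib = per_token * ctx / 2**30
print(f"{per_token / 1024:.0f} KiB/token, {total_gib:.2f} GiB at 16k context")
# -> 128 KiB/token, 2.00 GiB at 16k context
```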
Per-stage throughput
Throughput numbers below are measured on a cold retrieval cache with warm CUDA graphs, using vLLM 0.6 for the generator and ONNX Runtime with the TensorRT execution provider for the encoders. BGE-base sustains roughly 10,000 texts/s at a 256-token average length, and the reranker clears 3,200 query-document pairs/s.
| Stage | Batch | Throughput | Latency |
|---|---|---|---|
| BGE-base embed | 256 | 10,000 texts/s | 25 ms / batch |
| Qdrant ANN (HNSW) | 1 | ~2,000 queries/s | 4 ms |
| BGE-reranker | 100 pairs | 3,200 pairs/s | 31 ms |
| Llama 3 8B FP8 single | 1 | 112 t/s | – |
| Llama 3 8B FP8 agg. | 16 | 720 t/s | – |
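The encoder rows are internally consistent: dividing each batch size by its per-batch latency reproduces the quoted rates (the headline figures are simply rounded down):

```python
embed_rate  = 256 / 0.025   # 256 texts per 25 ms batch
rerank_rate = 100 / 0.031   # 100 pairs per 31 ms batch
print(round(embed_rate), round(rerank_rate))
# -> 10240 3226
```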
End-to-end latency
For a typical RAG turn – a 200-token query, 8 retrieved chunks averaging 600 tokens, and a 300-token answer – the measured wall clock is around 3.0 seconds, with the first streamed token arriving in roughly 340 ms.
Query embed : 12 ms
Qdrant top-100 ANN : 4 ms
Reranker (100 pairs) : 31 ms
Context assembly : 5 ms
Prefill (5.2k tokens) : 290 ms -> first token
Decode (300 tokens) : 2680 ms (112 t/s)
-----------------------------------------
Total (streamed TTFT) : 342 ms
Total (full answer) : 3022 ms
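The breakdown reduces to a two-phase latency model: everything before the first token (retrieval plus prefill) versus the decode tail, which dominates at 300 output tokens. A sketch of that arithmetic (millisecond rounding differs from the table by ~1 ms):

```python
retrieval_ms = 12 + 4 + 31 + 5     # embed + ANN + rerank + context assembly
prefill_ms   = 290                 # 5.2k-token prompt
decode_ms    = 300 / 112 * 1000    # 300 tokens at 112 t/s

ttft_ms  = retrieval_ms + prefill_ms
total_ms = ttft_ms + decode_ms
print(f"TTFT {ttft_ms} ms, full answer {total_ms:.0f} ms")
# -> TTFT 342 ms, full answer 3021 ms
```

The practical takeaway: retrieval is under 2% of the wall clock, so optimisation effort belongs on the decode phase (batching, shorter answers, speculative decoding).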
Capacity and scaling
With continuous batching in vLLM, one 5060 Ti 16GB holds roughly 12-16 concurrent RAG sessions, each issuing a query every few seconds, while keeping p95 answer latency under 5 seconds. The embedder is essentially free in that envelope because encoder calls are two orders of magnitude faster than the generator. For ingestion-heavy workloads, the card can embed 36 million chunks per hour, which is enough to backfill a 100-million-document corpus overnight.
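Both capacity claims follow from simple arithmetic on the figures above: the aggregate decode rate caps sustained answers per second, and the embedder's headline rate sets ingestion throughput:

```python
agg_tokens_per_s = 720          # batched decode, from the throughput table
answer_tokens    = 300
answers_per_s    = agg_tokens_per_s / answer_tokens  # sustained answers/s

ingest_per_hour = 10_000 * 3600  # embedder texts/s -> chunks per hour
print(answers_per_s, ingest_per_hour)
# -> 2.4 36000000
```

At 2.4 answers/s and a ~3-second turn, roughly 12-16 in-flight sessions saturate the generator, which matches the stated envelope.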
Deployment recipe
See the detailed install guide for the exact docker-compose, but the sequence is:
docker run -d -p 6333:6333 qdrant/qdrant:v1.12.0
docker run -d --gpus all -p 8001:8000 \
  michaelf34/infinity:latest v2 \
  --model-id BAAI/bge-base-en-v1.5 --port 8000
docker run -d --gpus all -p 8002:8000 \
  michaelf34/infinity:latest v2 \
  --model-id BAAI/bge-reranker-base --port 8000
docker run -d --gpus all -p 8000:8000 \
vllm/vllm-openai:v0.6.3 \
--model neuralmagic/Meta-Llama-3-8B-Instruct-FP8 \
--max-model-len 16384 --gpu-memory-utilization 0.72
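Once the four containers are up, a dependency-free smoke test against the vLLM endpoint confirms the generator is answering. This is a sketch, not part of the official install guide; the URL and model name mirror the `docker run` line above:

```python
import json
from urllib.request import Request, urlopen

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def chat_payload(prompt, model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
                 max_tokens=64):
    """Build an OpenAI-compatible chat completion body for the vLLM server."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def smoke_test():
    """POST one prompt and return the generated text (requires the stack up)."""
    body = json.dumps(chat_payload("Reply with the single word: ready")).encode()
    req = Request(VLLM_URL, data=body,
                  headers={"Content-Type": "application/json"})
    return json.loads(urlopen(req).read())["choices"][0]["message"]["content"]
```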
Ship production RAG on a single GPU
Embedder, reranker, vector DB and 8B LLM on one Blackwell card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: 5060 Ti for SaaS RAG, RAG stack install guide, embedding throughput, reranker throughput, FP8 Llama deployment, vLLM setup.