
RTX 5060 Ti 16GB for RAG Pipeline

Run a full production RAG pipeline - embeddings, reranker and an 8B LLM - on a single RTX 5060 Ti 16GB, with concrete throughput and latency numbers.

A retrieval-augmented generation (RAG) pipeline is usually split across several services: an embedding model, a vector database, a reranker, and a generation LLM. On older hardware that meant two or three GPUs. The Blackwell RTX 5060 Ti 16GB compresses the full stack onto a single 180 W card, with enough VRAM headroom to keep embeddings, reranker and an FP8 8B LLM resident simultaneously. This guide walks through the components, the per-stage numbers, and the end-to-end latency you can expect from a UK-hosted Gigagpu node.


Pipeline stack

The reference configuration uses BGE-base-en-v1.5 for dense embeddings, BGE-reranker-base for cross-encoder reranking, Qdrant for vector storage, and Llama 3 8B Instruct quantised to FP8 as the generator. Every component runs on the same physical host, eliminating network hops between retrieval and generation. Qdrant sits on NVMe rather than VRAM, so the GPU is free for neural workloads only.

Component     Model                      Precision    Role
Embedder      BGE-base-en-v1.5 (110M)    FP16         Query + corpus vectors
Vector store  Qdrant 1.12                CPU/NVMe     ANN search
Reranker      BGE-reranker-base (278M)   FP16         Top-100 -> top-8
Generator     Llama 3 8B Instruct        FP8 (W8A8)   Grounded answer
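The query path through these four components can be sketched as a single function. This is a minimal sketch, not a real client library: `embed`, `ann_search`, `rerank` and `generate` are placeholder names standing in for HTTP calls to the respective services.

```python
from typing import Callable, List

def rag_answer(
    query: str,
    embed: Callable[[str], List[float]],
    ann_search: Callable[[List[float], int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """One RAG turn: dense retrieval, cross-encoder rerank, grounded generation."""
    qvec = embed(query)                      # BGE-base: query -> dense vector
    candidates = ann_search(qvec, 100)       # Qdrant: recall-oriented top-100
    context = rerank(query, candidates, 8)   # BGE-reranker: precision-oriented top-8
    return generate(query, context)          # Llama 3 8B: answer grounded in context
```

The two-stage retrieval (cheap ANN top-100, expensive cross-encoder top-8) is what keeps the reranker's cost bounded regardless of corpus size.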

VRAM and service layout

All three neural components fit comfortably in 16 GB with room for a 16k context KV cache on the generator. Peak steady-state utilisation sits at around 13.7 GB, leaving roughly 2 GB of headroom for bursty concurrency.

Service             Weights   Activations   KV / batch     VRAM
BGE-base embedder   0.22 GB   0.4 GB        -              0.7 GB
BGE-reranker-base   0.56 GB   0.6 GB        -              1.2 GB
Llama 3 8B FP8      8.1 GB    0.6 GB        2.6 GB (16k)   11.3 GB
Driver + CUDA ctx   -         -             -              ~0.5 GB
Total                                                      13.7 GB
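The budget can be sanity-checked with a few lines of arithmetic, using the per-row VRAM totals from the table above:

```python
# Per-service VRAM from the table above (row totals, in GB).
vram_gb = {
    "bge_base_embedder": 0.7,   # weights + activations; encoders hold no KV cache
    "bge_reranker_base": 1.2,
    "llama3_8b_fp8":     11.3,  # includes 2.6 GB KV for a 16k context
    "driver_cuda_ctx":   0.5,
}
total = sum(vram_gb.values())   # 13.7 GB
headroom = 16.0 - total         # ~2.3 GB spare on a 16 GB card
print(f"total {total:.1f} GB, headroom {headroom:.1f} GB")
```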

Per-stage throughput

Throughput numbers below are measured on a cold cache with warm CUDA graphs, using vLLM 0.6 for the generator and ONNX Runtime + TensorRT-LLM for the encoders. BGE-base sustains 10,000 texts/s at 256 tokens average length, and the reranker clears 3,200 query-document pairs/s.

Stage                   Batch       Throughput         Latency
BGE-base embed          256         10,000 texts/s     25 ms / batch
Qdrant ANN (HNSW)       1           ~2,000 queries/s   4 ms
BGE-reranker            100 pairs   3,200 pairs/s      31 ms
Llama 3 8B FP8 single   1           112 t/s            -
Llama 3 8B FP8 agg.     16          720 t/s            -
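The latency column follows directly from batch size divided by throughput; a quick cross-check against the table's figures:

```python
def batch_latency_ms(batch_size: int, items_per_s: float) -> float:
    """Latency implied by a stage's throughput: time to clear one batch."""
    return 1000.0 * batch_size / items_per_s

embed_ms  = batch_latency_ms(256, 10_000)  # ~25.6 ms, matching the 25 ms row
rerank_ms = batch_latency_ms(100, 3_200)   # ~31.3 ms, matching the 31 ms row
decode_ms = batch_latency_ms(300, 112)     # ~2.7 s to decode a 300-token answer
```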

End-to-end latency

For a typical RAG turn (200-token query, 8 retrieved chunks averaging 600 tokens, 300-token answer) the measured wall clock is just over 3 seconds, with streaming to first token in about 340 ms.

Query embed           :  12 ms
Qdrant top-100 ANN    :   4 ms
Reranker (100 pairs)  :  31 ms
Context assembly      :   5 ms
Prefill (5.2k tokens) : 290 ms  -> first token
Decode (300 tokens)   : 2680 ms (112 t/s)
-----------------------------------------
Total (streamed TTFT) :  342 ms
Total (full answer)   : 3022 ms

Capacity and scaling

With continuous batching in vLLM, one 5060 Ti 16GB holds roughly 12-16 concurrent RAG users at an aggregate 1-2 req/s while keeping p95 answer latency under 5 seconds. The embedder is essentially free in that envelope because encoder calls are two orders of magnitude faster than the generator. For ingestion-heavy workloads, the card can embed 36 million chunks per hour, which is enough to backfill a 100-million-document corpus overnight.
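The ingestion claim is straightforward arithmetic on the embedder's measured rate (assuming, for simplicity, one chunk per document):

```python
texts_per_s = 10_000                          # BGE-base sustained rate at batch 256
chunks_per_hour = texts_per_s * 3600          # 36,000,000 chunks/hour
corpus_chunks = 100_000_000                   # 100M-chunk backfill
hours_to_backfill = corpus_chunks / chunks_per_hour
print(f"{hours_to_backfill:.1f} h")           # under 3 hours
```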

Deployment recipe

See the detailed install guide for the exact docker-compose, but the sequence is:

docker run -d -p 6333:6333 qdrant/qdrant:v1.12.0   # CPU-only; Qdrant needs no GPU
docker run -d --gpus all -p 8001:7997 \
  michaelf34/infinity:latest v2 --model-id BAAI/bge-base-en-v1.5
docker run -d --gpus all -p 8002:7997 \
  michaelf34/infinity:latest v2 --model-id BAAI/bge-reranker-base
docker run -d --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.6.3 \
  --model neuralmagic/Meta-Llama-3-8B-Instruct-FP8 \
  --max-model-len 16384 --gpu-memory-utilization 0.72
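The vLLM container exposes an OpenAI-compatible /v1/chat/completions endpoint on port 8000. A minimal sketch of assembling a grounded request from the reranked chunks; the system-prompt wording and chunk numbering are illustrative, not prescribed by the stack:

```python
def build_rag_request(query: str, chunks: list, max_tokens: int = 300) -> dict:
    """Build an OpenAI-style chat payload with retrieved chunks as context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return {
        "model": "neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
        "max_tokens": max_tokens,
        "stream": True,  # stream to get the ~340 ms time-to-first-token
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context.\n\n" + context},
            {"role": "user", "content": query},
        ],
    }
```

POST this JSON to http://localhost:8000/v1/chat/completions with any OpenAI-compatible client.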

Ship production RAG on a single GPU

Embedder, reranker, vector DB and 8B LLM on one Blackwell card. UK dedicated hosting.

Order the RTX 5060 Ti 16GB

See also: 5060 Ti for SaaS RAG, RAG stack install guide, embedding throughput, reranker throughput, FP8 Llama deployment, vLLM setup.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
