
Self-Hosted RAG Architecture: A Reference Implementation on Dedicated GPUs

End-to-end reference architecture for a production RAG stack on dedicated GPU hardware — vector store, embeddings, reranker, LLM, and the glue between.

Most RAG architectures look the same on a whiteboard and very different in production. This page is the deployment-shaped reference architecture we recommend for self-hosted RAG on dedicated GPU hardware.

TL;DR

For a small-team RAG stack: a single RTX 5090 (32 GB) hosting Llama 3.1 8B FP8, BGE-large embeddings, BGE-reranker, and Qdrant on the same box. At ~£399/mo it handles ~25 active users with sub-1s end-to-end latency.

Components and their VRAM

| Component | Tool | VRAM |
| --- | --- | --- |
| Embeddings | BGE-large-en-v1.5 via TEI | ~1.5 GB |
| Reranker | BGE-reranker-v2 | ~1 GB |
| LLM | Llama 3.1 8B FP8 via vLLM | ~8 GB |
| Vector store | Qdrant (CPU + disk) | 0 GB GPU |
| Router | LiteLLM | 0 GB GPU |
| Reverse proxy | Caddy | 0 GB GPU |
| Total GPU VRAM | | ~13 GB |
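As a rough sanity check, the per-component figures can be totted up against a 32 GB card to see what's left for vLLM's KV cache. Note the component footprints alone sum to ~10.5 GB; the gap to the ~13 GB total presumably covers CUDA context and runtime overhead — that overhead figure below is our assumption, not a measured number:

```python
# Rough VRAM budget for the single-card stack described above.
# Component figures come from the table; the overhead line is an
# assumption covering CUDA context, buffers, and allocator slack.
GPU_TOTAL_GB = 32.0  # RTX 5090

components_gb = {
    "embeddings (BGE-large via TEI)": 1.5,
    "reranker (BGE-reranker-v2)": 1.0,
    "llm weights (Llama 3.1 8B FP8)": 8.0,
}
runtime_overhead_gb = 2.5  # assumed, not measured

used = sum(components_gb.values()) + runtime_overhead_gb
kv_cache_headroom = GPU_TOTAL_GB - used

print(f"Reserved: ~{used:.1f} GB, KV-cache headroom: ~{kv_cache_headroom:.1f} GB")
```

The ~19 GB of headroom is what lets vLLM batch many concurrent users' KV caches on one card — it's the real reason the 32 GB tier is the sweet spot rather than the 16 GB one.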

Query flow

  1. User query arrives at Caddy (TLS) → LiteLLM (auth)
  2. LiteLLM hits TEI embeddings endpoint → query vector
  3. Qdrant top-50 retrieval by cosine similarity
  4. Reranker scores 50 query-doc pairs → top-5
  5. vLLM receives prompt = system + retrieved chunks + user query
  6. Streamed response back through LiteLLM to client

End-to-end latency on a 5090: ~750 ms median, ~1.5 s p99.
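Steps 2-6 above can be sketched as a single Python function (step 1 — Caddy TLS termination and LiteLLM auth — sits in front of this and is omitted). Everything here is illustrative: the ports, the collection name `docs`, the payload field `text`, and the model name are assumptions that must match your actual TEI / Qdrant / vLLM deployment; the HTTP shapes follow those services' public APIs.

```python
import json
from urllib import request

TEI_EMBED = "http://localhost:8081"   # assumed TEI embeddings port
TEI_RERANK = "http://localhost:8082"  # assumed TEI reranker port
QDRANT = "http://localhost:6333"
VLLM = "http://localhost:8000"        # vLLM OpenAI-compatible server


def post_json(url: str, payload: dict):
    """Minimal stdlib HTTP POST helper (no external dependencies)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


def build_prompt(system: str, chunks: list[str], query: str) -> list[dict]:
    """Assemble chat messages: system prompt + retrieved context + user query."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": query},
    ]


def answer(query: str, system: str = "Answer using only the context.") -> str:
    # Step 2: embed the query via TEI.
    vec = post_json(f"{TEI_EMBED}/embed", {"inputs": query})[0]
    # Step 3: top-50 cosine retrieval from Qdrant.
    hits = post_json(
        f"{QDRANT}/collections/docs/points/search",
        {"vector": vec, "limit": 50, "with_payload": True},
    )["result"]
    texts = [h["payload"]["text"] for h in hits]
    # Step 4: rerank the 50 query-doc pairs, keep the top 5.
    scores = post_json(f"{TEI_RERANK}/rerank", {"query": query, "texts": texts})
    top5 = [texts[s["index"]] for s in sorted(scores, key=lambda s: -s["score"])[:5]]
    # Steps 5-6: generate via vLLM's OpenAI-compatible chat endpoint.
    resp = post_json(
        f"{VLLM}/v1/chat/completions",
        {"model": "llama-3.1-8b", "messages": build_prompt(system, top5, query)},
    )
    return resp["choices"][0]["message"]["content"]
```

In production you would stream the final call rather than block on it, and route it through LiteLLM instead of hitting vLLM directly — the sketch keeps the hops visible.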

Hardware sizing by team size

| Team size | Recommended hardware | Notes |
| --- | --- | --- |
| Small (5-10 users) | RTX 5060 Ti 16 GB | Tight but works |
| Medium (10-30) | RTX 5090 32 GB | Sweet spot |
| Large (30-100) | RTX 6000 Pro 96 GB | Multi-model headroom |
| XL (100+) | 2× 5090 cluster + dedicated embeddings host | Split workloads |
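If you want the tiers above in capacity-planning scripts, they reduce to a small lookup. This is purely illustrative — the breakpoints are the ones from the table, not independently benchmarked thresholds:

```python
def recommend_hardware(active_users: int) -> str:
    """Map an active-user count to the hardware tier from the sizing table."""
    if active_users <= 10:
        return "RTX 5060 Ti 16 GB (tight but works)"
    if active_users <= 30:
        return "RTX 5090 32 GB (sweet spot)"
    if active_users <= 100:
        return "RTX 6000 Pro 96 GB (multi-model headroom)"
    return "2x RTX 5090 cluster + dedicated embeddings host"


print(recommend_hardware(25))  # the ~25-user TL;DR case lands on the 5090
```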

Verdict

For self-hosted RAG, single-card RTX 5090 + LiteLLM + Qdrant is the cleanest reference architecture. Above ~30 active users, split embeddings to a separate cheaper card.

Bottom line

Pre-built RAG stacks ship in days, not months. See our guide on building a production AI inference server for the broader infrastructure picture.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
