Most RAG architectures look the same on a whiteboard and very different in production. This page is the deployment-shaped reference architecture we recommend for self-hosted RAG on dedicated GPU hardware.
The short version: a single RTX 5090 32 GB hosts Llama 3.1 8B FP8, BGE-large embeddings, a BGE reranker, and Qdrant on the same box. At roughly £399/mo, this serves ~25 active users with sub-1s end-to-end latency.
Components and their VRAM
| Component | Tool | VRAM |
|---|---|---|
| Embeddings | BGE-large-en-v1.5 via TEI | ~1.5 GB |
| Reranker | BGE-reranker-v2 | ~1 GB |
| LLM | Llama 3.1 8B FP8 via vLLM | ~8 GB |
| Vector store | Qdrant (CPU + disk) | 0 GB GPU |
| Router | LiteLLM | 0 GB GPU |
| Reverse proxy | Caddy | 0 GB GPU |
| Total GPU VRAM | — | ~10.5 GB weights (plan for ~13 GB with vLLM KV cache) |
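One way to sanity-check the table: the GPU-resident components sum to about 10.5 GB of model weights, leaving most of the 5090's 32 GB for vLLM's KV cache and batching headroom. A minimal sketch, using the per-component estimates from the table above:

```python
# Approximate VRAM for the GPU-resident components (GB), from the table above.
vram_gb = {
    "bge-large-en-v1.5 (TEI)": 1.5,
    "bge-reranker-v2": 1.0,
    "llama-3.1-8b-fp8 (vLLM)": 8.0,
}

weights_total = sum(vram_gb.values())
headroom = 32.0 - weights_total  # RTX 5090 has 32 GB total

print(f"weights: ~{weights_total:.1f} GB, KV cache/batch headroom: ~{headroom:.1f} GB")
```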
Query flow
- User query arrives at Caddy (TLS) → LiteLLM (auth)
- LiteLLM hits TEI embeddings endpoint → query vector
- Qdrant top-50 retrieval by cosine similarity
- Reranker scores 50 query-doc pairs → top-5
- vLLM receives prompt = system + retrieved chunks + user query
- Streamed response back through LiteLLM to client
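The retrieve → rerank → generate glue is mostly plain Python. Here is a hedged sketch of the two pure steps in the flow above: selecting the top-5 chunks from reranker scores and assembling the vLLM prompt. Function names and the prompt layout are illustrative, not a fixed API.

```python
def top_k_chunks(chunks, scores, k=5):
    """Keep the k highest-scoring chunks from the reranker (step 4 above)."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(system, chunks, user_query):
    """Assemble the generation request: system + retrieved chunks + user query (step 5)."""
    context = "\n\n".join(f"[doc {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": user_query},
    ]
```

In production these sit inside a LiteLLM pre-call hook or a thin API layer; the embedding and generation steps around them are plain HTTP calls to TEI's `/embed` endpoint and vLLM's OpenAI-compatible `/v1/chat/completions`.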
End-to-end latency on a 5090: ~750 ms median, ~1.5 s p99.
Hardware sizing by team size
| Team size | Recommended hardware | Notes |
|---|---|---|
| Small (5-10 users) | RTX 5060 Ti 16 GB | Tight but works |
| Medium (10-30) | RTX 5090 32 GB | Sweet spot |
| Large (30-100) | RTX 6000 Pro 96 GB | Multi-model headroom |
| XL (100+) | 2× 5090 cluster + dedicated embeddings host | Split workloads |
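The tiers above can be codified as a simple capacity lookup, e.g. for provisioning automation. The thresholds are the active-user counts from the table; the function name is illustrative:

```python
def recommend_gpu(active_users: int) -> str:
    """Map an active-user count to the hardware tier from the table above."""
    if active_users <= 10:
        return "RTX 5060 Ti 16 GB"
    if active_users <= 30:
        return "RTX 5090 32 GB"
    if active_users <= 100:
        return "RTX 6000 Pro 96 GB"
    return "2x RTX 5090 cluster + dedicated embeddings host"
```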
Verdict
For self-hosted RAG, single-card RTX 5090 + LiteLLM + Qdrant is the cleanest reference architecture. Above ~30 active users, split embeddings to a separate cheaper card.
Bottom line
Pre-built RAG stacks ship in days, not months. See "build a production AI inference server" for the broader infra picture.