Most RAG architectures look the same on a whiteboard and very different in production. This page is the deployment-shaped reference architecture we recommend for self-hosted RAG on dedicated GPU hardware.
The short version: a single RTX 5090 32 GB hosts Llama 3.1 8B FP8, BGE-large embeddings, a BGE reranker, and Qdrant on the same box. At roughly £399/mo, this serves ~25 active users with sub-1s end-to-end latency.
Components and their VRAM
| Component | Tool | VRAM |
|---|---|---|
| Embeddings | BGE-large-en-v1.5 via TEI | ~1.5 GB |
| Reranker | BGE-reranker-v2 | ~1 GB |
| LLM | Llama 3.1 8B FP8 via vLLM | ~8 GB |
| Vector store | Qdrant (CPU + disk) | 0 GB GPU |
| Router | LiteLLM | 0 GB GPU |
| Reverse proxy | Caddy | 0 GB GPU |
| Total GPU VRAM | — | ~10.5 GB weights (plan for ~13 GB with vLLM KV cache) |
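One way to sanity-check the table: the GPU-resident components sum to about 10.5 GB of model weights, leaving most of the 5090's 32 GB for vLLM's KV cache and batching headroom. A minimal sketch, using the per-component estimates from the table above:

```python
# Approximate VRAM for the GPU-resident components (GB), from the table above.
vram_gb = {
    "bge-large-en-v1.5 (TEI)": 1.5,
    "bge-reranker-v2": 1.0,
    "llama-3.1-8b-fp8 (vLLM)": 8.0,
}

weights_total = sum(vram_gb.values())
headroom = 32.0 - weights_total  # RTX 5090 has 32 GB total

print(f"weights: ~{weights_total:.1f} GB, KV cache/batch headroom: ~{headroom:.1f} GB")
```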
Query flow
- User query arrives at Caddy (TLS) → LiteLLM (auth)
- LiteLLM hits TEI embeddings endpoint → query vector
- Qdrant top-50 retrieval by cosine similarity
- Reranker scores 50 query-doc pairs → top-5
- vLLM receives prompt = system + retrieved chunks + user query
- Streamed response back through LiteLLM to client
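The retrieve → rerank → generate glue is mostly plain Python. Here is a hedged sketch of the two pure steps in the flow above: selecting the top-5 chunks from reranker scores and assembling the vLLM prompt. Function names and the prompt layout are illustrative, not a fixed API.

```python
def top_k_chunks(chunks, scores, k=5):
    """Keep the k highest-scoring chunks from the reranker (step 4 above)."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(system, chunks, user_query):
    """Assemble the generation request: system + retrieved chunks + user query (step 5)."""
    context = "\n\n".join(f"[doc {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": user_query},
    ]
```

In production these sit inside a LiteLLM pre-call hook or a thin API layer; the embedding and generation steps around them are plain HTTP calls to TEI's `/embed` endpoint and vLLM's OpenAI-compatible `/v1/chat/completions`.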
End-to-end latency on a 5090: ~750 ms median, ~1.5 s p99.
Hardware sizing by team size
| Team size | Recommended hardware | Notes |
|---|---|---|
| Small (5-10 users) | RTX 5060 Ti 16 GB | Tight but works |
| Medium (10-30) | RTX 5090 32 GB | Sweet spot |
| Large (30-100) | RTX 6000 Pro 96 GB | Multi-model headroom |
| XL (100+) | 2× 5090 cluster + dedicated embeddings host | Split workloads |
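The tiers above can be codified as a simple capacity lookup, e.g. for provisioning automation. The thresholds are the active-user counts from the table; the function name is illustrative:

```python
def recommend_gpu(active_users: int) -> str:
    """Map an active-user count to the hardware tier from the table above."""
    if active_users <= 10:
        return "RTX 5060 Ti 16 GB"
    if active_users <= 30:
        return "RTX 5090 32 GB"
    if active_users <= 100:
        return "RTX 6000 Pro 96 GB"
    return "2x RTX 5090 cluster + dedicated embeddings host"
```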
Verdict
For self-hosted RAG, single-card RTX 5090 + LiteLLM + Qdrant is the cleanest reference architecture. Above ~30 active users, split embeddings to a separate cheaper card.
Bottom line
Pre-built RAG stacks ship in days, not months. See "build a production AI inference server" for the broader infra picture.