Table of Contents
The RTX 3090 at £159/mo is the cheapest 24 GB GPU we host. Plenty for a small-team RAG deployment.
RAG on 3090: Llama 3.1 8B FP16 (16 GB) + BGE-large embeddings (1.5 GB) + BGE-reranker (1 GB) + Qdrant on disk. Comfortable for 15-20 active users at sub-1s end-to-end.
Stack
- vLLM serving Llama 3.1 8B FP16 (port 8000)
- TEI serving BGE-large + BGE-reranker (ports 8001, 8002)
- Qdrant on disk (port 6333)
- LiteLLM router (port 4000)
- Caddy reverse proxy + TLS (port 443)
Performance
- Llama 3.1 8B FP16 throughput: ~680 tok/s aggregate
- End-to-end query: embed (10ms) → retrieve (15ms) → rerank (50ms) → LLM (~250ms TTFT) ≈ 750ms total
- ~15-20 active concurrent users
Verdict
For small-team RAG (15-20 users), 3090 at £159/mo is the cost leader. For more concurrency, step up to 5090.
Bottom line
Cheapest production RAG host. See RAG architecture guide.