RTX 3050 - Order Now
Home / Blog / Tutorials / RAG Deployment on RTX 3090 24 GB: The Cheap Production Stack
Tutorials

RAG Deployment on RTX 3090 24 GB: The Cheap Production Stack

Building a complete RAG stack on a single RTX 3090 — Llama 3.1 8B FP16, BGE embeddings, BGE-reranker, Qdrant. £179/mo total.

Table of Contents

  1. Stack
  2. Performance
  3. Verdict

The RTX 3090 at £159/mo is the cheapest 24 GB GPU we host. Plenty for a small-team RAG deployment.

TL;DR

RAG on 3090: Llama 3.1 8B FP16 (16 GB) + BGE-large embeddings (1.5 GB) + BGE-reranker (1 GB) + Qdrant on disk. Comfortable for 15-20 active users at sub-1s end-to-end.

Stack

  • vLLM serving Llama 3.1 8B FP16 (port 8000)
  • TEI serving BGE-large + BGE-reranker (ports 8001, 8002)
  • Qdrant on disk (port 6333)
  • LiteLLM router (port 4000)
  • Caddy reverse proxy + TLS (port 443)

Performance

  • Llama 3.1 8B FP16 throughput: ~680 tok/s aggregate
  • End-to-end query: embed (10ms) → retrieve (15ms) → rerank (50ms) → LLM (~250ms TTFT) ≈ 750ms total
  • ~15-20 active concurrent users

Verdict

For small-team RAG (15-20 users), 3090 at £159/mo is the cost leader. For more concurrency, step up to 5090.

Bottom line

Cheapest production RAG host. See RAG architecture guide.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

Ready to deploy your AI workload?

Dedicated GPU servers from our UK datacenter. NVMe storage, 1Gbps networking, full root access.

Browse GPU Servers Contact Sales

Have a question? Need help?