A RAG-backed SaaS (document Q&A, knowledge-base search, customer support automation) runs comfortably on a single RTX 5060 Ti 16GB on our hosting.
Architecture
1. Docs stored in S3 / object storage
2. Ingestion: chunk -> embed (TEI, BGE-base) -> write to Qdrant
3. Query: user text -> embed -> Qdrant top-K -> rerank (TEI) -> LLM with context
4. Single 5060 Ti serves all three: embedding, rerank, LLM
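The query path (step 3) can be sketched as a small function. This is a structural sketch only: the `embed`, `search`, `rerank`, and `generate` callables stand in for the TEI embedder, Qdrant client, TEI reranker, and LLM server, and `TOP_K`/`FINAL_K` are illustrative values, not settings from this page.

```python
from typing import Callable, List, Tuple

TOP_K = 100   # candidates pulled from vector search (illustrative)
FINAL_K = 5   # passages kept after reranking (illustrative)

def answer(query: str,
           embed: Callable[[str], List[float]],
           search: Callable[..., List[Tuple[str, float]]],
           rerank: Callable[[str, List[str]], List[Tuple[str, float]]],
           generate: Callable[[str], str]) -> str:
    """user text -> embed -> Qdrant top-K -> rerank -> LLM with context."""
    qvec = embed(query)                                  # TEI, BGE-base
    candidates = search(qvec, limit=TOP_K)               # Qdrant: (text, score)
    ranked = rerank(query, [t for t, _ in candidates])   # TEI reranker
    context = "\n\n".join(t for t, _ in ranked[:FINAL_K])
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```

Passing the four stages in as callables keeps the pipeline testable and lets each stage point at the same card, since embedding, rerank, and the LLM all live on one GPU here.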
Component Throughput
| Component | Throughput |
|---|---|
| BGE-base embedding | 10,200 texts/s |
| BGE-reranker-base | 3,200 pairs/s |
| Llama 3 8B FP8 decode (aggregate) | ~720 tokens/s |
Memory budget: 1.5 GB (BGE embedder + reranker) + 8 GB (Llama 3 8B FP8 weights) + 2-3 GB KV cache ≈ 12 GB, leaving ~4 GB headroom on the 16 GB card.
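The budget above is simple addition; spelling it out makes the headroom explicit (2.5 GB is the midpoint of the 2-3 GB KV estimate):

```python
# Back-of-envelope VRAM budget from the figures above, in GB.
weights_small = 1.5   # BGE-base embedder + BGE-reranker-base
llm_fp8 = 8.0         # Llama 3 8B weights at FP8
kv_cache = 2.5        # midpoint of the 2-3 GB KV-cache estimate
card = 16.0           # RTX 5060 Ti 16GB

total = weights_small + llm_fp8 + kv_cache
headroom = card - total
print(f"used ~{total:.1f} GB, headroom ~{headroom:.1f} GB")
# prints "used ~12.0 GB, headroom ~4.0 GB"
```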
End-to-End Query Latency
- Embed query: 3 ms
- Vector search top-100: 20 ms
- Rerank 100 candidates: 31 ms
- LLM generation (400 tokens): ~2,000 ms
- Total: ~2.1 s
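Summing the stages confirms the total and shows where the time goes: generation is ~97% of the budget, which is why the "When to Scale" section attacks latency via shorter responses, a smaller model, or prefix caching rather than faster retrieval.

```python
# Per-stage latencies from the list above, in milliseconds.
stages = {
    "embed_query": 3,
    "vector_search_top100": 20,
    "rerank_100": 31,
    "llm_400_tokens": 2000,
}
total_ms = sum(stages.values())
llm_share = stages["llm_400_tokens"] / total_ms
print(f"total {total_ms} ms, LLM share {llm_share:.0%}")
# prints "total 2054 ms, LLM share 97%"
```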
Tenant Capacity
| Workload profile | Sustainable tenants |
|---|---|
| Light (few queries/day per user) | 500-1,000 |
| Medium (20 queries/day) | 150-300 |
| Heavy (chat-style, all-day) | 30-60 |
When to Scale
- Document corpus > 10M docs: add more Qdrant nodes (CPU, not GPU)
- > 60 heavy-chat tenants: add second card or upgrade
- Need 14B LLM quality: switch to Qwen 2.5 14B AWQ, which still fits in 16 GB
- Latency budget < 1s: shorter responses, smaller model, or prefix caching
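As a concrete example of the prefix-caching option, here is a hedged sketch of a vLLM launch for this setup. The flags are real vLLM options, but the exact values (context length, memory fraction) are assumptions tuned to leave room for the embedder and reranker on the same card, not settings taken from this page.

```shell
# Sketch: serve Llama 3 8B at FP8 with automatic prefix caching, capping
# vLLM's VRAM share so TEI (embedder + reranker) fits alongside it.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.75
```

Prefix caching helps most when every query shares a long fixed system prompt, since the shared prefix's KV cache is computed once and reused, cutting time-to-first-token.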
SaaS RAG on Blackwell 16GB
Embed + rerank + LLM on one card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: RAG install, embedding throughput, reranker throughput, document Q&A, startup MVP, RAG pipeline.