Customer-support chatbots are one of the most common production AI workloads. Self-hosting wins on cost predictability and data control.
Reference architecture: a LiteLLM proxy in front of Llama 3.1 8B (FP8), with RAG over the support docs (BGE-large embeddings plus a reranker) and Qdrant as the vector store. An RTX 5090 handles roughly 50 concurrent customers; lighter traffic fits on the 5060 Ti.
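As a rough sketch, the LiteLLM layer can be a single `config.yaml` pointing at a local OpenAI-compatible server (e.g. vLLM) hosting the model. The model name, port, and alias below are placeholders, not a tested configuration:

```yaml
model_list:
  - model_name: llama-3.1-8b          # alias the chatbot backend will request
    litellm_params:
      model: openai/llama-3.1-8b      # assumed: local OpenAI-compatible endpoint
      api_base: http://localhost:8000/v1
      api_key: "none"                 # local server, no real key needed
```

Started with `litellm --config config.yaml`, this gives the web widget one stable API endpoint regardless of which GPU or model sits behind it.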
Architecture
- Web widget → API gateway (Caddy + auth)
- Per-user session state in Postgres or Redis
- RAG over knowledge base (Qdrant)
- LLM (Llama 3.1 8B FP8 default)
- Escalation rules → human handoff
- Conversation logging (with PII redaction) for QA
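The last two items above can be sketched as plain functions. The regex patterns and escalation keywords here are illustrative assumptions; a production deployment would use a dedicated PII-detection library and richer routing rules:

```python
import re

# Illustrative redaction patterns -- assumptions for this sketch, not a
# complete PII inventory (names, addresses, card numbers etc. are omitted).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Placeholder escalation triggers -- real rules would also consider
# sentiment, retry counts, and account status.
ESCALATION_KEYWORDS = {"refund", "cancel", "complaint", "human"}

def redact(text: str) -> str:
    """Replace matched PII with a typed placeholder before logging."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def should_escalate(message: str) -> bool:
    """Hand the conversation to a human when a trigger word appears."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return bool(words & ESCALATION_KEYWORDS)

print(redact("Reach me at jane@example.com or +44 7700 900123"))
# -> Reach me at [EMAIL] or [PHONE]
print(should_escalate("I want a refund now"))  # -> True
```

Redacting before the transcript ever hits the QA log keeps raw PII out of long-term storage entirely, rather than trying to scrub it later.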
Hardware sizing by traffic tier
| Active concurrent customers | Recommended GPU | Monthly cost |
|---|---|---|
| 1-15 | RTX 5060 Ti 16 GB | £119 |
| 16-50 | RTX 5090 32 GB | £399 |
| 51-150 | RTX 6000 Pro 96 GB | £899 |
| 150+ | 2× RTX 5090 cluster + load balancer | £899+ |
Verdict
Self-hosted customer-support chatbots beat hosted APIs starting at roughly 50 active concurrent users, depending on token volume per conversation. Below that, hosted APIs are simpler and cheaper.
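The break-even point can be sanity-checked with rough arithmetic. Every number below other than the £399 GPU fee from the table is an illustrative assumption (blended hosted price per million tokens, monthly token volume per concurrent slot), not a measured figure:

```python
GPU_MONTHLY_GBP = 399.0                 # RTX 5090 tier from the table above
HOSTED_GBP_PER_M_TOKENS = 0.60          # assumed blended hosted price per 1M tokens
TOKENS_PER_SLOT_PER_MONTH = 13_000_000  # assumed token volume per concurrent slot

def hosted_cost(concurrent_users: int) -> float:
    """Monthly hosted-API bill under the assumptions above."""
    return (concurrent_users * TOKENS_PER_SLOT_PER_MONTH / 1_000_000
            * HOSTED_GBP_PER_M_TOKENS)

# Number of concurrent users at which the flat GPU fee undercuts per-token billing.
break_even = GPU_MONTHLY_GBP / hosted_cost(1)
print(round(break_even))  # ~51 under these assumptions
```

Shifting either assumption moves the crossover: heavier token volume per user pushes break-even below 50, lighter usage pushes it up.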
Bottom line
For a customer-facing chatbot at any meaningful scale, dedicated GPU hosting is the right deployment shape. See RAG architecture.