
Self-Hosted Customer Support Chatbot: Architecture and Hardware Sizing

How to architect a customer-support chatbot on dedicated GPU infrastructure — RAG over support docs, ticket-aware context, escalation logic, and GPU sizing per traffic tier.

Customer-support chatbots are one of the most common production AI workloads. Self-hosting wins on cost predictability and data control.

TL;DR

Reference architecture: LiteLLM as the routing layer in front of Llama 3.1 8B FP8, RAG over support docs (BGE-large embeddings + reranker), and Qdrant as the vector store. An RTX 5090 hosts ~50 concurrent customers; smaller traffic fits the 5060 Ti.

Architecture

  • Web widget → API gateway (Caddy + auth)
  • Per-user session in Postgres / Redis
  • RAG over knowledge base (Qdrant)
  • LLM (Llama 3.1 8B FP8 default)
  • Escalation rules → human handoff
  • Conversation logging (with PII redaction) for QA
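The flow above can be sketched as a single request handler. This is a minimal illustration, not production code: the retrieval, reranking, and generation calls are stubbed, and the escalation trigger list and function names are hypothetical — in a real deployment they would hit Qdrant, the BGE reranker, and the LLM endpoint.

```python
from dataclasses import dataclass, field

# Hypothetical escalation triggers; real rules would be richer
# (sentiment, ticket priority, account tier, etc.).
ESCALATION_TRIGGERS = {"refund", "legal", "complaint"}

@dataclass
class Session:
    user_id: str
    history: list = field(default_factory=list)  # backed by Postgres/Redis in production

def should_escalate(message: str, failed_turns: int) -> bool:
    """Hand off to a human on trigger keywords or repeated failed turns."""
    return failed_turns >= 2 or any(t in message.lower() for t in ESCALATION_TRIGGERS)

def handle_turn(session: Session, message: str, failed_turns: int = 0) -> dict:
    if should_escalate(message, failed_turns):
        return {"route": "human", "reply": None}
    docs = retrieve(message)                              # Qdrant top-k (stubbed)
    context = rerank(message, docs)                       # BGE reranker (stubbed)
    reply = generate(session.history, context, message)   # LLM call (stubbed)
    session.history.append((message, reply))              # logged with PII redaction
    return {"route": "bot", "reply": reply}

# --- stubs standing in for the real services ---
def retrieve(query): return ["doc-a", "doc-b"]
def rerank(query, docs): return docs[:1]
def generate(history, ctx, query): return f"Answer based on {ctx[0]}"
```

The key design point is that escalation is checked before any GPU work is done, so handoff turns cost nothing on the inference tier.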

Hardware sizing by traffic tier

Active concurrent customers | Recommended GPU                       | Monthly
1–15                        | RTX 5060 Ti 16 GB                     | £119
15–50                       | RTX 5090 32 GB                        | £399
50–150                      | RTX 6000 Pro 96 GB                    | £899
150+                        | 2× RTX 5090 cluster + load balancer   | £899+
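The tiers above come down to token throughput per user. A back-of-envelope ceiling, where every figure (tokens per turn, turn rate, aggregate throughput) is an illustrative assumption rather than a benchmarked number:

```python
def max_concurrent_users(gpu_tokens_per_sec: float,
                         tokens_per_turn: int = 1500,
                         turns_per_user_per_min: float = 1.0) -> int:
    """Upper bound on active users a GPU can serve.

    tokens_per_turn counts prompt + RAG context + reply, since prefill
    and decode both consume GPU time.
    """
    per_user_tok_per_sec = tokens_per_turn * turns_per_user_per_min / 60
    return int(gpu_tokens_per_sec // per_user_tok_per_sec)

# e.g. assuming ~1,500 tok/s aggregate batched throughput for an 8B FP8 model:
# max_concurrent_users(1500) gives ~60, in the ballpark of the 15-50 tier
# once you allow headroom for bursts and reranking latency.
```

Treat the result as a ceiling: real sizing needs slack for traffic spikes and for the embedding/reranking stages sharing the box.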

Verdict

Self-hosted customer-support chatbots beat hosted APIs starting at ~50 active concurrent users (depending on token volume). Below that, hosted is simpler and cheaper.
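The break-even point is just flat rent versus per-token billing. A sketch, with an assumed (not quoted) hosted-API price:

```python
def breakeven_tokens_per_month(gpu_monthly_gbp: float,
                               api_gbp_per_million_tokens: float) -> float:
    """Monthly token volume above which flat-rate GPU rent is cheaper
    than a pay-per-token hosted API."""
    return gpu_monthly_gbp / api_gbp_per_million_tokens * 1_000_000

# e.g. £399/month (RTX 5090 tier) vs a hypothetical £0.40 per million tokens:
# break-even at roughly 1 billion tokens/month.
```

This is why concurrency matters more than raw user count: a few dozen always-active users generate token volumes that idle accounts never approach.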

Bottom line

For a customer-facing chatbot at any meaningful scale, dedicated GPU hosting is the right deployment shape. See RAG architecture.
