
Multi-Tenant AI Chatbot SaaS Architecture on Self-Hosted GPUs

Building a multi-tenant chatbot SaaS on dedicated GPU infrastructure — tenant isolation, per-tenant rate limiting, model routing, and the cost model that pays back.

SaaS products built on AI inference typically route to OpenAI or Anthropic and pass per-token costs on to customers. Self-hosted infrastructure changes the unit economics, but only if the multi-tenant architecture is right.

TL;DR

For multi-tenant chatbot SaaS: LiteLLM as the router (per-tenant API keys + rate limits), vLLM as the engine (multi-LoRA for per-tenant fine-tunes), Qdrant with per-tenant collections for RAG. Self-hosting wins above ~30 active tenants.

The reference architecture

  • API gateway: Caddy with TLS, mTLS or JWT auth.
  • Router: LiteLLM with per-tenant master keys, per-tenant budgets and rate limits.
  • Inference engine: vLLM with --enable-lora serving a base + per-tenant LoRA adapters.
  • RAG infra: Qdrant with per-tenant collection isolation. BGE-large + reranker.
  • Vector store auth: tenant-scoped access enforced through the collection naming convention (collections, not rows, are the isolation boundary).
  • Observability: per-tenant metrics tagged in Prometheus.
  • Billing: LiteLLM per-key cost tracking in Postgres.
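Per-tenant keys, budgets, and rate limits are provisioned through LiteLLM's key-generation endpoint. A minimal sketch of the payload, assuming LiteLLM's proxy field names (`key_alias`, `max_budget`, `budget_duration`, `rpm_limit`, `tpm_limit`); the budget and limit values are illustrative, not recommendations:

```python
def tenant_key_request(tenant_id: str) -> dict:
    """Build a /key/generate payload for provisioning a new tenant on the
    LiteLLM proxy. Values here are illustrative defaults."""
    return {
        "key_alias": f"tenant-{tenant_id}",
        "max_budget": 25.0,        # hard spend cap for this key
        "budget_duration": "30d",  # budget resets every 30 days
        "rpm_limit": 60,           # requests per minute
        "tpm_limit": 100_000,      # tokens per minute
        "metadata": {"tenant_id": tenant_id},  # surfaces tenant_id in logs/billing
    }
```

POSTing this to the proxy's `/key/generate` endpoint (with the master key) returns the tenant's API key; the `metadata` field is what lets per-tenant cost tracking and audit logs stay joined to the tenant.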

Tenant isolation

  • API key isolation: each tenant gets their own LiteLLM key. Per-key budget caps prevent runaway costs.
  • Rate limiting: per-key requests-per-minute and tokens-per-minute.
  • Vector isolation: per-tenant Qdrant collection. An application-layer guard makes cross-tenant queries impossible.
  • Model isolation: per-tenant LoRA adapters via vLLM's --lora-modules flag, with the adapter selected per request.
  • Audit logs: tenant_id in every request log.
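The vector-isolation guard can be sketched in a few lines. This is a hypothetical shape, not Qdrant API code: collection names are derived deterministically from the tenant ID, and every query is checked against the caller's authenticated tenant before anything is forwarded to the vector store:

```python
import re

def collection_for(tenant_id: str) -> str:
    """Map a tenant ID to its dedicated Qdrant collection name."""
    if not re.fullmatch(r"[a-z0-9_-]{1,64}", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}_docs"

def guarded_search(authed_tenant: str, requested_collection: str) -> str:
    """Refuse any query whose target collection does not belong to the caller."""
    if requested_collection != collection_for(authed_tenant):
        raise PermissionError("cross-tenant query blocked")
    # ...forward the search to Qdrant here...
    return requested_collection
```

The key property: the collection name is never taken from user input directly; it is always recomputed from the authenticated tenant ID, so a forged collection name in the request can only fail.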

Model routing

LiteLLM config routes by virtual model name:

```yaml
model_list:
  - model_name: chat-fast
    litellm_params:
      model: openai/mistral-7b
      api_base: http://localhost:8000/v1
  - model_name: chat-strong
    litellm_params:
      model: openai/qwen2.5-32b
      api_base: http://localhost:8001/v1

router_settings:
  fallbacks: [{"chat-strong": ["chat-fast"]}]
  routing_strategy: latency-based-routing
```
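From the tenant's side, the proxy looks like any OpenAI-compatible endpoint: the tenant authenticates with their own key and names a virtual model, and LiteLLM resolves the backend. A minimal sketch of the request shape (the proxy URL assumes LiteLLM's default port 4000; key and message are placeholders):

```python
def build_chat_request(tenant_key: str, model: str, user_message: str) -> dict:
    """Assemble an OpenAI-compatible chat completion request for the
    LiteLLM proxy. Returns url/headers/body ready for any HTTP client."""
    return {
        "url": "http://localhost:4000/v1/chat/completions",  # default LiteLLM proxy port
        "headers": {
            "Authorization": f"Bearer {tenant_key}",  # per-tenant LiteLLM key
            "Content-Type": "application/json",
        },
        "body": {
            "model": model,  # virtual name, e.g. "chat-fast"; the router picks the backend
            "messages": [{"role": "user", "content": user_message}],
        },
    }
```

Because the model name is virtual, you can reroute `chat-strong` to new hardware, or fall back to `chat-fast` under load, without any tenant-visible change.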

Cost model

Per-tenant cost = (base server cost ÷ active tenants) + per-tenant LoRA training cost (one-time, amortised).

For a 30-tenant deployment on a 5090 at £399/mo: ~£13/tenant/month. Charge £30-50/tenant/month and you have a healthy margin.
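The arithmetic above, as a one-liner you can plug your own numbers into (the 12-month amortisation period for LoRA training is an assumption, not from the article):

```python
def per_tenant_monthly(server_cost: float, active_tenants: int,
                       lora_training_cost: float = 0.0,
                       amortise_months: int = 12) -> float:
    """Monthly per-tenant cost: shared server cost plus one-time
    LoRA training amortised over a chosen period."""
    return server_cost / active_tenants + lora_training_cost / amortise_months
```

With the figures from the article, `per_tenant_monthly(399, 30)` gives ~£13.3/tenant/month; the function also makes it easy to see how quickly per-tenant cost falls as tenant count grows.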

Verdict

Multi-tenant chatbot SaaS on self-hosted infrastructure is genuinely viable. The architecture is well-trodden — LiteLLM + vLLM multi-LoRA + Qdrant per-tenant collections — and the unit economics work above ~30 active tenants.

Bottom line

For SaaS products built on chatbots, self-hosting beats per-token APIs starting at ~30 tenants. See API hosting for the deployment side.


gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
