SaaS products built on AI inference traditionally route to OpenAI / Anthropic and pass costs to customers. Self-hosted infrastructure changes the unit economics — but only if the multi-tenant architecture is right.
For multi-tenant chatbot SaaS: LiteLLM as the router (per-tenant API keys + rate limits), vLLM as the engine (multi-LoRA for per-tenant fine-tunes), Qdrant with per-tenant collections for RAG. Self-hosting wins above ~30 active tenants.
The reference architecture
- API gateway: Caddy terminating TLS, with mTLS or JWT auth.
- Router: LiteLLM with per-tenant master keys, per-tenant budgets and rate limits.
- Inference engine: vLLM with `--enable-lora`, serving one base model plus per-tenant LoRA adapters.
- RAG infra: Qdrant with per-tenant collection isolation (see the sketch after this list). BGE-large embeddings plus a reranker.
- Vector store auth: row-level-security-style isolation via the collection naming convention.
- Observability: per-tenant metrics tagged in Prometheus.
- Billing: LiteLLM per-key cost tracking in Postgres.
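
To make the collection-per-tenant convention concrete, here is a minimal Python sketch using qdrant-client. The `tenant_collection` helper and the `tenant_{id}` naming scheme are illustrative choices, not anything Qdrant prescribes; the 1024-dim vector size matches BGE-large.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

def tenant_collection(tenant_id: str) -> str:
    # The naming convention doubles as the isolation boundary:
    # every read and write goes through this helper, so a query
    # can never land in another tenant's collection.
    return f"tenant_{tenant_id}"

def provision_tenant(tenant_id: str) -> None:
    # BGE-large embeddings are 1024-dimensional.
    client.create_collection(
        collection_name=tenant_collection(tenant_id),
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

def rag_search(tenant_id: str, query_vector: list[float], top_k: int = 5):
    return client.search(
        collection_name=tenant_collection(tenant_id),
        query_vector=query_vector,
        limit=top_k,
    )
```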
Tenant isolation
- API key isolation: each tenant gets its own LiteLLM virtual key (sketched after this list). Per-key budget caps prevent runaway costs.
- Rate limiting: per-key requests-per-minute and tokens-per-minute.
- Vector isolation: one Qdrant collection per tenant. An application-layer filter ensures cross-tenant queries are impossible.
- Model isolation: per-tenant LoRA adapters registered via vLLM's `--lora-modules` flag and selected per request by adapter name.
- Audit logs: tenant_id in every request log.
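
Tenant onboarding boils down to one call against LiteLLM's `/key/generate` admin endpoint. A minimal sketch, assuming the proxy runs on its default port 4000; the budget and rate-limit numbers are placeholders, and the `tenant_id` metadata is what tags per-tenant logs and billing.

```python
import requests

LITELLM_URL = "http://localhost:4000"  # default LiteLLM proxy port
MASTER_KEY = "sk-admin-..."            # proxy master key (illustrative)

def create_tenant_key(tenant_id: str) -> str:
    # Generate a virtual key scoped to one tenant, with a monthly
    # budget cap and per-key rate limits.
    resp = requests.post(
        f"{LITELLM_URL}/key/generate",
        headers={"Authorization": f"Bearer {MASTER_KEY}"},
        json={
            "models": ["chat-fast", "chat-strong"],  # virtual names from the router config
            "max_budget": 25.0,        # hard spend cap before the key is blocked
            "budget_duration": "30d",  # reset the budget monthly
            "rpm_limit": 60,           # requests per minute
            "tpm_limit": 100_000,      # tokens per minute
            "metadata": {"tenant_id": tenant_id},  # tags every request log
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]
```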
Model routing
LiteLLM config routes by virtual model name:
```yaml
model_list:
  - model_name: chat-fast
    litellm_params:
      model: openai/mistral-7b
      api_base: http://localhost:8000/v1
  - model_name: chat-strong
    litellm_params:
      model: openai/qwen2.5-32b
      api_base: http://localhost:8001/v1

router_settings:
  fallbacks: [{"chat-strong": ["chat-fast"]}]
  routing_strategy: latency-based-routing
```
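
From the tenant application's side, the router is just an OpenAI-compatible endpoint. A sketch, again assuming the proxy's default port 4000 and a tenant-scoped key like the one generated above:

```python
from openai import OpenAI

# Each tenant's app holds its own LiteLLM virtual key; the router
# resolves the virtual model name and applies the fallback chain.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-tenant-...")

reply = client.chat.completions.create(
    model="chat-fast",  # virtual name; LiteLLM routes it to the vLLM backend
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```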
Cost model
Per-tenant cost = base server cost / active tenants + per-tenant LoRA training cost (one-time).
For a 30-tenant deployment on a 5090 at £399/mo: £399 / 30 ≈ £13/tenant/month. Charge £30-50/tenant/month and you have a healthy margin.
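
The arithmetic, as a quick sanity check (numbers from above; the £40 price point is the midpoint of the suggested range):

```python
SERVER_COST = 399.0      # £/month for the single-GPU box
PRICE_PER_TENANT = 40.0  # £/month, midpoint of the £30-50 range

def unit_economics(active_tenants: int) -> tuple[float, float]:
    # Amortized infra cost per tenant, ignoring one-time LoRA training.
    cost = SERVER_COST / active_tenants
    gross_margin = 1 - cost / PRICE_PER_TENANT
    return cost, gross_margin

cost, margin = unit_economics(30)
print(f"£{cost:.2f}/tenant/month, {margin:.0%} gross margin")
# -> £13.30/tenant/month, 67% gross margin
```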
Verdict
Multi-tenant chatbot SaaS on self-hosted infrastructure is genuinely viable. The architecture is well-trodden — LiteLLM + vLLM multi-LoRA + Qdrant per-tenant collections — and the unit economics work above ~30 active tenants.
Bottom line
For SaaS products built on chatbots, self-hosting beats per-token APIs starting at ~30 tenants. See API hosting for the deployment side.