For SaaS products where each customer wants a fine-tuned model, naive deployment requires N model copies in VRAM. vLLM multi-LoRA changes that — the base model stays loaded once, customer-specific adapters swap in at request time.
vLLM with --enable-lora serves up to ~50 LoRA adapters from a single base model. Adapter swap latency is sub-100 ms. VRAM cost per adapter: ~200-400 MB. Practical for multi-tenant chatbot SaaS up to ~30 active tenants per GPU.
Why multi-LoRA matters
Traditionally, each fine-tuned customer model needs its own dedicated server. With multi-LoRA, you serve N customers from one GPU; the VRAM budget (sanity-checked in the sketch after this list):
- Base Llama 3.1 8B FP8: ~8 GB
- Per-tenant LoRA r=64: ~140 MB
- 30 tenants on one 5090: ~12 GB total
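A quick Python sanity check of that arithmetic; the constants are the figures assumed in the list above, not measurements:

# Sanity-check the VRAM arithmetic above. Constants are the article's
# assumed figures, not measured values.
BASE_MODEL_GB = 8.0  # Llama 3.1 8B weights at FP8
ADAPTER_MB = 140     # one r=64 LoRA adapter

def weights_vram_gb(tenants: int) -> float:
    """Total weight VRAM: base model plus one resident adapter per tenant."""
    return BASE_MODEL_GB + tenants * ADAPTER_MB / 1024

print(f"{weights_vram_gb(30):.1f} GB")  # -> 12.1 GB, matching the ~12 GB above

Note this covers weights only; the KV cache and activations still need the remaining VRAM.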
Setup
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--enable-lora \
--max-loras 30 \
--max-cpu-loras 100 \
--max-lora-rank 64 \
--lora-modules \
customer-a=/data/loras/customer-a \
customer-b=/data/loras/customer-b
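Listing every adapter on the command line means a restart whenever a tenant signs up. Recent vLLM versions can also register adapters at runtime via the /v1/load_lora_adapter endpoint, gated behind the VLLM_ALLOW_RUNTIME_LORA_UPDATING=True environment variable. A sketch, with an illustrative tenant name and path:

import requests

# Register a new tenant's adapter without restarting the server. Requires
# the server to run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={
        "lora_name": "customer-c",             # illustrative tenant name
        "lora_path": "/data/loras/customer-c", # illustrative adapter path
    },
)
resp.raise_for_status()

# The mirror endpoint, /v1/unload_lora_adapter, evicts an adapter by name
# when a tenant churns.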
Each request selects its adapter via the model field:
from openai import OpenAI

# Any OpenAI-compatible client works; point it at the vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

client.chat.completions.create(
    model="customer-a",  # picks the customer-a LoRA
    messages=[...],
)
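In a multi-tenant service, the routing layer reduces to a lookup from tenant ID to adapter name before this call. A minimal sketch; the tenant registry and the base-model fallback are assumptions of this example, not vLLM features:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tenant registry; tenants without a fine-tune fall back to base.
TENANT_ADAPTERS = {"acme": "customer-a", "globex": "customer-b"}
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def chat(tenant_id: str, messages: list[dict]) -> str:
    """Route a tenant's request to their LoRA, or the base model if none."""
    model = TENANT_ADAPTERS.get(tenant_id, BASE_MODEL)
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content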
Performance overhead
- Throughput: ~10% drop vs base-only serving
- TTFT: +20-30 ms for the adapter swap, cached after first use (easy to verify with the timing sketch below)
- VRAM: 200-400 MB per active adapter; CPU-resident adapters (up to --max-cpu-loras) are paged in on demand
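A rough way to measure that swap cost yourself: time the first streamed token for a cold adapter versus a warm one, assuming the server and client from the setup section:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ttft_seconds(model: str) -> float:
    """Time until the first streamed chunk arrives (a TTFT proxy)."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hi"}],
        max_tokens=8,
        stream=True,
    )
    next(iter(stream))  # block until the first token
    return time.perf_counter() - start

cold = ttft_seconds("customer-b")  # first use: adapter must be paged in
warm = ttft_seconds("customer-b")  # cached: the swap overhead disappears
print(f"cold={cold * 1000:.0f} ms, warm={warm * 1000:.0f} ms")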
Verdict
Multi-LoRA serving is the architecture that makes per-customer fine-tuning economically viable: roughly 30 customers per 5090 at ~£12/customer/month in server cost.
Bottom line
For multi-tenant chatbot SaaS with custom fine-tunes, vLLM multi-LoRA is the right pattern. See multi-tenant SaaS architecture.