
vLLM Multi-LoRA Serving: One Base Model, N Customer Adapters

vLLM's --enable-lora lets you serve a base model + multiple LoRA adapters from the same engine. The pattern that makes multi-tenant fine-tuned SaaS practical.

For SaaS products where each customer wants a fine-tuned model, naive deployment requires N model copies in VRAM. vLLM multi-LoRA changes that — the base model stays loaded once, customer-specific adapters swap in at request time.

TL;DR

vLLM with --enable-lora serves up to ~50 LoRA adapters from a single base model. Adapter swap latency is sub-100 ms. VRAM cost per adapter: ~200-400 MB. Practical for multi-tenant chatbot SaaS up to ~30 active tenants per GPU.

Why multi-LoRA matters

Fine-tuned customer models traditionally each need their own server. With multi-LoRA, you serve N customers from one GPU:

  • Base Llama 3.1 8B FP8: ~8 GB
  • Per-tenant LoRA r=64: ~140 MB
  • 30 tenants on one 5090: ~12 GB total
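The arithmetic above generalises to other adapter sizes and tenant counts. A minimal budget sketch (the figures are this post's estimates, not vLLM guarantees):

```python
def vram_budget_gb(base_gb: float, adapter_mb: float, tenants: int,
                   overhead_gb: float = 0.0) -> float:
    """Estimate VRAM for one base model plus N resident LoRA adapters.

    base_gb     -- base model weights (e.g. ~8 GB for Llama 3.1 8B FP8)
    adapter_mb  -- per-adapter weight size (e.g. ~140 MB at rank 64)
    tenants     -- number of adapters kept resident on the GPU
    overhead_gb -- optional headroom for KV cache, activations, etc.
    """
    return base_gb + tenants * adapter_mb / 1024 + overhead_gb

# ~8 GB base + 30 x ~140 MB adapters
print(round(vram_budget_gb(8.0, 140, 30), 1))  # ~12.1 GB
```

Leave real headroom beyond this: the KV cache for 30 concurrent tenants usually dwarfs the adapters themselves.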

Setup

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --enable-lora \
  --max-loras 30 \
  --max-cpu-loras 100 \
  --max-lora-rank 64 \
  --lora-modules \
    customer-a=/data/loras/customer-a \
    customer-b=/data/loras/customer-b
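Adapters don't have to be listed at startup. Recent vLLM builds expose a /v1/load_lora_adapter endpoint when the server is started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; the adapter name and path below are illustrative:

```python
import json
import urllib.request

def load_adapter_request(base_url: str, name: str, path: str) -> urllib.request.Request:
    """Build the POST that registers a new LoRA adapter at runtime.

    Assumes the vLLM server was launched with
    VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; name/path are examples.
    """
    payload = json.dumps({"lora_name": name, "lora_path": path}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = load_adapter_request("http://localhost:8000",
                           "customer-c", "/data/loras/customer-c")
# urllib.request.urlopen(req)  # uncomment against a live server
print(req.full_url)
```

This is how you onboard a new tenant without restarting the engine; /v1/unload_lora_adapter is the matching teardown call.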

Requests select an adapter via the model field. Assuming an OpenAI-compatible client pointed at the vLLM server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

client.chat.completions.create(
    model="customer-a",   # routes to the customer-a LoRA
    messages=[...],
)
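In a SaaS backend you typically resolve the adapter name from the tenant before building the request. A minimal routing sketch (the tenant map and adapter names are hypothetical):

```python
# Hypothetical tenant -> adapter mapping; in production this lives in your DB.
ADAPTERS = {"acme": "customer-a", "globex": "customer-b"}
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def model_for_tenant(tenant_id: str) -> str:
    """Pick the tenant's LoRA name, or fall back to the base model.

    vLLM serves the un-adapted base model when `model` is the base
    model name, so tenants without a fine-tune still get answers.
    """
    return ADAPTERS.get(tenant_id, BASE_MODEL)

print(model_for_tenant("acme"))     # customer-a
print(model_for_tenant("unknown"))  # falls back to the base model name
```

The fallback matters: a missing adapter should degrade to the base model, not to a 404.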

Performance overhead

  • Throughput: ~10% drop vs base-only serving
  • TTFT: +20-30 ms for adapter swap (cached after first use)
  • VRAM: 200-400 MB per active adapter (CPU-LoRAs are paged in on demand)
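The "cached after first use" behaviour follows from the GPU slot budget set by --max-loras. A toy LRU sketch (not vLLM's actual scheduler, which also batches across adapters) of why repeat tenants stay warm while each distinct tenant beyond the slot budget triggers paging:

```python
from collections import OrderedDict

class LoraSlotCache:
    """Toy LRU model of vLLM's GPU adapter slots (--max-loras)."""

    def __init__(self, max_loras: int):
        self.max_loras = max_loras
        self.slots = OrderedDict()  # adapter name -> None, in LRU order

    def request(self, adapter: str) -> str:
        if adapter in self.slots:
            self.slots.move_to_end(adapter)
            return "warm"                       # already on GPU: no swap cost
        if len(self.slots) >= self.max_loras:
            self.slots.popitem(last=False)      # evict LRU adapter to CPU
        self.slots[adapter] = None
        return "paged-in"                       # cold: pays the ~20-30 ms swap

cache = LoraSlotCache(max_loras=2)
print(cache.request("customer-a"))  # paged-in
print(cache.request("customer-a"))  # warm
print(cache.request("customer-b"))  # paged-in
print(cache.request("customer-c"))  # paged-in, evicts customer-a
print(cache.request("customer-a"))  # paged-in again (was evicted)
```

This is why --max-loras should cover your *concurrently active* tenant count, with --max-cpu-loras as the cheaper second tier.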

Verdict

Multi-LoRA serving is the architecture that makes per-customer fine-tuning economically viable. ~30 customers per 5090 at ~£12/customer/month server cost.

Bottom line

For multi-tenant chatbot SaaS with custom fine-tunes, vLLM multi-LoRA is the right pattern. See multi-tenant SaaS architecture.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
