vLLM 0.4+ has native multi-LoRA support: load a base model, attach multiple LoRA adapters, route each request to its appropriate adapter. The economics for per-tenant fine-tuning shift dramatically — instead of one GPU per fine-tune, one GPU serves dozens of fine-tunes.
Use `vllm serve --enable-lora --max-loras 30 --max-lora-rank 64`. Adapters are loaded dynamically per request via the `model` field in the API call. Per-adapter VRAM cost: ~50-200 MB (vs ~14 GB for a separate base-model-plus-LoRA deployment). For SaaS with per-tenant fine-tunes, this is what makes the multi-tenant economics work.
How it works
- Start vLLM with the base model and multi-LoRA enabled
- Adapters live on the HF Hub or the local filesystem
- An API request specifies its adapter via the `model` field
- vLLM dynamically loads the requested adapter (cold-load ~50-200 ms the first time)
- Subsequent requests hit warm, cached adapters
- LRU eviction when `--max-loras` is reached (see the cache sketch after this list)
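The LRU behaviour is worth internalising, because it determines which tenants pay the cold-load penalty. A minimal sketch of the caching policy (an illustration only, not vLLM's actual implementation; `load_adapter` is a hypothetical stand-in for the real load path):

```python
from collections import OrderedDict

MAX_LORAS = 30  # mirrors --max-loras

class AdapterCache:
    """Toy LRU cache illustrating vLLM-style adapter slot management."""

    def __init__(self, capacity: int = MAX_LORAS):
        self.capacity = capacity
        self.slots: OrderedDict[str, object] = OrderedDict()

    def get(self, adapter_id: str):
        if adapter_id in self.slots:
            # Warm hit: ~20 ms swap; mark as most recently used
            self.slots.move_to_end(adapter_id)
            return self.slots[adapter_id]
        if len(self.slots) >= self.capacity:
            # Evict the least-recently-used adapter to free a slot
            self.slots.popitem(last=False)
        # Cold load: ~50-200 ms to fetch weights from Hub/disk
        self.slots[adapter_id] = self.load_adapter(adapter_id)
        return self.slots[adapter_id]

    def load_adapter(self, adapter_id: str):
        return f"weights for {adapter_id}"  # placeholder for the real load
```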
Setup
```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --enable-lora \
  --max-loras 30 \
  --max-lora-rank 64 \
  --port 8000
```
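If you'd rather register adapters explicitly than rely on per-request resolution, vLLM also exposes a runtime load endpoint, gated behind `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` on the server. A minimal sketch, assuming the adapter weights have already been downloaded to a local path (the path and name here are hypothetical):

```python
import requests

# Register an adapter with the running server; the lora_name then becomes
# a valid value for the `model` field in subsequent completions.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={
        "lora_name": "customer-acme",
        "lora_path": "/adapters/customer-acme",  # hypothetical local path
    },
)
resp.raise_for_status()
```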
Client side: specify the adapter ID in the `model` field:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Adapter loaded from the HF Hub or a local path
resp = client.chat.completions.create(
    model="your-org/customer-acme-adapter",
    messages=[{"role": "user", "content": "..."}],
)
```
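In a multi-tenant service the routing layer typically reduces to a tenant-to-adapter lookup before the call. A sketch with hypothetical names (`TENANT_ADAPTERS`, `handle_request` are illustrative, not part of any API):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Hypothetical mapping maintained by your control plane
TENANT_ADAPTERS = {
    "acme": "your-org/customer-acme-adapter",
    "globex": "your-org/customer-globex-adapter",
}

def handle_request(tenant_id: str, user_message: str) -> str:
    # Fall back to the base model for tenants without a fine-tune
    model = TENANT_ADAPTERS.get(tenant_id, "meta-llama/Meta-Llama-3.1-8B-Instruct")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return resp.choices[0].message.content
```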
Performance
- Cold-load latency: ~50-200 ms first request per adapter
- Warm-load latency: ~20 ms (adapter swap)
- Throughput penalty: ~10-15% per concurrently active adapter (more active adapters means more SM time spent on adapter compute)
- VRAM per adapter: rank-dependent; ~150 MB at r=64, ~75 MB at r=32 (see the estimate below)
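The rank-dependence follows directly from the LoRA parameter count: each target module adds an A matrix (r × d_in) and a B matrix (d_out × r). A back-of-envelope estimate for Llama-3.1-8B, assuming attention-only targets (q/k/v/o projections across all 32 layers, 1024-wide k/v under GQA) in fp16; targeting the MLP layers as well pushes the figure higher:

```python
# LoRA adapter size estimate for Llama-3.1-8B (attention-only targets, fp16)
r = 64
bytes_per_param = 2  # fp16
layers = 32

# (d_in, d_out) per attention projection; k/v are 1024-wide under GQA
projections = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

# Each module contributes A: r * d_in plus B: d_out * r parameters
params_per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections)
total_bytes = params_per_layer * layers * bytes_per_param
print(f"~{total_bytes / 2**20:.0f} MiB")  # ~104 MiB at r=64; halves at r=32
```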
Verdict
For SaaS with per-tenant fine-tuning, vLLM multi-LoRA is the economics enabler: a £289/mo 4090 serving 30 customer fine-tunes works out to ~£10/customer/mo of infrastructure cost, vs ~£280/customer/mo with separate-process serving (one GPU per fine-tune). The same applies to agency, per-product, or per-task customisation.
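The arithmetic behind those figures, assuming the £289/mo GPU cost and full utilisation:

```python
gpu_monthly_cost = 289  # £/mo for the 4090
tenants = 30

multi_lora = gpu_monthly_cost / tenants  # one GPU shared by all adapters
separate = gpu_monthly_cost / 1          # one GPU per fine-tune

print(f"multi-LoRA: ~£{multi_lora:.0f}/customer/mo")  # ~£10
print(f"separate:   ~£{separate:.0f}/customer/mo")    # ~£289, the ~£280 figure above
```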
Bottom line
Multi-LoRA = the per-tenant economics enabler. See LoRAX as an alternative.