
vLLM Multi-LoRA Deployment

vLLM's native multi-LoRA support: serve many fine-tuned variants from one base model. The right deployment pattern for SaaS multi-tenancy.

vLLM 0.4+ has native multi-LoRA support: load one base model, attach multiple LoRA adapters, and route each request to the appropriate adapter. This shifts the economics of per-tenant fine-tuning dramatically: instead of one GPU per fine-tune, one GPU serves dozens of fine-tunes.

TL;DR

Use vllm serve --enable-lora --max-loras 30 --max-lora-rank 64. Adapters are loaded dynamically per request via the model field of the API call. Per-adapter VRAM cost is ~50-200 MB, versus ~14 GB for serving a separate base-plus-LoRA model. For SaaS with per-tenant fine-tunes, this is what makes the multi-tenant economics work.

How it works

  1. Start vLLM with base model + multi-LoRA enabled
  2. Adapters live on the HF Hub or the local filesystem
  3. API request specifies the adapter via the model field
  4. vLLM dynamically loads the requested adapter (cold-load ~50-200 ms the first time; see the registration sketch after this list)
  5. Subsequent requests hit warm cached adapters
  6. LRU eviction when max-loras reached
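
Depending on version and configuration, vLLM may require adapters to be registered before their names resolve in the model field. When the server is started with the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=True, registration can happen at runtime via the /v1/load_lora_adapter endpoint. A minimal sketch; the adapter name and path are placeholders:

import requests

# Register an adapter at runtime so its name resolves in the `model` field.
# Requires the server to run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={
        "lora_name": "customer-acme",            # name used in API requests
        "lora_path": "/adapters/customer-acme",  # local path to the adapter
    },
)
resp.raise_for_status()

# Unload when a tenant is offboarded, freeing the slot for LRU reuse
resp = requests.post(
    "http://localhost:8000/v1/unload_lora_adapter",
    json={"lora_name": "customer-acme"},
)
resp.raise_for_status()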

Setup

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --enable-lora \
  --max-loras 30 \
  --max-lora-rank 64 \
  --port 8000
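
Adapters can also be pre-registered at startup with --lora-modules name=path. Either way, the OpenAI-compatible /v1/models endpoint shows what the server currently exposes, which makes for a quick sanity check. A sketch, assuming the server above is running locally:

import requests

# List the base model plus any registered LoRA adapters
models = requests.get("http://localhost:8000/v1/models").json()
for m in models["data"]:
    print(m["id"])  # base model ID, then adapter names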

Client side: specify the adapter ID in the model field:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Adapter loaded from HF Hub or local path
resp = client.chat.completions.create(
    model="your-org/customer-acme-adapter",
    messages=[{"role": "user", "content": "..."}],
)
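
In a multi-tenant service, the adapter name typically comes from a per-tenant lookup. A minimal routing sketch; the tenant-to-adapter mapping and names here are hypothetical:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Hypothetical mapping; in production this would live in your tenant database
TENANT_ADAPTERS = {
    "acme": "your-org/customer-acme-adapter",
    "globex": "your-org/customer-globex-adapter",
}

def complete_for_tenant(tenant_id: str, prompt: str) -> str:
    # Tenants without a fine-tune fall back to the shared base model
    model = TENANT_ADAPTERS.get(tenant_id, "meta-llama/Meta-Llama-3.1-8B-Instruct")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content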

Performance

  • Cold-load latency: ~50-200 ms first request per adapter
  • Warm-load latency: ~20 ms (adapter swap)
  • Throughput penalty: ~10-15% per concurrent active adapter (more adapters = more SM time on adapter compute)
  • VRAM per adapter: rank-dependent; r=64 ~150 MB, r=32 ~75 MB (a back-of-envelope estimate follows this list)
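
The rank-to-VRAM relationship follows from the LoRA parameter count: each adapted weight matrix of shape d_out x d_in gains two low-rank factors totalling r x (d_in + d_out) parameters. A back-of-envelope estimate in Python, assuming fp16 adapter weights and attention-only targets (q/k/v/o projections) on Llama-3.1-8B; adapters that also target the MLP projections are roughly 3x larger:

def lora_vram_mb(rank: int, shapes: list[tuple[int, int]],
                 n_layers: int, bytes_per_param: int = 2) -> float:
    # Each (d_out, d_in) target matrix gains A (rank x d_in) and B (d_out x rank),
    # i.e. rank * (d_in + d_out) extra parameters
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in shapes)
    return per_layer * n_layers * bytes_per_param / 1024**2

# Llama-3.1-8B attention projections: q, k, v, o (8 KV heads -> 1024-dim k/v)
attn = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
print(f"r=64: {lora_vram_mb(64, attn, n_layers=32):.0f} MB")  # ~104 MB
print(f"r=32: {lora_vram_mb(32, attn, n_layers=32):.0f} MB")  # ~52 MB

This lands in the same range as the figures above; the exact footprint depends on which modules the adapter targets, and vLLM also sizes its adapter slots from --max-lora-rank, so the configured maximum matters as much as any single adapter's rank.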

Verdict

For SaaS with per-tenant fine-tuning, vLLM multi-LoRA is the economics enabler: a £289/mo 4090 serving 30 customer fine-tunes works out to ~£10/customer/mo of infrastructure cost, versus ~£280/customer/mo with separate-process serving. The same applies to agency, per-product, or per-task customisation.

Bottom line

Multi-LoRA is the per-tenant economics enabler. See LoRAX for an alternative.
