If you have fine-tuned many small LoRA adapters on top of one base model, running each variant as a separate vLLM instance wastes VRAM. LoRAX (and similar multi-LoRA serving engines) lets you load the base model once and dynamically apply the right adapter per request on dedicated GPU hosting.
When Multi-LoRA Pays
A SaaS with per-tenant fine-tunes: 30 customers, each with a LoRA trained on their documents. Running 30 separate 8B model instances requires 30x the VRAM. LoRAX loads one base 8B model plus 30 small LoRA adapters (typically 50-200 MB each) and switches adapter per request. Total VRAM: one model + 30 tiny adapters.
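The savings are easy to quantify. A back-of-envelope comparison, using assumed sizes of ~16 GB for fp16 8B weights and ~150 MB per adapter (illustrative; actual numbers depend on precision and adapter rank):

```python
# VRAM comparison for 30 per-tenant fine-tunes of an 8B model.
BASE_GB = 16.0      # assumed fp16 8B base model
ADAPTER_GB = 0.15   # assumed ~150 MB per LoRA adapter
TENANTS = 30

separate = TENANTS * BASE_GB                  # one full model per tenant
multi_lora = BASE_GB + TENANTS * ADAPTER_GB   # one base + 30 adapters

print(f"separate instances: {separate:.0f} GB")    # 480 GB
print(f"multi-LoRA serving: {multi_lora:.1f} GB")  # 20.5 GB
```

Under these assumptions, multi-LoRA serving uses roughly 4% of the VRAM of running separate instances.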
How It Works
Base model weights live on the GPU. Each LoRA adapter is a pair of low-rank matrices. At inference time, the engine computes the effective weight as base + adapter for each request’s assigned adapter. With efficient batched linear algebra, multiple adapters can run concurrently in the same batch.
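The math above can be sketched in a few lines of NumPy. This is a conceptual illustration, not LoRAX's actual batched kernels: each adapter is a low-rank pair (A, B), and the effective output x·(W + BA)ᵀ can be computed as the shared base matmul plus a cheap adapter-specific correction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                       # hidden size, LoRA rank (r << d)
W = rng.standard_normal((d, d))    # base weight, loaded on the GPU once

# Tiny per-tenant matrices; names are illustrative.
adapters = {
    "tenant-42": (rng.standard_normal((r, d)) * 0.01,   # A: r x d
                  rng.standard_normal((d, r)) * 0.01),  # B: d x r
}

def forward(x, adapter_id):
    A, B = adapters[adapter_id]
    # Shared base matmul + low-rank correction; equivalent to x @ (W + B @ A).T
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
y = forward(x, "tenant-42")
```

Because the correction factors through the rank-r bottleneck, per-adapter cost is O(d·r) extra work instead of a second full O(d²) matmul, which is what makes batching many adapters together cheap.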
Setup
LoRAX has a Docker-based deployment:
```shell
docker run -p 8080:80 -v $PWD/adapters:/adapters \
  --gpus all ghcr.io/predibase/lorax:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --adapter-source local \
  --source /adapters
```
Place adapters in ./adapters/{adapter_name}/. Each request then specifies which adapter to use:
```shell
curl http://localhost:8080/generate \
  -d '{"inputs":"Hello","parameters":{"adapter_id":"tenant-42"}}'
```
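The same call from Python, routing each tenant to its own adapter. A minimal stdlib-only sketch; the endpoint and payload mirror the curl example above, while the function names, the `max_new_tokens` value, and the `generated_text` response field are assumptions to verify against your LoRAX version:

```python
import json
from urllib.request import Request, urlopen

def build_payload(prompt: str, adapter_id: str, max_new_tokens: int = 64) -> dict:
    # Mirrors the curl example: the adapter is selected per request.
    return {"inputs": prompt,
            "parameters": {"adapter_id": adapter_id,
                           "max_new_tokens": max_new_tokens}}

def generate(prompt: str, adapter_id: str,
             host: str = "http://localhost:8080") -> str:
    req = Request(f"{host}/generate",
                  data=json.dumps(build_payload(prompt, adapter_id)).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read()).get("generated_text", "")

# Usage (assumes a running LoRAX server and an adapter named "tenant-42"):
# print(generate("Hello", "tenant-42"))
```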
Limits
LoRAX works well up to roughly 50-100 concurrent adapters on a modern card. Beyond that, adapter loading latency and VRAM pressure add up. All adapters must be trained on the same base model – you cannot mix Llama 3 8B adapters with Mistral 7B adapters on one LoRAX instance.
| Pattern | Fits on Dedicated GPU |
|---|---|
| 1 base + 5-10 LoRAs | Any card that holds base, easy |
| 1 base + 30-50 LoRAs | 24-32 GB card comfortable |
| 1 base + 100+ LoRAs | 48 GB+ card, expect some cold-start latency |
Multi-Tenant LoRA Hosting Made Simple
One dedicated GPU can serve dozens of fine-tuned variants – we help configure LoRAX end-to-end.
Browse GPU Servers

For single-LoRA serving, the base vLLM path also works. See QLoRA training and AI for agencies multi-client.