
LoRAX Multi-LoRA Serving on a Dedicated GPU

One base model, many LoRA adapters, one GPU - how to serve dozens of fine-tuned variants without running dozens of model replicas.

If you have fine-tuned many small LoRA adapters on top of one base model, running each variant as a separate vLLM instance wastes VRAM. LoRAX (and similar multi-LoRA serving engines) lets you load the base model once and dynamically apply the right adapter per request on dedicated GPU hosting.

When Multi-LoRA Pays

A SaaS with per-tenant fine-tunes: 30 customers, each with a LoRA trained on their documents. Running 30 separate 8B model instances requires 30x the VRAM. LoRAX loads one base 8B model plus 30 small LoRA adapters (typically 50-200 MB each) and switches adapter per request. Total VRAM: one model + 30 tiny adapters.
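The VRAM arithmetic behind that claim can be sketched in a few lines. The numbers here are illustrative back-of-envelope figures (an 8B model in FP16 is roughly 16 GB of weights; the adapter size uses the upper end of the 50-200 MB range from above), not measurements:

```python
# Back-of-envelope VRAM comparison for 30 tenants (illustrative, weights only --
# KV cache and activations add more on top in both cases).
BASE_GB = 16.0      # ~8B params x 2 bytes (FP16)
ADAPTER_GB = 0.2    # upper end of the 50-200 MB per-adapter range
TENANTS = 30

separate_instances = TENANTS * BASE_GB        # one full replica per tenant
multi_lora = BASE_GB + TENANTS * ADAPTER_GB   # one base + 30 small adapters

print(f"30 replicas: {separate_instances:.0f} GB")  # 480 GB
print(f"multi-LoRA:  {multi_lora:.0f} GB")          # 22 GB
```

Even with the most generous adapter size, the multi-LoRA setup stays in single-GPU territory while the replica-per-tenant approach needs a small cluster.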

How It Works

Base model weights live on the GPU. Each LoRA adapter is a pair of low-rank matrices. At inference time, the engine computes the effective weight as base + adapter for each request’s assigned adapter. With efficient batched linear algebra, multiple adapters can run concurrently in the same batch.
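A minimal NumPy sketch of the low-rank idea (toy sizes, not LoRAX internals): applying the merged weight `W + B @ A` gives the same result as applying the base and the low-rank delta separately, which is what lets a serving engine keep one frozen base and add a per-request delta.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8               # toy dimensions; real LoRA ranks are often 8-64

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # adapter: low-rank pair (r << d)
B = rng.standard_normal((d_out, r)) * 0.01
x = rng.standard_normal(d_in)

y_merged = (W + B @ A) @ x               # what a merged fine-tune would compute
y_split = W @ x + B @ (A @ x)            # base pass plus cheap low-rank correction
assert np.allclose(y_merged, y_split)
```

Because the delta path `B @ (A @ x)` is tiny, a batched kernel can evaluate a different `(A, B)` pair per row of the batch without duplicating `W`.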

Setup

LoRAX has a Docker-based deployment:

docker run -p 8080:80 -v $PWD/adapters:/adapters \
  --gpus all ghcr.io/predibase/lorax:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --adapter-source local \
  --source /adapters
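The command above mounts ./adapters into the container. A quick pre-flight check that each adapter directory looks complete can save a confusing startup failure; the filenames below assume the standard Hugging Face PEFT layout (adapter_config.json plus adapter_model.safetensors or adapter_model.bin), which is what most LoRA training workflows produce:

```python
import os

REQUIRED = {"adapter_config.json"}
WEIGHTS = {"adapter_model.safetensors", "adapter_model.bin"}

def check_adapter_dir(path):
    """Return a list of problems for one adapter directory (empty list = looks OK)."""
    files = set(os.listdir(path))
    problems = [f"missing {name}" for name in REQUIRED - files]
    if not files & WEIGHTS:
        problems.append("missing adapter weights (.safetensors or .bin)")
    return problems
```

Run it over every subdirectory of ./adapters before starting the container.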

Place adapters in ./adapters/{adapter_name}/. Each request then names which adapter to apply:

curl http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Hello","parameters":{"adapter_id":"tenant-42"}}'
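From application code, the same call is easier to manage with a small helper. This is a hypothetical client-side sketch: the payload shape ({"inputs": ..., "parameters": {"adapter_id": ...}}) mirrors the curl example above, and the helper name and extra parameters are our own additions:

```python
import json

def build_payload(prompt, adapter_id, **params):
    """Build the JSON body for /generate; adapter_id selects the tenant's LoRA."""
    return {"inputs": prompt, "parameters": {"adapter_id": adapter_id, **params}}

body = build_payload("Hello", "tenant-42", max_new_tokens=64)
print(json.dumps(body))
```

POST the serialized body to http://localhost:8080/generate with a Content-Type: application/json header, exactly as the curl call does.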

Limits

LoRAX works well up to roughly 50-100 concurrently active adapters on a modern card. Beyond that, adapter loading latency and VRAM pressure add up. All adapters must also target the same base model – you cannot mix Llama 3 8B adapters with Mistral 7B adapters on one LoRAX instance.

Pattern               | Fits on Dedicated GPU
1 base + 5-10 LoRAs   | Any card that holds the base; easy
1 base + 30-50 LoRAs  | 24-32 GB card, comfortable
1 base + 100+ LoRAs   | 48 GB+ card; expect some cold-start latency

Multi-Tenant LoRA Hosting Made Simple

One dedicated GPU can serve dozens of fine-tuned variants – we help configure LoRAX end-to-end.

Browse GPU Servers

For single-LoRA serving, the plain vLLM path also works. See our guides on QLoRA training and AI for agencies (multi-client setups).



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
