The economics of a SaaS AI product hinge on how many customers you can pack onto one GPU without ruining tail latency. The RTX 5060 Ti 16GB on UK dedicated GPU hosting – Blackwell GB206, 16 GB GDDR7, native FP8 – supports 30-50 per-tenant LoRA adapters on top of a shared Llama 3.1 8B FP8 base, giving every customer the feeling of a bespoke model on shared hardware.
Serving pattern
One vLLM or LoRAX process owns the card. A single base model (Llama 3.1 8B FP8, 9.2 GB) handles the bulk of VRAM; per-tenant LoRA adapters at rank 16 weigh 30-80 MB each and stream into a shared adapter pool at request time. A lightweight nginx layer keyed on API key routes each call to the correct adapter ID and enforces per-tenant rate zones.
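In request terms, the routing boils down to rewriting the `model` field: vLLM's OpenAI-compatible server selects a served LoRA (registered at startup via `--lora-modules`) by the model name in the request body. A minimal sketch, with an illustrative tenant-to-adapter mapping and adapter names (in production the nginx layer derives these from the API key):

```python
import json

# Illustrative API-key/tenant -> adapter mapping; in production the nginx
# layer (or app gateway) resolves this before the call reaches vLLM.
ADAPTERS = {"acme": "acme-support-lora", "globex": "globex-support-lora"}
BASE_MODEL = "llama-3.1-8b-fp8"  # name the shared base model is served under

def completion_payload(tenant: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style completion body; vLLM picks the LoRA adapter
    by the `model` field, and unknown tenants fall back to the base model."""
    return {
        "model": ADAPTERS.get(tenant, BASE_MODEL),
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

# POST this JSON to http://<host>:8000/v1/completions
print(json.dumps(completion_payload("acme", "Summarise this ticket:")))
```

The base-model fallback doubles as a safe default for tenants whose adapter has not finished training yet.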
Per-tenant LoRAs
| LoRA rank | Adapter size | Max resident | Swap latency |
|---|---|---|---|
| 8 | 18 MB | ~200 | 12 ms |
| 16 | 36 MB | ~100 | 22 ms |
| 32 | 72 MB | ~50 | 45 ms |
| 64 | 144 MB | ~25 | 90 ms |
At rank 16 you keep roughly 100 adapters hot in VRAM while Llama 3.1 8B FP8 continues serving at near-base throughput. On a cache miss, LoRAX cold-loads the adapter from NVMe in under 50 ms.
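The adapter sizes in the table follow from LoRA's low-rank factorisation: each target module gets two factors, A (d_in × r) and B (r × d_out), so its parameter count is r × (d_in + d_out). A back-of-envelope estimator, assuming attention-only targets with Llama 3.1 8B's shapes (hidden 4096, GQA K/V dim 1024, 32 layers) and FP16 storage — adding MLP targets or serving-time overhead accounts for the table's somewhat larger figures:

```python
# Llama 3.1 8B attention projection shapes: q, k, v, o (GQA shrinks k/v)
ATTN_TARGETS = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

def lora_adapter_bytes(rank: int, n_layers: int = 32,
                       targets=ATTN_TARGETS, dtype_bytes: int = 2) -> int:
    """Estimate adapter size: each target holds A (d_in x r) and
    B (r x d_out), i.e. r * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets)
    return per_layer * n_layers * dtype_bytes

for r in (8, 16, 32, 64):
    print(f"rank {r}: ~{lora_adapter_bytes(r) / 2**20:.0f} MiB")
```

The key property is the linear scaling with rank, which is why doubling rank halves the number of resident adapters in the table.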
Isolation strategies
- Authentication – per-tenant API keys mapped to adapter IDs and rate-limit zones.
- Rate limiting – nginx `limit_req_zone` per tenant; token-bucket on request count and total tokens/minute.
- Quota accounting – Prometheus counters per tenant for tokens in, tokens out, and adapter load events.
- Data isolation – one Qdrant collection or Postgres pgvector schema per tenant; row-level security policies.
- Noisy-neighbour control – per-tenant max concurrent requests; burst cap via leaky bucket.
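The rate-limiting and noisy-neighbour points above share one mechanism: a per-tenant token bucket, refilled at the tier's steady rate with the burst cap as bucket capacity. A minimal sketch (tier numbers are illustrative, mirroring the Starter tier below):

```python
import time

class TokenBucket:
    """Per-tenant token bucket: `rate` tokens refill per second and
    `burst` caps how far a tenant can spike above its steady rate."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.level, self.stamp = burst, clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill proportionally to elapsed time, then spend `cost`
        (e.g. the request's token count) if the bucket covers it."""
        now = self.clock()
        self.level = min(self.burst, self.level + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.level >= cost:
            self.level -= cost
            return True
        return False

# One bucket per tenant, keyed like the nginx limit_req zones;
# Starter tier: 20k tokens/hour steady, 2k-token burst (illustrative).
buckets = {"tenant-starter-42": TokenBucket(rate=20_000 / 3600, burst=2_000)}
```

Charging `cost` in tokens rather than requests is what lets one bucket enforce both the request-count and tokens/minute limits.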
The 5060 Ti does not support hardware MIG partitioning, so isolation is logical rather than physical – sufficient for B2B SaaS where tenants trust the provider.
Capacity per tenant
| Tenant tier | Rate limit | Concurrent | Supported tenants |
|---|---|---|---|
| Starter | 20k tokens/hour | 2 | ~200 |
| Growth | 120k tokens/hour | 5 | ~50 |
| Pro | 600k tokens/hour | 10 | ~10 |
Mix tiers to match your customer pyramid. With Llama 3.1 8B FP8 aggregating 720 t/s at batch 32, one 5060 Ti can generate roughly 2.6M tokens per hour flat out; budgeting for 50% average utilisation leaves about 1.3M billable tokens per hour.
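The tier caps above deliberately over-subscribe the card's throughput ceiling, because rate limits are peaks, not averages. A quick arithmetic check using the table's numbers (the `oversubscription` helper is illustrative):

```python
HOURLY_CEILING = 720 * 3600  # 720 t/s at batch 32 -> tokens/hour flat out

TIER_CAPS = {"starter": 20_000, "growth": 120_000, "pro": 600_000}  # tokens/hour

def oversubscription(mix: dict) -> float:
    """Ratio of summed per-tenant rate caps to the card's hourly ceiling.
    Values above 1 mean tiers are over-sold against peak capacity, which
    is viable as long as average utilisation sits well below the caps."""
    demand = sum(TIER_CAPS[tier] * n for tier, n in mix.items())
    return demand / HOURLY_CEILING

# 200 Starter + 50 Growth + 10 Pro tenants on one card
print(f"{oversubscription({'starter': 200, 'growth': 50, 'pro': 10}):.1f}x")
```

An oversubscription ratio in the mid single digits is typical for SaaS workloads with bursty, uncorrelated tenant traffic.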
Per-tenant RAG
Co-host a BGE-base embedder (10,200 texts/sec) and BGE-reranker-base (3,200 pairs/sec) on the same card to give each tenant their own RAG corpus without an extra GPU. Index data goes to per-tenant Qdrant collections; the embedding and reranker calls run through the shared endpoints with tenant-scoped collection names. See our SaaS RAG architecture for the full build.
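Tenant scoping in the RAG path reduces to deriving the collection name from the tenant ID before every search. A minimal sketch against Qdrant's REST points-search endpoint; the naming scheme and helper names are assumptions, not a fixed convention:

```python
def tenant_collection(tenant_id: str) -> str:
    """One Qdrant collection per tenant keeps corpora physically separate;
    normalise the ID so it is safe in a collection name."""
    return "rag_" + tenant_id.lower().replace("-", "_")

def search_body(query_vector: list, top_k: int = 5) -> dict:
    """Body for POST /collections/<name>/points/search: nearest-neighbour
    search over the tenant's own collection, payloads included."""
    return {"vector": query_vector, "limit": top_k, "with_payload": True}

# Embed the query via the shared BGE-base endpoint, then:
# POST search_body(vec) to /collections/{tenant_collection('acme-ltd')}/points/search
print(tenant_collection("acme-ltd"))
```

Because the collection name is derived server-side from the authenticated tenant, a client can never name another tenant's collection directly.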
Multi-tenant AI SaaS on Blackwell 16GB
30-50 LoRA tenants on one base model. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: vLLM setup, FP8 Llama deployment, embedding server, reranker server.