Benchmark Overview
Running multiple models on a single GPU is increasingly common for production AI services that need a coding assistant, a general chatbot, and an embedding model simultaneously. We tested 2-, 3-, and 4-model configurations on RTX 5090 (24GB) and RTX 6000 Pro (96GB) GPUs to measure throughput degradation, VRAM limits, and practical serving boundaries on dedicated GPU hosting.
Test Configuration
Models tested: Llama 3 8B (INT4, 5GB VRAM), Mistral 7B (INT4, 4.5GB VRAM), CodeLlama 7B (INT4, 4.5GB VRAM), and BGE-Large embedding model (0.5GB VRAM). Served via vLLM with separate model instances sharing the same GPU. Each model received 10 concurrent requests during benchmarking. See token speed benchmarks for single-model baselines.
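A setup along these lines can be sketched as separate vLLM servers splitting one GPU by memory fraction. The model IDs, ports, AWQ quantization (one common 4-bit scheme), and fraction values below are illustrative assumptions, not the exact benchmark configuration:

```shell
# Two vLLM instances sharing one GPU, each pinned to a VRAM fraction.
# Model IDs, ports, and fractions are illustrative assumptions.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization awq \
  --gpu-memory-utilization 0.45 \
  --max-model-len 4096 \
  --port 8001 &

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization awq \
  --gpu-memory-utilization 0.40 \
  --max-model-len 4096 \
  --port 8002 &
```

Each `--gpu-memory-utilization` value is the fraction of total VRAM that server may claim for weights plus KV cache; the fractions must sum to less than 1.0, or the second server will fail to allocate its share.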
Multi-Model Throughput Results
| Configuration | GPU | Total VRAM Used | Throughput per Model | Total Throughput |
|---|---|---|---|---|
| 1 model (8B INT4) | RTX 5090 | 5 GB | 95 tok/s | 95 tok/s |
| 2 models (8B + 7B) | RTX 5090 | 10 GB | 78 tok/s each | 156 tok/s |
| 3 models (8B + 7B + 7B) | RTX 5090 | 14.5 GB | 52 tok/s each | 156 tok/s |
| 2 models (8B + 7B) | RTX 6000 Pro 96 GB | 10 GB | 88 tok/s each | 176 tok/s |
| 4 models (8B + 7B + 7B + embed) | RTX 6000 Pro 96 GB | 15 GB | 72 tok/s each LLM | 225 tok/s |
VRAM Allocation Strategy
Each model instance requires its own VRAM for weights plus KV cache. With 2 models on a 24GB RTX 5090, approximately 10GB goes to model weights and 14GB remains for KV caches. At 3 models, KV cache space drops to 9.5GB, limiting concurrent users per model to roughly 8-10 at 4K context. The RTX 6000 Pro with 96GB provides comfortable headroom for 4 models with generous KV caches. Select hardware from the GPU guide.
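The arithmetic above can be sketched in a few lines. The KV-cache geometry used here (32 layers, 8 KV heads, head dim 128, fp16 cache) is an assumed Llama-3-8B-class configuration, not a measured value:

```python
# Back-of-envelope VRAM budget for multi-model serving.
# Weight totals come from the table above; the KV-cache geometry
# (32 layers, 8 KV heads, head dim 128, fp16) is an assumed
# Llama-3-8B-class configuration.

def kv_budget_gb(total_vram_gb: float, weights_used_gb: float) -> float:
    """VRAM left for KV caches after model weights are loaded."""
    return total_vram_gb - weights_used_gb

def kv_gb_per_user(context_tokens: int = 4096, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> float:
    # K and V tensors per token, per layer, stored in fp16 (2 bytes)
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1024**3

# 24 GB RTX 5090, two LLMs (10 GB of weights): 14 GB left for KV
print(kv_budget_gb(24, 10))        # 14.0
# Three LLMs (14.5 GB of weights): 9.5 GB left
print(kv_budget_gb(24, 14.5))      # 9.5
# A full 4K context costs roughly 0.5 GB of KV cache per user
print(kv_gb_per_user())            # 0.5
```

Splitting 9.5GB across three models leaves about 3.2GB of cache each, or roughly six full 4K contexts; the 8-10 concurrent users cited above remain plausible because vLLM's PagedAttention allocates KV pages on demand, so most in-flight requests hold less than a full context.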
Throughput Degradation Analysis
Adding a second model to a GPU reduces per-model throughput by 15-20% due to memory bandwidth sharing. Adding a third costs a further 25-40% relative to the two-model configuration. Beyond 3 concurrent LLM instances, degradation becomes steep because GPU compute units are fully contested. The embedding model adds minimal overhead since it runs short, batch-oriented inference rather than autoregressive generation. Use Ollama for simpler multi-model setups, or compare serving engines for your use case.
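For capacity planning, the measured drop-offs can be folded into a rough estimator. The retention factors below are rounded from the RTX 5090 rows of the table; this is a hypothetical helper, not part of the benchmark harness:

```python
# Rough per-model throughput estimate as LLM instances are added to one
# GPU. Retention factors are rounded from the RTX 5090 measurements:
# roughly 18% loss for the second model, roughly 33% more for the third.

BASELINE_TOK_S = 95.0        # single 8B INT4 model on RTX 5090
RETENTION = [0.82, 0.67]     # per-step retention for models 2 and 3

def per_model_tok_s(n_models: int) -> float:
    t = BASELINE_TOK_S
    for factor in RETENTION[: n_models - 1]:
        t *= factor
    return t

for n in (1, 2, 3):
    per = round(per_model_tok_s(n))
    print(n, per, n * per)
# Aggregate throughput plateaus near 156 tok/s: memory bandwidth,
# not the number of models, sets the ceiling.
```

Note that the aggregate barely moves between two and three models, matching the table: adding instances past the bandwidth ceiling only dilutes per-model speed.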
Practical Recommendations
On a 24GB GPU, serve a maximum of 2 LLM instances plus 1 embedding model. On a 96GB GPU, 3-4 LLM instances work well. For heavier multi-model workloads, use multi-GPU clusters with one model per GPU for optimal throughput. Deploy multi-model configurations on GigaGPU dedicated servers with private AI hosting. Check the benchmarks section and LLM hosting guide for more data.