You will run four models on a single GPU simultaneously: an LLM for text generation, a vision model for image analysis, an embedding model for search, and a TTS model for audio. A startup serving all four capabilities from one RTX 4090 (24 GB) saves over 70% versus renting four separate GPU instances. The key is careful VRAM management and inference scheduling on a dedicated GPU server.
VRAM Budget Planning
| Model | Precision | VRAM (idle) | VRAM (inference) |
|---|---|---|---|
| LLaMA 3.1 8B (Q4) | 4-bit GPTQ | ~5.5 GB | ~7.5 GB |
| Florence-2-large | float16 | ~1.5 GB | ~2.5 GB |
| BGE-large-en | float16 | ~1.3 GB | ~1.8 GB |
| Coqui XTTS-v2 | float16 | ~1.8 GB | ~2.5 GB |
| Total | | ~10.1 GB | ~14.3 GB |
With all four models loaded, peak VRAM usage stays under 15 GB, leaving 9 GB of headroom on a 24 GB GPU for KV cache and batch processing.
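The budget arithmetic is worth sanity-checking before committing to hardware. A minimal sketch using the table's estimates (these are the estimates above, not measured values):

```python
# Per-model VRAM estimates from the budget table, in GB
budget = {
    "llm":        {"idle": 5.5, "inference": 7.5},
    "vision":     {"idle": 1.5, "inference": 2.5},
    "embeddings": {"idle": 1.3, "inference": 1.8},
    "tts":        {"idle": 1.8, "inference": 2.5},
}

GPU_VRAM_GB = 24.0

total_idle = sum(m["idle"] for m in budget.values())
total_peak = sum(m["inference"] for m in budget.values())
headroom = GPU_VRAM_GB - total_peak  # Left over for KV cache and batching

print(f"idle {total_idle:.1f} GB, peak {total_peak:.1f} GB, headroom {headroom:.1f} GB")
# → idle 10.1 GB, peak 14.3 GB, headroom 9.7 GB
```

Note the peak figure assumes only one model runs inference at a time; if two run concurrently, budget for both inference columns at once.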
Lazy Model Loading
import torch
from threading import Lock

class ModelManager:
    def __init__(self, max_vram_gb: float = 22.0):
        self.models = {}
        self.max_vram = max_vram_gb * 1024**3
        self.load_order = []  # LRU tracking: least recently used first

    def register(self, name: str, loader_fn, vram_gb: float):
        self.models[name] = {
            "loader": loader_fn, "model": None,
            "vram": vram_gb, "lock": Lock()
        }

    def get(self, name: str):
        entry = self.models[name]
        with entry["lock"]:
            if entry["model"] is None:
                self._ensure_vram(entry["vram"])
                entry["model"] = entry["loader"]()
                self.load_order.append(name)
            elif name in self.load_order:
                # Move to end of LRU order
                self.load_order.remove(name)
                self.load_order.append(name)
            return entry["model"]

    def _ensure_vram(self, needed_gb: float):
        # Evict least recently used models until enough VRAM is free
        free = torch.cuda.mem_get_info()[0] / 1024**3
        while free < needed_gb and self.load_order:
            oldest = self.load_order.pop(0)
            self._unload(oldest)
            free = torch.cuda.mem_get_info()[0] / 1024**3

    def _unload(self, name: str):
        entry = self.models[name]
        if entry["model"] is not None:
            entry["model"] = None  # Drop the only reference to the model
            torch.cuda.empty_cache()

manager = ModelManager(max_vram_gb=22.0)
The model manager uses LRU eviction: when VRAM runs low, the least recently used model gets unloaded. For workloads where all models are needed constantly, keep them resident and rely on quantisation to fit within budget.
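The eviction order can be exercised without a GPU by tracking accesses the same way `load_order` does. This sketch (dummy model names, no torch) shows which model would be evicted first:

```python
load_order = []  # Least recently used first, as in ModelManager

def touch(name: str):
    """Record a model access, mirroring ModelManager.get()."""
    if name in load_order:
        load_order.remove(name)
    load_order.append(name)

def evict() -> str:
    """Pop the least recently used model, as _ensure_vram() does."""
    return load_order.pop(0)

for name in ["llm", "vision", "embeddings", "llm"]:
    touch(name)

# "llm" was touched again, so "vision" is now the oldest entry
print(load_order)  # → ['vision', 'embeddings', 'llm']
print(evict())     # → vision
```

This is why a burst of chat traffic never evicts the LLM itself: every access pushes it back to the end of the queue.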
Loading Each Model
# Register all models with their loaders
manager.register("llm", lambda: load_vllm_model(
    "meta-llama/Llama-3.1-8B-Instruct", quantization="gptq"), vram_gb=7.5)
manager.register("vision", lambda: load_florence(
    "microsoft/Florence-2-large"), vram_gb=2.5)
manager.register("embeddings", lambda: load_sentence_transformer(
    "BAAI/bge-large-en-v1.5"), vram_gb=1.8)
manager.register("tts", lambda: load_coqui_xtts(
    "tts_models/multilingual/multi-dataset/xtts_v2"), vram_gb=2.5)
# Usage
def process_request(request_type: str, data: dict):
    if request_type == "chat":
        model = manager.get("llm")
        return model.generate(data["prompt"])
    elif request_type == "describe_image":
        model = manager.get("vision")
        return model.describe(data["image"])
    elif request_type == "embed":
        model = manager.get("embeddings")
        return model.encode(data["texts"])
    elif request_type == "speak":
        model = manager.get("tts")
        return model.synthesize(data["text"])
    raise ValueError(f"Unknown request type: {request_type}")
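The same routing can be written as a dispatch table, which makes adding a fifth model a one-line change. A sketch with stub handlers standing in for the real models (the `make_dispatcher` helper and the tagging stubs are illustrative, not part of the code above):

```python
def make_dispatcher(get_model):
    # Map each request type to a handler that fetches the right model
    handlers = {
        "chat":           lambda d: get_model("llm")(d["prompt"]),
        "describe_image": lambda d: get_model("vision")(d["image"]),
        "embed":          lambda d: get_model("embeddings")(d["texts"]),
        "speak":          lambda d: get_model("tts")(d["text"]),
    }

    def dispatch(request_type: str, data: dict):
        if request_type not in handlers:
            raise ValueError(f"Unknown request type: {request_type}")
        return handlers[request_type](data)

    return dispatch

# Stub "models" that just tag their input, for demonstration
dispatch = make_dispatcher(lambda name: (lambda x: f"{name}:{x}"))
print(dispatch("chat", {"prompt": "hi"}))  # → llm:hi
```

In production, `get_model` would be `manager.get` and each handler would call the real model method (`generate`, `describe`, and so on).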
Request Scheduling
from fastapi import FastAPI
import asyncio

app = FastAPI()
semaphore = asyncio.Semaphore(1)  # Serialise GPU-heavy operations

@app.post("/inference")
async def inference(request_type: str, data: dict):
    async with semaphore:
        result = await asyncio.to_thread(
            process_request, request_type, data
        )
    return {"result": result}

# For batch workloads, group by model type to minimise swaps
@app.post("/batch")
async def batch_inference(requests: list):
    # Sort by type to minimise model loading/unloading
    sorted_reqs = sorted(requests, key=lambda r: r["type"])
    results = []
    for req in sorted_reqs:
        async with semaphore:
            result = await asyncio.to_thread(
                process_request, req["type"], req["data"]
            )
        results.append(result)
    return {"results": results}
The semaphore prevents concurrent GPU inference that would cause OOM errors. Batch requests are sorted by model type to minimise VRAM thrashing from loading and unloading models.
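The benefit of sorting is easy to quantify: each change of model type in the request stream costs at most one load/unload cycle. A small sketch counting those transitions (the request mix is illustrative):

```python
def model_swaps(request_types):
    """Count adjacent type changes: an upper bound on load/unload cycles."""
    return sum(1 for a, b in zip(request_types, request_types[1:]) if a != b)

incoming = ["chat", "embed", "chat", "speak", "embed", "chat"]
print(model_swaps(incoming))          # → 5: every request switches models
print(model_swaps(sorted(incoming)))  # → 2: chat ×3, embed ×2, speak ×1
```

With all four models resident (as in the sub-15 GB budget above) swaps never happen and sorting only matters once eviction kicks in.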
Monitoring and Production
Track VRAM usage per model with torch.cuda.memory_allocated() and expose metrics via Prometheus. Set alerts when free VRAM drops below 2 GB. On a 24 GB RTX 4090, all four models run comfortably with room for growth. For heavier LLMs (70B parameters), upgrade to a 48 GB card such as the RTX 6000 Ada or an 80 GB card such as the A100. Deploy on private infrastructure for data isolation. See vLLM hosting for optimised LLM serving and Whisper hosting for adding speech-to-text.
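The alert check itself is a one-liner. This sketch takes the free-bytes figure as an argument (in production it would come from `torch.cuda.mem_get_info()[0]`) so the threshold logic can be tested off-GPU; the function name is illustrative:

```python
LOW_VRAM_ALERT_GB = 2.0  # Alert threshold from the text above

def vram_alert(free_bytes: int, threshold_gb: float = LOW_VRAM_ALERT_GB) -> bool:
    """True when free VRAM has dropped below the alert threshold."""
    return free_bytes / 1024**3 < threshold_gb

# free_bytes would normally be torch.cuda.mem_get_info()[0]
print(vram_alert(9 * 1024**3))     # → False: 9 GB free, healthy
print(vram_alert(1536 * 1024**2))  # → True: 1.5 GB free, fire the alert
```

Wire this into a Prometheus gauge and evaluate it on a fixed scrape interval rather than per request, so a transient inference spike does not page anyone.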
Multi-Model GPU Servers
Run multiple AI models on a single dedicated UK GPU server with intelligent VRAM management.
Browse GPU Servers