
Multi-Model Pipeline on One GPU

Run multiple AI models (LLM, vision, embedding, TTS) on a single GPU by managing VRAM allocation, model loading, and inference scheduling without running out of memory.

You will run four models on a single GPU simultaneously: an LLM for text generation, a vision model for image analysis, an embedding model for search, and a TTS model for audio. A startup serving all four capabilities from one RTX 5090 (32 GB) saves over 70% versus renting four separate GPU instances. The key is careful VRAM management and inference scheduling on a dedicated GPU server.

VRAM Budget Planning

| Model              | Precision  | VRAM (idle) | VRAM (inference) |
|--------------------|------------|-------------|------------------|
| LLaMA 3.1 8B (Q4)  | 4-bit GPTQ | ~5.5 GB     | ~7.5 GB          |
| Florence-2-large   | float16    | ~1.5 GB     | ~2.5 GB          |
| BGE-large-en       | float16    | ~1.3 GB     | ~1.8 GB          |
| Coqui XTTS-v2      | float16    | ~1.8 GB     | ~2.5 GB          |
| Total              |            | ~10.1 GB    | ~14.3 GB         |

With all four models loaded, peak VRAM usage stays under 15 GB, leaving roughly 17 GB of headroom on a 32 GB GPU for KV cache and batch processing.
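A quick sanity check on the budget can be scripted before anything is loaded. This is a minimal sketch; the per-model figures are the inference-time estimates from the table above, and the 2 GB reserve is an assumed safety margin:

```python
# Estimated inference-time VRAM per model, in GB (figures from the table above)
VRAM_BUDGET = {
    "llm": 7.5,         # LLaMA 3.1 8B, 4-bit GPTQ
    "vision": 2.5,      # Florence-2-large, float16
    "embeddings": 1.8,  # BGE-large-en, float16
    "tts": 2.5,         # Coqui XTTS-v2, float16
}

def headroom_gb(total_gb: float, reserve_gb: float = 2.0) -> float:
    """GB left for KV cache and batching after all models plus a safety reserve."""
    return total_gb - sum(VRAM_BUDGET.values()) - reserve_gb

peak = round(sum(VRAM_BUDGET.values()), 1)
print(peak)  # 14.3, matching the table total
```

Run this against your card's actual capacity before committing to a model mix; if `headroom_gb` comes back near zero, drop a model or quantise harder.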

Lazy Model Loading

import torch
from threading import Lock

class ModelManager:
    def __init__(self, max_vram_gb: float = 22.0):
        self.models = {}
        self.max_vram_gb = max_vram_gb  # soft budget; keep headroom below physical VRAM
        self.load_order = []  # LRU tracking, least recently used first

    def register(self, name: str, loader_fn, vram_gb: float):
        self.models[name] = {
            "loader": loader_fn, "model": None,
            "vram": vram_gb, "lock": Lock()
        }

    def get(self, name: str):
        entry = self.models[name]
        with entry["lock"]:
            if entry["model"] is None:
                self._ensure_vram(entry["vram"])
                entry["model"] = entry["loader"]()
                self.load_order.append(name)
            else:
                # Move to end of LRU
                if name in self.load_order:
                    self.load_order.remove(name)
                self.load_order.append(name)
            return entry["model"]

    def _ensure_vram(self, needed_gb: float):
        free_gb = torch.cuda.mem_get_info()[0] / 1024**3
        loaded_gb = sum(m["vram"] for m in self.models.values()
                        if m["model"] is not None)
        # Evict least recently used models until the request fits both the
        # physically free memory and the soft budget.
        while self.load_order and (free_gb < needed_gb
                                   or loaded_gb + needed_gb > self.max_vram_gb):
            oldest = self.load_order.pop(0)
            loaded_gb -= self.models[oldest]["vram"]
            self._unload(oldest)
            free_gb = torch.cuda.mem_get_info()[0] / 1024**3

    def _unload(self, name: str):
        entry = self.models[name]
        if entry["model"] is not None:
            entry["model"] = None  # drop the reference
            torch.cuda.empty_cache()  # return freed blocks to the driver

manager = ModelManager(max_vram_gb=22.0)

The model manager uses LRU eviction: when VRAM runs low, the least recently used model gets unloaded. For workloads where all models are needed constantly, keep them resident and rely on quantisation to fit within budget.
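The eviction order is easiest to see in isolation. The sketch below mirrors the manager's `load_order` bookkeeping with no CUDA calls; `LRUTracker` and the `vram` figures are illustrative, not part of the manager above:

```python
class LRUTracker:
    """Stripped-down mirror of ModelManager's LRU bookkeeping, GPU-free."""
    def __init__(self, budget_gb: float):
        self.budget = budget_gb
        self.loaded = {}   # name -> vram_gb
        self.order = []    # least recently used first

    def touch(self, name: str, vram_gb: float):
        if name in self.loaded:
            self.order.remove(name)  # already resident: just bump recency
        else:
            # Evict LRU entries until the new model fits the budget
            while self.order and sum(self.loaded.values()) + vram_gb > self.budget:
                evicted = self.order.pop(0)
                del self.loaded[evicted]
            self.loaded[name] = vram_gb
        self.order.append(name)  # most recently used at the end

t = LRUTracker(budget_gb=10.0)
t.touch("llm", 7.5)
t.touch("vision", 2.5)   # exactly at budget
t.touch("llm", 7.5)      # reuse: llm becomes most recent
t.touch("tts", 2.5)      # over budget: evicts "vision", the least recently used
print(t.order)           # ['llm', 'tts']
```

Note that the hot model (`llm`) survives the eviction because reuse moved it to the recent end of the queue.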

Loading Each Model

# Register all models with their loaders. The load_* helpers below are
# placeholders for each library's own loading call (vLLM, Transformers,
# sentence-transformers, Coqui TTS).
manager.register("llm", lambda: load_vllm_model(
    "meta-llama/Llama-3.1-8B-Instruct", quantization="gptq"), vram_gb=7.5)

manager.register("vision", lambda: load_florence(
    "microsoft/Florence-2-large"), vram_gb=2.5)

manager.register("embeddings", lambda: load_sentence_transformer(
    "BAAI/bge-large-en-v1.5"), vram_gb=1.8)

manager.register("tts", lambda: load_coqui_xtts(
    "tts_models/multilingual/multi-dataset/xtts_v2"), vram_gb=2.5)

# Usage
def process_request(request_type: str, data: dict):
    if request_type == "chat":
        model = manager.get("llm")
        return model.generate(data["prompt"])
    elif request_type == "describe_image":
        model = manager.get("vision")
        return model.describe(data["image"])
    elif request_type == "embed":
        model = manager.get("embeddings")
        return model.encode(data["texts"])
    elif request_type == "speak":
        model = manager.get("tts")
        return model.synthesize(data["text"])
    else:
        raise ValueError(f"unknown request type: {request_type!r}")
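The same routing can be expressed as a dispatch table, which stays declarative as capabilities are added. This is an illustrative alternative, not the service code: the handlers are stand-in lambdas where the real service would call `manager.get(...)` and the model's own method:

```python
# Stand-in handlers; in production each would call manager.get(...) and
# invoke generate/describe/encode/synthesize on the returned model.
HANDLERS = {
    "chat":           lambda data: f"generated:{data['prompt']}",
    "describe_image": lambda data: f"caption:{data['image']}",
    "embed":          lambda data: [len(t) for t in data["texts"]],
    "speak":          lambda data: f"audio:{data['text']}",
}

def dispatch_request(request_type: str, data: dict):
    try:
        handler = HANDLERS[request_type]
    except KeyError:
        raise ValueError(f"unknown request type: {request_type!r}") from None
    return handler(data)

print(dispatch_request("embed", {"texts": ["hi", "hello"]}))  # [2, 5]
```

Adding a fifth capability then means one new dictionary entry rather than another elif branch.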

Request Scheduling

from fastapi import FastAPI
import asyncio

app = FastAPI()
semaphore = asyncio.Semaphore(1)  # Serialise GPU-heavy operations

@app.post("/inference")
async def inference(request_type: str, data: dict):
    async with semaphore:
        result = await asyncio.to_thread(
            process_request, request_type, data
        )
    return {"result": result}

# For batch workloads, group by model type to minimise swaps
@app.post("/batch")
async def batch_inference(requests: list):
    # Sort by type to minimise model loading/unloading
    sorted_reqs = sorted(requests, key=lambda r: r["type"])
    results = []
    for req in sorted_reqs:
        async with semaphore:
            result = await asyncio.to_thread(
                process_request, req["type"], req["data"]
            )
            results.append(result)
    return {"results": results}

The semaphore prevents concurrent GPU inference that would cause OOM errors. Batch requests are sorted by model type to minimise VRAM thrashing from loading and unloading models.
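The payoff from sorting can be quantified: each change of model type in the request sequence is a potential load/unload. A small helper (hypothetical, using only the standard library) counts those switches:

```python
from itertools import groupby

def swap_count(request_types: list) -> int:
    """Number of model switches a sequence of request types forces."""
    return max(0, len([k for k, _ in groupby(request_types)]) - 1)

mixed = ["chat", "embed", "chat", "speak", "embed", "chat"]
print(swap_count(mixed))          # 5 swaps when interleaved
print(swap_count(sorted(mixed)))  # 2 swaps once sorted: chat,chat,chat,embed,embed,speak
```

With four model types, a sorted batch can never trigger more than three swaps regardless of batch size, while an interleaved batch can swap on nearly every request.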

Monitoring and Production

Track VRAM usage per model with torch.cuda.memory_allocated() and expose the figures as Prometheus metrics. Set alerts to fire when free VRAM drops below 2 GB. On a 32 GB RTX 5090, all four models run comfortably with room for growth; for heavier LLMs (70B parameters), step up to a 48 GB or 80 GB-class card. Deploying on private infrastructure keeps your data isolated. See our vLLM hosting guide for optimised LLM serving and our Whisper hosting guide for adding speech-to-text.
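The alert logic itself is trivial to wire up. A minimal sketch, assuming torch and a CUDA device are available at runtime (the `should_alert` predicate and the 2 GB constant follow the guidance above; the Prometheus wiring is left as a comment):

```python
LOW_VRAM_GB = 2.0  # alert threshold from the guidance above

def vram_free_gb() -> float:
    """Free device memory in GB (requires torch and a CUDA device)."""
    import torch
    return torch.cuda.mem_get_info()[0] / 1024**3

def should_alert(free_gb: float, threshold_gb: float = LOW_VRAM_GB) -> bool:
    """True once free VRAM has dropped below the alert threshold."""
    return free_gb < threshold_gb

# In production: export vram_free_gb() as a Prometheus gauge on a timer and
# alert on this predicate.
print(should_alert(1.5), should_alert(9.0))  # True False
```

Keeping the threshold check as a pure function makes it testable without a GPU in the loop.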

Multi-Model GPU Servers

Run multiple AI models on a single dedicated UK GPU server with intelligent VRAM management.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
