Most advice about multi-GPU servers assumes all cards are identical. On our dedicated hosting a useful pattern is mixing GPU tiers in one chassis – a fast modern card for latency-critical work and an older card for bulk batch work. Heterogeneous setups are fully supported by the drivers, cheaper than a homogeneous build, and sometimes a better fit for the workload.
The Pattern
You have two workloads with different SLAs: a latency-critical one (customer-facing chat) and a batch one (overnight summarisation of documents). One RTX 5090 handles the chat; one RTX 3090 handles the batch. Neither workload competes with the other for VRAM or compute. Total cost is lower than two 5090s, and batch capacity is higher than a single 5090 juggling both jobs.
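In practice this means one inference server per physical card, each process pinned to its GPU with `CUDA_VISIBLE_DEVICES`. A minimal sketch, assuming vLLM's `vllm serve` CLI and illustrative model names and ports:

```python
import os
import subprocess

def pin_env(gpu: int) -> dict:
    """Environment that makes only one physical GPU visible to the process."""
    return {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}

def serve_cmd(model: str, port: int) -> list:
    """Command line for one vLLM server instance (flags are illustrative)."""
    return ["vllm", "serve", model, "--port", str(port)]

# GPU 0 (fast card) serves the latency-critical chat model;
# GPU 1 (older card) serves the batch model. Uncomment to launch:
# chat = subprocess.Popen(serve_cmd("meta-llama/Llama-3.1-8B-Instruct", 8001),
#                         env=pin_env(0))
# batch = subprocess.Popen(serve_cmd("meta-llama/Llama-3.1-8B-Instruct", 8002),
#                          env=pin_env(1))
```

Because each process sees exactly one device, neither server can accidentally allocate VRAM on the other card.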
Which Cards Mix Well
| Mix | Good For |
|---|---|
| 5090 + 3090 | Hot path + cold path, CUDA everywhere |
| 6000 Pro + 4060 Ti | Big LLM + small utility (embeddings, rerankers) |
| 5090 + 4060 Ti | SDXL + LLM split |
| Two 3090s + 4060 Ti | TP pair for 70B + utility card |
Do not mix vendors in one chassis when doing tensor parallel – ROCm and CUDA do not share a process. Different-vendor cards can coexist as independent workloads but not as a split model.
What to Avoid
Do not attempt tensor parallel across heterogeneous cards. vLLM will either refuse or produce bizarre performance – whichever GPU is slower becomes the bottleneck for every forward pass. Model sharding assumes roughly equal compute and memory on each participant.
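The bottleneck effect falls out of a one-line model: every forward pass synchronises all shards before the all-reduce, so step time is the maximum of the per-GPU times. A rough sketch with purely illustrative throughput numbers (not benchmarks):

```python
# In tensor parallel every GPU must finish its shard of each layer before
# the all-reduce, so the step time is the max over participating GPUs.
def tp_step_time(work_per_gpu, gpu_throughput):
    """Time for one synchronised step; units cancel, ratios are what matter."""
    return max(w / t for w, t in zip(work_per_gpu, gpu_throughput))

fast, slow = 200.0, 70.0  # hypothetical TFLOPS ratings, for illustration only

# Even split between a fast and a slow card runs at the slow card's pace –
# exactly the same step time as a pair of slow cards.
mixed_pair = tp_step_time([1.0, 1.0], [fast, slow])
slow_pair = tp_step_time([1.0, 1.0], [slow, slow])
assert mixed_pair == slow_pair
```

The fast card spends most of each step idle, waiting at the synchronisation point – you paid for compute that tensor parallel cannot use.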
Data parallel is where heterogeneous shines – each card runs independently and the load balancer can route to the right tier based on request type.
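The routing logic itself can be trivial. A minimal tier-aware router, assuming one server per GPU; the endpoint URLs and request-type labels are assumptions for illustration:

```python
# One backend per GPU tier; ports match one vLLM server per card.
TIERS = {
    "latency": "http://localhost:8001",  # fast card, customer-facing chat
    "batch": "http://localhost:8002",    # older card, overnight bulk work
}

def route(request_type: str) -> str:
    """Pick the backend for a request; default unknown work to the batch tier."""
    return TIERS.get(request_type, TIERS["batch"])
```

The same split works behind nginx or any HTTP load balancer – the point is that requests are routed by tier, never sharded across cards.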
Custom Multi-GPU Chassis
Mix cards, mix tiers, match your workload mix – we build the chassis to spec.
Browse GPU Servers
Worked Example
A SaaS serving 500 end-users with an 8B chat model (latency target sub-3s) and a batch pipeline that summarises 100,000 support tickets nightly:
- GPU 0: RTX 5080 running vLLM on Llama 3 8B INT8, port 8001
- GPU 1: RTX 4060 Ti 16GB running vLLM on Llama 3 8B INT4 for batch, port 8002
- Load balancer routes user chat to 8001, batch workers to 8002
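Whether the 4060 Ti clears the nightly queue is a simple throughput question. A back-of-envelope sizing check – the window length and summary size are assumptions, not measurements:

```python
# Rough sizing for the batch tier; all inputs are illustrative assumptions.
tickets = 100_000
window_s = 8 * 3600          # assumed 8-hour overnight window
out_tokens_per_ticket = 150  # assumed average summary length

needed_tok_s = tickets * out_tokens_per_ticket / window_s
assert round(needed_tok_s) == 521  # ~520 generated tokens/sec sustained
```

Compare that figure against the batch card's measured batched-generation throughput for the INT4 model; if measured throughput clears it with headroom, the older card is enough.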
Cost of this chassis sits below two 5080s. Chat latency is unaffected by batch load – the cards are physically separate. The 4060 Ti never starves the 5080.
For the single-card-versus-multi-card question see single 6000 Pro vs four 4060 Ti, and for workload split logic see SDXL vs LLM split.