Four-GPU dedicated servers on our hosting are the sweet spot between single-card simplicity and rack-scale complexity. The temptation is to run one large model across all four with tensor parallelism. That is usually the wrong call. Here are the three topologies that actually pay off.
Topologies
- Tensor parallel over all four
- Data parallel with four independent replicas
- Mixed tensor plus data parallel
- Which to pick
Tensor Parallel Over All Four
One model, split across all four GPUs. Aggregate memory is 4x a single card. Good for models that genuinely need more memory than any single GPU provides. On four RTX 4060 Tis (64 GB aggregate) you can host a 70B INT4 model with headroom. The catch: every forward pass now crosses three PCIe hops. Throughput at batch 1 is worse than what a single 32 GB card could deliver on a smaller model.
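As a rough back-of-envelope for the memory claim: INT4 weights take about half a byte per parameter. The 20% overhead factor below (for KV cache, activations, and quantisation scales) is an assumption for illustration, not a measured number:

```python
def int4_weight_gb(params_b: float, overhead: float = 1.2) -> float:
    """Approximate serving memory for an INT4-quantised model.

    0.5 bytes per parameter, plus an assumed ~20% for KV cache,
    activations, and quantisation scales (rule of thumb only).
    """
    return params_b * 0.5 * overhead

aggregate_gb = 4 * 16               # four 16 GB RTX 4060 Tis
need_gb = int4_weight_gb(70)        # ~42 GB for a 70B model
print(f"need ~{need_gb:.0f} GB of {aggregate_gb} GB aggregate")
```

On these assumptions a 70B INT4 model wants roughly 42 GB, which no single 16 GB or 32 GB card can hold but the 64 GB aggregate can, with room left for batching.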
Data Parallel – Four Independent Replicas
Load the same model on each card independently and front the four instances with a load balancer. Four requests run in parallel with no interconnect overhead, so aggregate throughput scales close to 4x a single card. The requirement is that the model fits on one GPU. For a 7-13B class model on four 5080s, this pattern beats tensor parallel by 40-60% in aggregate throughput.
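A minimal round-robin sketch of the front end, assuming four replicas on made-up local ports (a real deployment would use nginx, HAProxy, or similar rather than hand-rolled routing):

```python
from itertools import cycle

# Hypothetical endpoints: one replica per GPU, ports are illustrative.
replicas = cycle([f"http://127.0.0.1:{8000 + i}/v1" for i in range(4)])

def next_backend() -> str:
    """Round-robin across the four independent replicas."""
    return next(replicas)
```

Round-robin is enough when requests are similar in length; for mixed workloads, least-connections balancing avoids one replica queueing behind a long generation.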
Mixed TP Plus DP
Two pairs of GPUs, tensor parallel within each pair, data parallel across pairs. Good when the model needs two cards to fit but you want more throughput than a single TP-2 pair delivers. On four 5090s: two TP-2 pairs each running 70B INT4, load balanced. Higher aggregate throughput than TP-4 at the cost of running two vLLM instances.
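One way to sketch the pairing: pin each serving instance to its own GPU pair with `CUDA_VISIBLE_DEVICES` and give each instance its own port. The port numbers here are made up; `--tensor-parallel-size` and `--port` are standard vLLM serve flags:

```python
def pair_configs(n_gpus: int = 4, tp: int = 2, base_port: int = 8000):
    """Build env/args for one serving instance per TP group.

    With the defaults this yields two TP-2 instances: GPUs 0,1 on
    port 8000 and GPUs 2,3 on port 8001.
    """
    configs = []
    for i in range(n_gpus // tp):
        gpus = ",".join(str(g) for g in range(i * tp, (i + 1) * tp))
        configs.append({
            "env": {"CUDA_VISIBLE_DEVICES": gpus},
            "args": ["--tensor-parallel-size", str(tp),
                     "--port", str(base_port + i)],
        })
    return configs
```

The same helper covers TP-4 (`tp=4`, one instance) and DP-4 (`tp=1`, four instances), which makes it easy to benchmark all three topologies on the same box.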
| Pattern | Max model footprint | Throughput | Complexity |
|---|---|---|---|
| TP-4 | 4x card | Lower per request, fine at high batch | Low |
| DP-4 | 1x card | Highest, scales near-linearly | Medium (load balancer) |
| TP-2 × DP-2 | 2x card | Middle | Highest |
Which to Pick
If your model fits on one card: data parallel. Every time. Tensor parallel on a model that already fits is almost always wasted interconnect traffic.
If your model needs two cards: consider TP-2 with data parallel across two pairs.
If your model needs all four cards: you are in TP-4 territory. Also consider whether a single 6000 Pro with 96 GB would serve you better – see single 6000 Pro vs four 4060 Ti.
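The decision rule above can be condensed into a hypothetical helper. The thresholds are the simple fits-on-N-cards test; they ignore KV cache headroom, so in practice you want the model footprint comfortably under each limit:

```python
def pick_topology(model_gb: float, card_gb: float, n_gpus: int = 4) -> str:
    """Pick a four-GPU serving topology from model and card memory."""
    if model_gb <= card_gb:
        return "DP-4"           # fits on one card: replicate it
    if model_gb <= 2 * card_gb:
        return "TP-2 x DP-2"    # needs two cards: pair up
    if model_gb <= n_gpus * card_gb:
        return "TP-4"           # needs all four cards
    return "does not fit"       # look at bigger cards or offloading
```

For example, a ~42 GB 70B INT4 model lands on TP-2 x DP-2 with 32 GB cards but falls back to TP-4 with 16 GB cards.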
For specific NCCL tuning on these topologies see NCCL tuning.