Teams hitting capacity on a single dedicated GPU server often reach for model sharding (tensor parallel) when batch scaling (data parallel) would serve them better. The instinct “use both GPUs to go faster” is right; the assumption “sharding is how you do that” is usually wrong.
The Right Question
Instead of “how do I scale to two GPUs,” ask “does my model fit on one GPU at the precision I want, with the context length I need, and with enough headroom to batch concurrent requests.” If yes, batch scaling (more replicas) beats sharding. If no, you must shard to even run.
When Sharding Is Required
Sharding is not an optimisation – it is a capacity tool. You shard when the model cannot fit on your largest available GPU. Llama 3 70B at INT4 is ~40 GB; on a single 32 GB RTX 5090 it does not fit, so you must shard (or step up to a 96 GB card). Sharding adds interconnect overhead, but that is the price of hosting the model at all.
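The fit question is simple arithmetic: weight bytes are parameter count times bits per weight, plus headroom for KV cache, activations, and runtime buffers. A minimal sketch, where the function name and the flat overhead figure are illustrative assumptions, not vendor specs:

```python
def fits_on_one_gpu(params_billions, bits_per_weight, vram_gb, overhead_gb=6.0):
    """Rough single-GPU fit check.

    Weights dominate, but leave headroom for KV cache, activations,
    and runtime buffers (overhead_gb is a rough placeholder, not a spec).
    """
    weight_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb <= vram_gb

# Llama 3 70B at INT4: ~35 GB of weights alone -- over a 32 GB card.
print(fits_on_one_gpu(70, 4, 32))   # False
# A 7B model at FP16 (~14 GB) fits comfortably on the same card.
print(fits_on_one_gpu(7, 16, 32))   # True
```

Real serving engines budget KV cache per concurrent request, so treat the overhead constant as a floor, not a ceiling.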
When Batch Scaling Wins
If your model fits on one card, run N replicas on N cards and load-balance. Two observations make this better than sharding:
- No interconnect tax – each replica is self-contained, zero PCIe traffic during inference.
- Workload isolation – one replica stalling or rebooting does not affect others.
On a two-card server serving a 7B model, data-parallel throughput is almost exactly 2x a single card. Tensor-parallel throughput is typically 1.5-1.7x. Same hardware, less efficiency.
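The load-balancing layer in front of the replicas can be as simple as round-robin. A minimal sketch, with hypothetical replica endpoints:

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin dispatch across self-contained replicas.

    Endpoints are hypothetical; in practice this role is usually played
    by nginx, HAProxy, or a cloud load balancer.
    """
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["http://gpu0:8000", "http://gpu1:8001"])
print([lb.next_endpoint() for _ in range(4)])
# ['http://gpu0:8000', 'http://gpu1:8001', 'http://gpu0:8000', 'http://gpu1:8001']
```

Because each replica is independent, pulling one out of rotation (for a restart or an upgrade) is just removing its endpoint from the list.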
Hybrid
Four-card server, model needs two cards: the natural answer is two TP=2 replicas, one per GPU pair, behind a load balancer. You pay the interconnect tax within each pair (unavoidable) but get linear scaling across pairs. This is usually the best shape for 4-8 GPU servers hosting models larger than a single card.
| Model Fit | GPUs | Best Pattern |
|---|---|---|
| Fits on 1 GPU | 2 | Data parallel, 2 replicas |
| Fits on 1 GPU | 4 | Data parallel, 4 replicas |
| Needs 2 GPUs | 2 | Tensor parallel = 2 |
| Needs 2 GPUs | 4 | TP=2 × DP=2 hybrid |
| Needs 4 GPUs | 4 | TP=4 (or step up to 96 GB card) |
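The table reduces to one rule: use the smallest TP that fits the model, then replicate. A sketch of that decision logic, with assumed function and parameter names:

```python
def serving_pattern(gpus_needed_per_replica, gpus_available):
    """Map model fit and GPU count to a serving pattern (illustrative).

    gpus_needed_per_replica: smallest GPU count the model fits on (1, 2, 4, ...)
    gpus_available: GPUs on the server
    """
    if gpus_needed_per_replica > gpus_available:
        return "does not fit -- larger GPUs (or more of them) required"
    replicas = gpus_available // gpus_needed_per_replica
    if gpus_needed_per_replica == 1:
        # Fits on one card: pure data parallel, no interconnect tax.
        return f"data parallel, {replicas} replicas"
    if replicas == 1:
        # Needs every card just to run: sharding is mandatory.
        return f"tensor parallel, TP={gpus_needed_per_replica}"
    # Shard as little as possible, then replicate the shards.
    return f"hybrid, TP={gpus_needed_per_replica} x DP={replicas}"

print(serving_pattern(1, 4))  # data parallel, 4 replicas
print(serving_pattern(2, 4))  # hybrid, TP=2 x DP=2
print(serving_pattern(4, 4))  # tensor parallel, TP=4
```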
Right-Sized Multi-GPU Servers
We size servers to your model and explain the topology trade-offs so you do not overspend.
Browse GPU Servers

See also data vs tensor parallel in vLLM and four-GPU architecture patterns for specific vLLM configuration.