
Model Sharding vs Batch Scaling – Which Comes First

When your workload outgrows one GPU, do you split the model or run more replicas? Most teams get this decision wrong.

Teams hitting capacity on a single dedicated GPU server often reach for model sharding (tensor parallel) when batch scaling (data parallel) would serve them better. The instinct “use both GPUs to go faster” is right; the assumption “sharding is how you do that” is usually wrong.


The Right Question

Instead of “how do I scale to two GPUs,” ask “does my model fit on one GPU at the precision I want, with the context length I need, and with enough headroom to batch concurrent requests.” If yes, batch scaling (more replicas) beats sharding. If no, you must shard to even run.
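A quick way to answer that question is to compare the model's weight footprint, plus headroom for KV cache and activations, against a card's VRAM. A minimal sketch; the bytes-per-parameter figures and the 20% headroom fraction are rule-of-thumb assumptions, not measurements:

```python
def fits_on_one_gpu(params_billions, bytes_per_param, vram_gb, kv_headroom=0.2):
    """Rough fit check: weight footprint plus a KV-cache/activation headroom fraction."""
    weights_gb = params_billions * bytes_per_param  # billions of params -> GB
    needed_gb = weights_gb * (1 + kv_headroom)
    return needed_gb <= vram_gb

# 7B model at FP16 (2 bytes/param) on a 32 GB card: fits, so batch-scale.
print(fits_on_one_gpu(7, 2.0, 32))   # True
# 70B at INT4 (0.5 bytes/param) on the same card: does not fit, so shard.
print(fits_on_one_gpu(70, 0.5, 32))  # False
```

If the check passes with room to spare, that spare VRAM is exactly what lets one replica batch concurrent requests.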

When Sharding Is Required

Sharding is not an optimisation – it is a capacity tool. You shard when the model cannot fit on your largest available GPU. Llama 3 70B INT4 is ~40 GB. On a single 32 GB RTX 5090 it does not fit. You must shard (or step up to a 96 GB card). Sharding adds interconnect overhead, but that is the price of hosting the model at all.
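The arithmetic behind that example, as a sketch (half a byte per INT4 weight; the ~5 GB runtime/KV overhead figure is an illustrative assumption):

```python
params = 70e9                     # Llama 3 70B
weight_gb = params * 0.5 / 1e9    # INT4: 0.5 bytes per parameter -> 35 GB of weights
total_gb = weight_gb + 5          # assume ~5 GB runtime/KV overhead -> ~40 GB total
print(total_gb, total_gb <= 32)   # 40.0 False: does not fit a 32 GB RTX 5090
```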

When Batch Scaling Wins

If your model fits on one card, run N replicas on N cards and load-balance. Two properties make this better than sharding:

  • No interconnect tax – each replica is self-contained, zero PCIe traffic during inference.
  • Workload isolation – one replica stalling or rebooting does not affect others.

On a two-card server serving a 7B model, data parallel throughput is almost perfectly 2x a single card. Tensor parallel throughput on the same hardware is typically 1.5-1.7x: same cards, less efficiency.
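The load balancer in front of the replicas can be as simple as round-robin. A minimal sketch; the endpoints are hypothetical, one inference-server replica pinned to each GPU:

```python
from itertools import cycle

# Hypothetical endpoints: one replica per GPU, e.g. two vLLM processes.
replicas = cycle(["http://localhost:8000", "http://localhost:8001"])

def next_replica():
    """Round-robin selection. Each replica is self-contained on its own GPU,
    so requests never generate cross-GPU PCIe traffic."""
    return next(replicas)

for _ in range(4):
    print(next_replica())  # alternates 8000, 8001, 8000, 8001
```

In practice you would put nginx or any HTTP load balancer here; the point is that the replicas share nothing.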

Hybrid

Four-card server, model needs two cards: one natural answer is two tensor-parallel-2 replicas, one per GPU pair, behind a load balancer. You pay the interconnect tax within each pair (required) but get linear scaling between pairs. This is usually the best shape for 4-8 GPU servers hosting models larger than a single card.
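Pinning each replica to its GPU pair is done with CUDA_VISIBLE_DEVICES. A sketch of the two launch commands, assuming vLLM's `vllm serve` CLI and its `--tensor-parallel-size` flag; the model name, GPU pairs, and ports are placeholders:

```python
# Two TP=2 replicas on a 4-GPU box, one per GPU pair, on separate ports.
pairs = [("0,1", 8000), ("2,3", 8001)]

commands = [
    f"CUDA_VISIBLE_DEVICES={gpus} vllm serve my-model "
    f"--tensor-parallel-size 2 --port {port}"
    for gpus, port in pairs
]
for cmd in commands:
    print(cmd)
```

Each process only sees its own pair, so the two replicas cannot contend for the same cards.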

Model Fit        GPUs   Best Pattern
Fits on 1 GPU    2      Data parallel, 2 replicas
Fits on 1 GPU    4      Data parallel, 4 replicas
Needs 2 GPUs     2      Tensor parallel, TP=2
Needs 2 GPUs     4      Hybrid, TP=2 × DP=2
Needs 4 GPUs     4      TP=4 (or step up to a 96 GB card)
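The whole decision table reduces to one rule: shard only as much as the model requires, then replicate across whatever capacity is left. A sketch of that rule as a helper function (the return strings are illustrative labels, not tool output):

```python
def best_pattern(gpus_needed: int, gpus_available: int) -> str:
    """Shard only as much as required, then data-parallel the remainder."""
    if gpus_needed > gpus_available:
        return "does not fit: add GPUs or step up to a larger card"
    replicas = gpus_available // gpus_needed
    if gpus_needed == 1:
        return f"data parallel, {replicas} replicas"
    if replicas == 1:
        return f"tensor parallel, TP={gpus_needed}"
    return f"hybrid, TP={gpus_needed} x DP={replicas}"

print(best_pattern(1, 4))  # data parallel, 4 replicas
print(best_pattern(2, 4))  # hybrid, TP=2 x DP=2
print(best_pattern(4, 4))  # tensor parallel, TP=4
```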

Right-Sized Multi-GPU Servers

We size servers to your model and explain the topology trade-offs so you do not over-spend.

Browse GPU Servers

See also data vs tensor parallel in vLLM and four-GPU architecture patterns for specific vLLM configuration.


