
Batch Size Scaling on Multi-GPU LLM Servers

More GPUs mean bigger batches, but the scaling curve is not linear and the right batch size shifts with your topology.

On a single-GPU LLM server, batch size is tuned to saturate the card. On a multi-GPU server, the right batch size depends on the parallelism pattern and where the bottleneck sits. Getting this wrong leaves 30-50% of your dedicated GPU throughput on the table.


Single-GPU

On an RTX 5090 serving Llama 3 8B INT8 through vLLM, throughput scales with batch size up to roughly 32-64 concurrent sequences, then plateaus. Beyond the plateau, KV cache memory becomes the bottleneck: more sequences just evict each other's cached tokens.
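To see where that plateau comes from, here is a back-of-the-envelope KV-cache sizing sketch. The Llama 3 8B attention geometry (32 layers, 8 KV heads, head dim 128) is the published config; the 32 GB VRAM figure, ~8 GB for INT8 weights, the FP16 KV cache, and the 2 GB runtime reserve are assumptions for illustration, not measurements:

```python
# Back-of-the-envelope KV-cache sizing, assuming an FP16 KV cache.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128            # Llama 3 8B GQA config
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # K+V, 2 bytes each

def max_concurrent_seqs(vram_gb, weights_gb, seq_len, reserve_gb=2.0):
    """Sequences whose KV cache fits in leftover VRAM at a given length."""
    free_bytes = (vram_gb - weights_gb - reserve_gb) * 1024**3
    return int(free_bytes // (KV_BYTES_PER_TOKEN * seq_len))

# 32 GB card, ~8 GB of INT8 weights, 2048-token sequences (assumed numbers)
print(max_concurrent_seqs(32, 8, 2048))  # → 88
```

vLLM's own reservations (the gpu-memory-utilization cap, CUDA graphs, activations) eat further into the free pool, which is why the observed plateau lands nearer 32-64 than this raw ceiling.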

Tensor Parallel

Two 5090s in tensor parallel on Llama 3 70B INT4: the right batch size is often lower than single-card tuning would suggest, because the 70B model's KV cache leaves less free memory per card. Don't go too small, though: each forward pass now includes a PCIe all-reduce whose cost is mostly fixed per pass, so smaller batches amortise it worse. Sweet spot: 24-48 concurrent sequences.
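The amortisation argument can be made concrete with a toy model. Both timing constants below (20 ms of compute per decode step, 2 ms of PCIe all-reduce) are hypothetical placeholders, not benchmark numbers:

```python
# Toy model of a TP=2 decode step: a fixed all-reduce cost per pass,
# spread over however many sequences share the batch.
COMPUTE_MS = 20.0  # hypothetical compute time per decode step
COMM_MS = 2.0      # hypothetical fixed PCIe all-reduce cost per step

def per_token_overhead_ms(batch, comm_ms=COMM_MS):
    """Fixed comm cost amortised over the batch: shrinks as batch grows."""
    return comm_ms / batch

def decode_tokens_per_sec(batch, compute_ms=COMPUTE_MS, comm_ms=COMM_MS):
    """One token per sequence per step, divided by step wall time."""
    return batch * 1000.0 / (compute_ms + comm_ms)

for b in (8, 24, 48):
    print(b, per_token_overhead_ms(b), round(decode_tokens_per_sec(b)))
```

The toy holds step time constant; in reality compute time grows with batch and the KV cache fills, which is why throughput flattens instead of climbing forever.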

Data Parallel

Two cards running independent vLLM instances: each card is tuned as if single-GPU. Aggregate throughput is roughly 2x a single card, with no shared bottleneck. Usually the easiest topology to tune because single-GPU learnings transfer directly.

| Topology | Recommended Initial Batch |
|---|---|
| Single GPU, 7-13B | 32-64 |
| Single GPU, 70B INT4 on 96GB | 16-32 |
| TP=2, 70B INT4 | 24-48 |
| DP=2, 7-13B | 32-64 per replica, 64-128 aggregate |
| TP=4, 70B FP8 or Mixtral 8x22B | 32-64 (limited by KV cache) |

Pre-Tuned Multi-GPU Servers

We tune vLLM batch and concurrency parameters to your model and topology before handoff.

Browse GPU Servers

Tuning

Measure rather than guess. Run 16, 32, 64, and 128 concurrent requests with realistic prompt and output lengths. Record tokens/sec per request and in aggregate. Pick the batch that maximises aggregate throughput while keeping per-request latency under your SLA.
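That selection rule is mechanical enough to sketch in a few lines. The sweep numbers below are invented placeholders, not benchmark results:

```python
# Sketch: pick the batch that maximises aggregate tokens/sec while
# keeping p95 per-request latency under the SLA.
def pick_batch(results, latency_sla_s):
    """results: list of (batch, aggregate_tok_s, p95_latency_s) tuples."""
    ok = [r for r in results if r[2] <= latency_sla_s]
    return max(ok, key=lambda r: r[1])[0] if ok else None

measured = [  # hypothetical sweep, illustrative numbers only
    (16, 410, 1.9),
    (32, 690, 2.6),
    (64, 820, 4.8),
    (128, 850, 9.5),
]
print(pick_batch(measured, latency_sla_s=5.0))  # → 64
```

With a 5-second SLA, batch 128's extra aggregate throughput is disqualified by its latency, so 64 wins.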

Two vLLM parameters matter:

--max-num-seqs 64
--max-num-batched-tokens 8192

max-num-seqs caps concurrent sequences. max-num-batched-tokens caps the tokens scheduled per engine step, which mostly constrains prefill. The latter often matters more for prefill-heavy workloads such as RAG with long retrieved context.
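A quick arithmetic sketch of why the token budget bites on RAG workloads: with the 8192-token budget above and a hypothetical 3000-token retrieved context, only two prompts can start prefill in a single step (ignoring vLLM's chunked-prefill option, which splits long prompts across steps):

```python
# How max-num-batched-tokens bounds prefill concurrency (whole-prompt sketch).
def prefills_per_step(max_batched_tokens, prompt_len):
    """Whole prompts that fit into one scheduler step's token budget."""
    return max_batched_tokens // prompt_len

print(prefills_per_step(8192, 3000))  # long RAG prompts: 2 fit per step
print(prefills_per_step(8192, 500))   # short prompts: 16 fit per step
```

This is why raising max-num-seqs alone does little for long-context workloads: the token budget, not the sequence cap, is what throttles prefill.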

See vLLM continuous batching tuning and scaling vLLM across two GPUs.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
