On a single-GPU LLM server, batch size is tuned to saturate the card. On a multi-GPU server, the right batch size depends on the parallelism pattern and where the bottleneck sits. Getting this wrong leaves 30-50% of your GPU throughput on the table.
Single-GPU
On an RTX 5090 serving Llama 3 8B INT8 through vLLM, throughput scales with batch size up to roughly 32-64 concurrent sequences, then plateaus. Beyond the plateau, KV cache memory becomes the bottleneck – adding sequences just makes them evict each other's cached blocks.
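The plateau falls out of simple arithmetic. A rough sketch, using Llama 3 8B's published shape (32 layers, GQA with 8 KV heads of head_dim 128) and an FP16 KV cache; the 8 GB weight footprint for INT8, the 2 GB activation/overhead allowance, and the 4096-token context are assumptions for illustration:

```python
# Back-of-envelope KV-cache budget for Llama 3 8B on a 32 GB card.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

# K and V each store kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

# 32 GB VRAM minus ~8 GB INT8 weights minus ~2 GB activations/overhead.
free_for_kv = (32 - 8 - 2) * 1024**3
seqs_at_4k = free_for_kv // (bytes_per_token * 4096)

print(bytes_per_token)  # 131072 bytes = 128 KiB per token
print(seqs_at_4k)       # 44
```

About 44 full-length sequences fit before the cache is exhausted – squarely inside the 32-64 plateau observed above.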
Tensor Parallel
Two 5090s in tensor-parallel on Llama 3 70B INT4: the right batch size is often lower than single-card experience would suggest. Each forward pass now includes a PCIe all-reduce, which is mostly a fixed cost per pass – smaller batches amortise it worse, so communication alone argues for going bigger. But the 70B model's heavier KV cache footprint and slower per-step latency cap how far you can push. Sweet spot: 24-48 concurrent sequences.
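The amortisation effect can be seen in a toy step-time model. The 4 ms all-reduce cost and 0.9 ms per-sequence decode cost below are illustrative assumptions, not measurements:

```python
# Toy model of a tensor-parallel decode step: fixed communication cost
# plus compute that scales with batch size. Numbers are illustrative.
COMM_MS = 4.0              # assumed fixed PCIe all-reduce cost per pass
COMPUTE_MS_PER_SEQ = 0.9   # assumed per-sequence decode compute cost

def decode_tput(batch: int) -> float:
    """Decode tokens per second at a given batch size (one token/seq/step)."""
    step_ms = batch * COMPUTE_MS_PER_SEQ + COMM_MS
    return batch / step_ms * 1000

for b in (8, 16, 32, 48):
    comm_share = COMM_MS / (b * COMPUTE_MS_PER_SEQ + COMM_MS)
    print(f"batch={b:>2}  tok/s={decode_tput(b):6.0f}  comm share={comm_share:.0%}")
```

At batch 8 the all-reduce is roughly a third of every step; by batch 48 it is under a tenth, which is why undersized batches hurt more in TP than on a single card.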
Data Parallel
Two cards running independent vLLM instances: each card tunes as if it were single-GPU. Aggregate throughput is 2x a single card, with no shared bottleneck. Usually the easiest topology to tune, because single-GPU learnings transfer directly.
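All DP needs on top of the two instances is something to spread requests across them. A minimal sketch, assuming one vLLM instance pinned per GPU on ports 8000 and 8001 (hypothetical endpoints):

```python
from itertools import cycle

# Hypothetical replica endpoints, e.g. two vLLM instances started with
# CUDA_VISIBLE_DEVICES=0 and =1 on separate ports.
ENDPOINTS = ["http://localhost:8000/v1", "http://localhost:8001/v1"]
_rr = cycle(ENDPOINTS)

def next_endpoint() -> str:
    """Round-robin across the independent replicas."""
    return next(_rr)
```

Round-robin is enough when requests are similar in size; skewed workloads do better with least-outstanding-requests routing, but the tuning story per card is unchanged either way.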
| Topology | Recommended Initial Batch |
|---|---|
| Single GPU, 7-13B | 32-64 |
| Single GPU, 70B INT4 on 96GB | 16-32 |
| TP=2, 70B INT4 | 24-48 |
| DP=2, 7-13B | 32-64 per replica, 64-128 aggregate |
| TP=4, 70B FP8 or Mixtral 8x22B | 32-64 (limited by KV cache) |
Tuning
Measure. Run 16, 32, 64, and 128 concurrent requests with a realistic prompt and output length. Record tokens/sec per request and in aggregate. Pick the batch that maximises aggregate throughput while keeping per-request latency under your SLA.
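The selection step reduces to a few lines. A sketch with made-up measurements and an assumed 10 s p95 SLA:

```python
# Pick the batch that maximises aggregate tokens/sec subject to the SLA.
SLA_P95_S = 10.0  # assumed per-request p95 latency budget

# batch -> (aggregate tokens/sec, p95 end-to-end latency in seconds);
# these numbers are illustrative, not measurements.
measured = {16: (1800, 4.1), 32: (3100, 6.8), 64: (4200, 9.5), 128: (4500, 15.2)}

def pick_batch(results: dict, sla: float) -> int:
    """Highest-throughput batch whose p95 latency stays within the SLA."""
    ok = {b: tps for b, (tps, p95) in results.items() if p95 <= sla}
    return max(ok, key=ok.get)

print(pick_batch(measured, SLA_P95_S))  # 64 - batch 128 wins on throughput but blows the SLA
```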
Two vLLM parameters matter:
--max-num-seqs 64
--max-num-batched-tokens 8192
max-num-seqs caps concurrent sequences. max-num-batched-tokens caps prefill work per step. The second often matters more for prefill-heavy workloads (RAG with long retrieved context).
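Why the token cap bites first for RAG: with the 8192-token budget above and an assumed 4000-token retrieved-context prompt, only a couple of prefills fit in any one step, and everything else in the batch waits behind them:

```python
# Per-step prefill budget vs. prompt length (prompt length is assumed).
MAX_NUM_BATCHED_TOKENS = 8192
rag_prompt_tokens = 4000  # typical retrieved-context prompt, assumed

prefills_per_step = MAX_NUM_BATCHED_TOKENS // rag_prompt_tokens
print(prefills_per_step)  # 2 - long prompts eat the per-step budget fast
```

If your prompts are long, raising max-num-batched-tokens (at the cost of per-step latency) usually moves the needle more than raising max-num-seqs.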
See vLLM continuous batching tuning and scaling vLLM across two GPUs.