On a single-GPU LLM server, batch size is tuned to saturate the card. On a multi-GPU server, the right batch size depends on the parallelism pattern and where the bottleneck sits. Getting this wrong leaves 30-50% of your GPU throughput on the table.
Single-GPU
On an RTX 5090 serving Llama 3 8B INT8 through vLLM, throughput scales with batch size up to roughly 32-64 concurrent sequences, then plateaus. Beyond the plateau, KV cache memory becomes the bottleneck – adding sequences just makes them evict each other's cached blocks.
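The plateau falls out of simple arithmetic. A rough sketch, using Llama 3 8B's published shape (32 layers, GQA with 8 KV heads of head_dim 128) and an FP16 KV cache; the 8 GB weight footprint for INT8, the 2 GB activation/overhead allowance, and the 4096-token context are assumptions for illustration:

```python
# Back-of-envelope KV-cache budget for Llama 3 8B on a 32 GB card.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

# K and V each store kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

# 32 GB VRAM minus ~8 GB INT8 weights minus ~2 GB activations/overhead.
free_for_kv = (32 - 8 - 2) * 1024**3
seqs_at_4k = free_for_kv // (bytes_per_token * 4096)

print(bytes_per_token)  # 131072 bytes = 128 KiB per token
print(seqs_at_4k)       # 44
```

About 44 full-length sequences fit before the cache is exhausted – squarely inside the 32-64 plateau observed above.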
Tensor Parallel
Two 5090s in tensor-parallel on Llama 3 70B INT4: the right batch size is often lower than single-card experience would suggest. Each forward pass now includes a PCIe all-reduce, which is mostly a fixed cost per pass – smaller batches amortise it worse, so communication alone argues for going bigger. But the 70B model's heavier KV cache footprint and slower per-step latency cap how far you can push. Sweet spot: 24-48 concurrent sequences.
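The amortisation effect can be seen in a toy step-time model. The 4 ms all-reduce cost and 0.9 ms per-sequence decode cost below are illustrative assumptions, not measurements:

```python
# Toy model of a tensor-parallel decode step: fixed communication cost
# plus compute that scales with batch size. Numbers are illustrative.
COMM_MS = 4.0              # assumed fixed PCIe all-reduce cost per pass
COMPUTE_MS_PER_SEQ = 0.9   # assumed per-sequence decode compute cost

def decode_tput(batch: int) -> float:
    """Decode tokens per second at a given batch size (one token/seq/step)."""
    step_ms = batch * COMPUTE_MS_PER_SEQ + COMM_MS
    return batch / step_ms * 1000

for b in (8, 16, 32, 48):
    comm_share = COMM_MS / (b * COMPUTE_MS_PER_SEQ + COMM_MS)
    print(f"batch={b:>2}  tok/s={decode_tput(b):6.0f}  comm share={comm_share:.0%}")
```

At batch 8 the all-reduce is roughly a third of every step; by batch 48 it is under a tenth, which is why undersized batches hurt more in TP than on a single card.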
Data Parallel
Two cards running independent vLLM instances: each card tunes as if it were single-GPU. Aggregate throughput is 2x a single card, with no shared bottleneck. Usually the easiest topology to tune, because single-GPU learnings transfer directly.
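All DP needs on top of the two instances is something to spread requests across them. A minimal sketch, assuming one vLLM instance pinned per GPU on ports 8000 and 8001 (hypothetical endpoints):

```python
from itertools import cycle

# Hypothetical replica endpoints, e.g. two vLLM instances started with
# CUDA_VISIBLE_DEVICES=0 and =1 on separate ports.
ENDPOINTS = ["http://localhost:8000/v1", "http://localhost:8001/v1"]
_rr = cycle(ENDPOINTS)

def next_endpoint() -> str:
    """Round-robin across the independent replicas."""
    return next(_rr)
```

Round-robin is enough when requests are similar in size; skewed workloads do better with least-outstanding-requests routing, but the tuning story per card is unchanged either way.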
| Topology | Recommended Initial Batch |
|---|---|
| Single GPU, 7-13B | 32-64 |
| Single GPU, 70B INT4 on 96GB | 16-32 |
| TP=2, 70B INT4 | 24-48 |
| DP=2, 7-13B | 32-64 per replica, 64-128 aggregate |
| TP=4, 70B FP8 or Mixtral 8x22B | 32-64 (limited by KV cache) |
Tuning
Measure. Run 16, 32, 64, and 128 concurrent requests with a realistic prompt and output length. Record tokens/sec per request and in aggregate. Pick the batch that maximises aggregate throughput while keeping per-request latency under your SLA.
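The selection step reduces to a few lines. A sketch with made-up measurements and an assumed 10 s p95 SLA:

```python
# Pick the batch that maximises aggregate tokens/sec subject to the SLA.
SLA_P95_S = 10.0  # assumed per-request p95 latency budget

# batch -> (aggregate tokens/sec, p95 end-to-end latency in seconds);
# these numbers are illustrative, not measurements.
measured = {16: (1800, 4.1), 32: (3100, 6.8), 64: (4200, 9.5), 128: (4500, 15.2)}

def pick_batch(results: dict, sla: float) -> int:
    """Highest-throughput batch whose p95 latency stays within the SLA."""
    ok = {b: tps for b, (tps, p95) in results.items() if p95 <= sla}
    return max(ok, key=ok.get)

print(pick_batch(measured, SLA_P95_S))  # 64 - batch 128 wins on throughput but blows the SLA
```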
Two vLLM parameters matter:
--max-num-seqs 64
--max-num-batched-tokens 8192
max-num-seqs caps concurrent sequences. max-num-batched-tokens caps prefill work per step. The second often matters more for prefill-heavy workloads (RAG with long retrieved context).
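Why the token cap bites first for RAG: with the 8192-token budget above and an assumed 4000-token retrieved-context prompt, only a couple of prefills fit in any one step, and everything else in the batch waits behind them:

```python
# Per-step prefill budget vs. prompt length (prompt length is assumed).
MAX_NUM_BATCHED_TOKENS = 8192
rag_prompt_tokens = 4000  # typical retrieved-context prompt, assumed

prefills_per_step = MAX_NUM_BATCHED_TOKENS // rag_prompt_tokens
print(prefills_per_step)  # 2 - long prompts eat the per-step budget fast
```

If your prompts are long, raising max-num-batched-tokens (at the cost of per-step latency) usually moves the needle more than raising max-num-seqs.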
See vLLM continuous batching tuning and scaling vLLM across two GPUs.