
Batch Size Impact on LLM Tokens/sec by GPU

Benchmark data showing how batch size affects LLM inference throughput across six GPUs, with total and per-request tokens per second analysis.

Batch Size Benchmark Overview

Batch size, the number of requests a server processes concurrently, is one of the most important tuning parameters for LLM serving. Increasing batch size improves total throughput (tokens/sec summed across all requests) but increases per-request latency. Running these tests on a dedicated GPU server with consistent hardware eliminates cloud-provider variability.

We benchmarked LLaMA 3 8B Instruct in INT4 (AWQ) using vLLM on GigaGPU servers with continuous batching enabled. Each request uses a 256-token prompt with 128 tokens of output. For precision-level comparisons, see our FP16 vs INT8 vs INT4 speed benchmark.

Throughput by Batch Size and GPU

| GPU | VRAM | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Batch 16 (tok/s) |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | 28 | 62 | OOM | OOM |
| RTX 4060 | 8 GB | 48 | 125 | 165 | OOM |
| RTX 4060 Ti | 16 GB | 58 | 168 | 280 | 385 |
| RTX 3090 | 24 GB | 72 | 215 | 370 | 520 |
| RTX 5080 | 16 GB | 105 | 310 | 505 | 680 |
| RTX 5090 | 32 GB | 138 | 420 | 710 | 980 |

Total throughput scales roughly 2-3x from batch 1 to batch 4, and continues increasing up to batch 16 where VRAM allows. The RTX 3050 runs out of memory at batch 8 because each concurrent request’s KV cache adds roughly 0.5-1GB. The RTX 5090 at batch 16 delivers 980 total tokens per second, nearly 7x its single-request throughput.
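The KV-cache arithmetic behind that OOM behaviour can be sketched as follows. This is a rough estimate, assuming the published LLaMA 3 8B architecture (32 layers, 8 grouped-query KV heads, head dimension 128) with an FP16 cache, and assuming the server reserves cache space up to a 4K context per request; actual usage depends on context length and vLLM's block allocator:

```python
# Rough per-request KV-cache estimate for LLaMA 3 8B (assumed architecture:
# 32 layers, 8 KV heads via GQA, head_dim 128, FP16 cache entries).
N_LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(tokens: int) -> int:
    # 2x for the separate K and V tensors at every layer.
    return 2 * N_LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * tokens

per_token_kb = kv_cache_bytes(1) / 1024          # 128 KB per token
per_request_mb = kv_cache_bytes(4096) / 1024**2  # ~512 MB at a 4K context

print(f"{per_token_kb:.0f} KB/token, {per_request_mb:.0f} MB per 4K-context request")
```

At roughly half a gigabyte per 4K-context request, eight concurrent requests alone would want ~4 GB of cache on top of the ~5 GB INT4 weights, which is why the 6 GB card falls over first.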

Per-Request Latency Impact

Higher batch sizes increase total throughput but slow down individual requests. Below we show per-request tokens per second (total throughput divided by batch size).

| Batch Size | RTX 3090 per-req (tok/s) | RTX 5080 per-req (tok/s) | RTX 5090 per-req (tok/s) |
|---|---|---|---|
| 1 | 72 | 105 | 138 |
| 4 | 54 | 78 | 105 |
| 8 | 46 | 63 | 89 |
| 16 | 33 | 43 | 61 |
| 32 | 22 | 30 | 42 |
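The per-request figures are simply total throughput divided by batch size. A quick sketch using the RTX 3090 totals from the first table (values rounded as the table rounds them):

```python
# Total throughput (tok/s) for the RTX 3090 at each batch size,
# taken from the benchmark table above.
rtx3090_total = {1: 72, 4: 215, 8: 370, 16: 520}

# Round half up so 520 / 16 = 32.5 matches the table's 33.
per_request = {b: int(t / b + 0.5) for b, t in rtx3090_total.items()}
print(per_request)  # {1: 72, 4: 54, 8: 46, 16: 33}
```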

At batch 16, per-request speed drops to roughly 30-45% of the single-request speed. For interactive chat applications, keep batch sizes under 8 to maintain responsive latency. For batch processing pipelines where total throughput matters more than individual speed, maximise batch size up to VRAM limits. See the tokens per second benchmark for single-request comparisons across more models.
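In wall-clock terms, per-request tok/s translates directly into how long a user waits for a reply. A small sketch using the RTX 3090's per-request rates and the 128-token output length from these runs (decode time only; prompt prefill is ignored):

```python
# Time to stream a 128-token reply at the RTX 3090's per-request rates
# (decode only; prompt-prefill time is not included).
OUTPUT_TOKENS = 128
per_request_tok_s = {1: 72, 8: 46, 16: 33}

for batch, rate in per_request_tok_s.items():
    print(f"batch {batch:>2}: {OUTPUT_TOKENS / rate:.1f} s per reply")
```

Going from batch 1 to batch 16 roughly doubles the wait, from under 2 seconds to nearly 4, which is the concrete cost behind the "keep interactive batches under 8" guidance.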

Cost Efficiency Analysis

| GPU | Max-Batch Total tok/s | Approx. Monthly Cost | Total tok/s per £ |
|---|---|---|---|
| RTX 3050 | 62 (batch 4) | ~£45 | 1.38 |
| RTX 4060 | 165 (batch 8) | ~£60 | 2.75 |
| RTX 4060 Ti | 385 (batch 16) | ~£75 | 5.13 |
| RTX 3090 | 520 (batch 16) | ~£110 | 4.73 |
| RTX 5080 | 680 (batch 16) | ~£160 | 4.25 |
| RTX 5090 | 980 (batch 16) | ~£250 | 3.92 |
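The tok/s-per-pound column is just max-batch throughput divided by monthly cost; ranking the table programmatically makes the ordering explicit:

```python
# (max-batch total tok/s, approx. monthly cost in GBP) from the table above.
gpus = {
    "RTX 3050": (62, 45),     "RTX 4060": (165, 60),
    "RTX 4060 Ti": (385, 75), "RTX 3090": (520, 110),
    "RTX 5080": (680, 160),   "RTX 5090": (980, 250),
}

# Sort by throughput per pound, best value first.
by_value = sorted(gpus.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (tok_s, cost) in by_value:
    print(f"{name:<12} {tok_s / cost:.2f} tok/s per £")
```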

The RTX 4060 Ti dominates cost efficiency for batched serving, thanks to its 16GB enabling batch 16 at a low monthly cost. The RTX 3090 provides 35% more absolute throughput for a moderate cost increase.

GPU Recommendations

  • Single-user chat: RTX 4060 — 48 tok/s at batch 1 is responsive, and batch 8 handles a few concurrent users.
  • Multi-user serving: RTX 4060 Ti — best cost efficiency for batched workloads with 16GB of KV cache space.
  • Production API: RTX 5080 — 680 tok/s total at batch 16 serves many concurrent requests.
  • High concurrency: RTX 5090 — 980 tok/s with room for batch 32+ thanks to 32GB VRAM.

For quantisation trade-offs, see the FP16 vs INT8 vs INT4 comparison. For larger models, check the RTX 5090 LLaMA 3 70B guide. For general GPU selection, see the best GPU for LLM inference guide or our cheapest GPU for inference analysis. Browse all results in the Benchmarks category.

Conclusion

Batch size is a powerful lever for LLM serving throughput. Moving from batch 1 to batch 16 can deliver 5-7x more total tokens per second on the same hardware. The trade-off is per-request latency, which drops to roughly a third of single-request speed at batch 16. For production deployments, the right batch size depends on your concurrency needs and acceptable latency. VRAM capacity determines the maximum viable batch size, making cards like the RTX 5090 and RTX 3090 ideal for high-concurrency serving.
