Batch Size Benchmark Overview
Batch size, the number of requests processed concurrently, is one of the most important tuning parameters for LLM serving. Increasing batch size raises total throughput (tokens/sec across all requests) at the cost of higher per-request latency. Running these tests on a dedicated GPU server with fixed hardware removes cloud-provider variability.
We benchmarked LLaMA 3 8B Instruct in INT4 (AWQ) using vLLM on GigaGPU servers with continuous batching enabled. Each request uses a 256-token prompt with 128 tokens of output. For precision-level comparisons, see our FP16 vs INT8 vs INT4 speed benchmark.
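A setup along these lines can be reproduced with vLLM's built-in server. The model repository and exact flag values below are illustrative assumptions, not the precise configuration used for these runs:

```shell
# Serve a 4-bit AWQ build of LLaMA 3 8B Instruct. vLLM enables continuous
# batching by default; --max-num-seqs caps how many requests run concurrently.
# Model path and flag values are illustrative, not the benchmark config.
vllm serve casperhansen/llama-3-8b-instruct-awq \
  --quantization awq \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90
```

Lowering `--max-num-seqs` trades total throughput for per-request latency, which is exactly the trade-off measured below.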
Throughput by Batch Size and GPU
| GPU | VRAM | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Batch 16 (tok/s) |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | 28 | 62 | OOM | OOM |
| RTX 4060 | 8 GB | 48 | 125 | 165 | OOM |
| RTX 4060 Ti | 16 GB | 58 | 168 | 280 | 385 |
| RTX 3090 | 24 GB | 72 | 215 | 370 | 520 |
| RTX 5080 | 16 GB | 105 | 310 | 505 | 680 |
| RTX 5090 | 32 GB | 138 | 420 | 710 | 980 |
Total throughput scales roughly 2-3x from batch 1 to batch 4, and continues increasing up to batch 16 where VRAM allows. The RTX 3050 runs out of memory at batch 8 because each concurrent request’s KV cache adds roughly 0.5-1 GB. The RTX 5090 at batch 16 delivers 980 total tokens per second, just over 7x its single-request throughput.
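The per-request KV-cache figure can be sanity-checked from LLaMA 3 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128). A minimal sketch, assuming FP16 KV-cache entries:

```python
def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # One K and one V vector per layer, per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()   # 131072 bytes = 128 KiB per token
full_context = per_token * 8192          # reserving the full 8K context
print(per_token, full_context / 2**30)   # 131072 1.0
```

A request that reserves the full 8K context thus accounts for about 1 GiB of KV cache; shorter reservations land in the 0.5-1 GB range quoted above.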
Per-Request Latency Impact
Higher batch sizes increase total throughput but slow down individual requests. Below we show per-request tokens per second (total throughput divided by batch size).
| Batch Size | RTX 3090 per-req (tok/s) | RTX 5080 per-req (tok/s) | RTX 5090 per-req (tok/s) |
|---|---|---|---|
| 1 | 72 | 105 | 138 |
| 4 | 54 | 78 | 105 |
| 8 | 46 | 63 | 89 |
| 16 | 33 | 43 | 61 |
| 32 | 22 | 30 | 42 |
At batch 16, per-request speed drops to roughly 40-45% of the single-request speed. For interactive chat applications, keep batch sizes under 8 to maintain responsive latency. For batch processing pipelines where total throughput matters more than individual speed, maximise batch size up to VRAM limits. See the tokens per second benchmark for single-request comparisons across more models.
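This trade-off can be turned into a simple selection rule: given measured total throughput at each batch size, pick the largest batch whose per-request speed still meets a latency target. A sketch using the RTX 3090 figures from the tables above (batch 32 total derived from its per-request entry):

```python
def max_batch_for_target(totals, min_per_request_toks):
    """Largest batch size whose per-request tok/s meets the target."""
    viable = [b for b, total in totals.items() if total / b >= min_per_request_toks]
    return max(viable) if viable else None

# Measured total tok/s for the RTX 3090 at each batch size
rtx3090_total = {1: 72, 4: 215, 8: 370, 16: 520, 32: 704}
print(max_batch_for_target(rtx3090_total, 40))  # 8  (46 tok/s per request)
print(max_batch_for_target(rtx3090_total, 20))  # 32 (22 tok/s per request)
```

An interactive chat target of ~40 tok/s per request caps the RTX 3090 at batch 8; a batch pipeline tolerating ~20 tok/s can run batch 32.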
Cost Efficiency Analysis
| GPU | Max-Batch Total tok/s | Approx. Monthly Cost | Total tok/s per £/month |
|---|---|---|---|
| RTX 3050 | 62 (batch 4) | ~£45 | 1.38 |
| RTX 4060 | 165 (batch 8) | ~£60 | 2.75 |
| RTX 4060 Ti | 385 (batch 16) | ~£75 | 5.13 |
| RTX 3090 | 520 (batch 16) | ~£110 | 4.73 |
| RTX 5080 | 680 (batch 16) | ~£160 | 4.25 |
| RTX 5090 | 980 (batch 16) | ~£250 | 3.92 |
The RTX 4060 Ti dominates cost efficiency for batched serving, thanks to its 16GB enabling batch 16 at a low monthly cost. The RTX 3090 provides 35% more absolute throughput for a moderate cost increase.
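The efficiency column is simply peak total throughput divided by monthly cost. Recomputing it from the table's figures:

```python
# (peak total tok/s at max viable batch, approx. monthly cost in £)
gpus = {
    "RTX 3050": (62, 45), "RTX 4060": (165, 60), "RTX 4060 Ti": (385, 75),
    "RTX 3090": (520, 110), "RTX 5080": (680, 160), "RTX 5090": (980, 250),
}
efficiency = {name: round(toks / cost, 2) for name, (toks, cost) in gpus.items()}
best = max(efficiency, key=efficiency.get)
print(best, efficiency[best])  # RTX 4060 Ti 5.13
```

Note that efficiency peaks mid-range: the RTX 4060 Ti's 16 GB unlocks batch 16 at a budget price, while the faster cards add cost faster than throughput.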
GPU Recommendations
- Single-user chat: RTX 4060 — 48 tok/s at batch 1 is responsive, and batch 8 handles a few concurrent users.
- Multi-user serving: RTX 4060 Ti — best cost efficiency for batched workloads with 16GB of KV cache space.
- Production API: RTX 5080 — 680 tok/s total at batch 16 serves many concurrent requests.
- High concurrency: RTX 5090 — 980 tok/s with room for batch 32+ thanks to 32GB VRAM.
For quantisation trade-offs, see the FP16 vs INT8 vs INT4 comparison. For larger models, check the RTX 5090 LLaMA 3 70B guide. For general GPU selection, see the best GPU for LLM inference guide or our cheapest GPU for inference analysis. Browse all results in the Benchmarks category.
Conclusion
Batch size is a powerful lever for LLM serving throughput. Moving from batch 1 to batch 16 can deliver 5-7x more total tokens per second on the same hardware. The trade-off is per-request latency: individual requests run at under half their single-request speed at batch 16. For production deployments, the right batch size depends on your concurrency needs and acceptable latency. VRAM capacity determines the maximum viable batch size, making cards like the RTX 5090 and RTX 3090 ideal for high-concurrency serving.
Serve LLMs at Scale on Dedicated GPU Servers
High-VRAM GPU servers for multi-user LLM serving. UK hosting with full root access.
Browse GPU Servers