
Batch Size Impact on LLM Tokens/sec by GPU

Benchmark data showing how batch size affects LLM inference throughput across six GPUs, with total and per-request tokens per second analysis.

Batch Size Benchmark Overview

Batch size, the number of requests a server processes concurrently, is one of the most important tuning parameters for LLM serving. Increasing batch size improves total throughput (tokens/sec summed across all requests) but increases per-request latency. Running these tests on a dedicated GPU server with consistent hardware eliminates cloud-provider variability.

We benchmarked LLaMA 3 8B Instruct in INT4 (AWQ) using vLLM on GigaGPU servers with continuous batching enabled. Each request uses a 256-token prompt with 128 tokens of output. For precision-level comparisons, see our FP16 vs INT8 vs INT4 speed benchmark.

Throughput by Batch Size and GPU

| GPU | VRAM | Batch 1 (tok/s) | Batch 4 (tok/s) | Batch 8 (tok/s) | Batch 16 (tok/s) |
|---|---|---|---|---|---|
| RTX 3050 | 6 GB | 28 | 62 | OOM | OOM |
| RTX 4060 | 8 GB | 48 | 125 | 165 | OOM |
| RTX 4060 Ti | 16 GB | 58 | 168 | 280 | 385 |
| RTX 3090 | 24 GB | 72 | 215 | 370 | 520 |
| RTX 5080 | 16 GB | 105 | 310 | 505 | 680 |
| RTX 5090 | 32 GB | 138 | 420 | 710 | 980 |

Total throughput scales roughly 2-3x from batch 1 to batch 4, and continues increasing up to batch 16 where VRAM allows. The RTX 3050 runs out of memory at batch 8 because each concurrent request’s KV cache adds roughly 0.5-1GB. The RTX 5090 at batch 16 delivers 980 total tokens per second, nearly 7x its single-request throughput.
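The KV-cache arithmetic behind that OOM behaviour can be sketched as follows. This is a rough estimate, assuming the published LLaMA 3 8B architecture (32 layers, 8 grouped-query KV heads, head dimension 128) with an FP16 cache, and assuming the server reserves cache space up to a 4K context per request; actual usage depends on context length and vLLM's block allocator:

```python
# Rough per-request KV-cache estimate for LLaMA 3 8B (assumed architecture:
# 32 layers, 8 KV heads via GQA, head_dim 128, FP16 cache entries).
N_LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(tokens: int) -> int:
    # 2x for the separate K and V tensors at every layer.
    return 2 * N_LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * tokens

per_token_kb = kv_cache_bytes(1) / 1024          # 128 KB per token
per_request_mb = kv_cache_bytes(4096) / 1024**2  # ~512 MB at a 4K context

print(f"{per_token_kb:.0f} KB/token, {per_request_mb:.0f} MB per 4K-context request")
```

At roughly half a gigabyte per 4K-context request, eight concurrent requests alone would want ~4 GB of cache on top of the ~5 GB INT4 weights, which is why the 6 GB card falls over first.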

Per-Request Latency Impact

Higher batch sizes increase total throughput but slow down individual requests. Below we show per-request tokens per second (total throughput divided by batch size).

| Batch Size | RTX 3090 per-req (tok/s) | RTX 5080 per-req (tok/s) | RTX 5090 per-req (tok/s) |
|---|---|---|---|
| 1 | 72 | 105 | 138 |
| 4 | 54 | 78 | 105 |
| 8 | 46 | 63 | 89 |
| 16 | 33 | 43 | 61 |
| 32 | 22 | 30 | 42 |
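The per-request figures are simply total throughput divided by batch size. A quick sketch using the RTX 3090 totals from the first table (values rounded as the table rounds them):

```python
# Total throughput (tok/s) for the RTX 3090 at each batch size,
# taken from the benchmark table above.
rtx3090_total = {1: 72, 4: 215, 8: 370, 16: 520}

# Round half up so 520 / 16 = 32.5 matches the table's 33.
per_request = {b: int(t / b + 0.5) for b, t in rtx3090_total.items()}
print(per_request)  # {1: 72, 4: 54, 8: 46, 16: 33}
```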

At batch 16, per-request speed drops to roughly 30-45% of the single-request speed. For interactive chat applications, keep batch sizes under 8 to maintain responsive latency. For batch processing pipelines where total throughput matters more than individual speed, maximise batch size up to VRAM limits. See the tokens per second benchmark for single-request comparisons across more models.
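In wall-clock terms, per-request tok/s translates directly into how long a user waits for a reply. A small sketch using the RTX 3090's per-request rates and the 128-token output length from these runs (decode time only; prompt prefill is ignored):

```python
# Time to stream a 128-token reply at the RTX 3090's per-request rates
# (decode only; prompt-prefill time is not included).
OUTPUT_TOKENS = 128
per_request_tok_s = {1: 72, 8: 46, 16: 33}

for batch, rate in per_request_tok_s.items():
    print(f"batch {batch:>2}: {OUTPUT_TOKENS / rate:.1f} s per reply")
```

Going from batch 1 to batch 16 roughly doubles the wait, from under 2 seconds to nearly 4, which is the concrete cost behind the "keep interactive batches under 8" guidance.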

Cost Efficiency Analysis

| GPU | Max-Batch Total tok/s | Approx. Monthly Cost | Total tok/s per £ |
|---|---|---|---|
| RTX 3050 | 62 (batch 4) | ~£45 | 1.38 |
| RTX 4060 | 165 (batch 8) | ~£60 | 2.75 |
| RTX 4060 Ti | 385 (batch 16) | ~£75 | 5.13 |
| RTX 3090 | 520 (batch 16) | ~£110 | 4.73 |
| RTX 5080 | 680 (batch 16) | ~£160 | 4.25 |
| RTX 5090 | 980 (batch 16) | ~£250 | 3.92 |
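The tok/s-per-pound column is just max-batch throughput divided by monthly cost; ranking the table programmatically makes the ordering explicit:

```python
# (max-batch total tok/s, approx. monthly cost in GBP) from the table above.
gpus = {
    "RTX 3050": (62, 45),     "RTX 4060": (165, 60),
    "RTX 4060 Ti": (385, 75), "RTX 3090": (520, 110),
    "RTX 5080": (680, 160),   "RTX 5090": (980, 250),
}

# Sort by throughput per pound, best value first.
by_value = sorted(gpus.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (tok_s, cost) in by_value:
    print(f"{name:<12} {tok_s / cost:.2f} tok/s per £")
```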

The RTX 4060 Ti dominates cost efficiency for batched serving, thanks to its 16GB enabling batch 16 at a low monthly cost. The RTX 3090 provides 35% more absolute throughput for a moderate cost increase.

GPU Recommendations

  • Single-user chat: RTX 4060 — 48 tok/s at batch 1 is responsive, and batch 8 handles a few concurrent users.
  • Multi-user serving: RTX 4060 Ti — best cost efficiency for batched workloads with 16GB of KV cache space.
  • Production API: RTX 5080 — 680 tok/s total at batch 16 serves many concurrent requests.
  • High concurrency: RTX 5090 — 980 tok/s with room for batch 32+ thanks to 32GB VRAM.

For quantisation trade-offs, see the FP16 vs INT8 vs INT4 comparison. For larger models, check the RTX 5090 LLaMA 3 70B guide. For general GPU selection, see the best GPU for LLM inference guide or our cheapest GPU for inference analysis. Browse all results in the Benchmarks category.

Conclusion

Batch size is a powerful lever for LLM serving throughput. Moving from batch 1 to batch 16 can deliver 5-7x more total tokens per second on the same hardware. The trade-off is per-request latency, which drops to roughly a third of single-request speed at batch 16. For production deployments, the right batch size depends on your concurrency needs and acceptable latency. VRAM capacity determines the maximum viable batch size, making cards like the RTX 5090 and RTX 3090 ideal for high-concurrency serving.
