
AI Inference: Batch Throughput vs Latency Trade-Off Explained

Continuous batching trades latency for throughput. The right point on that curve depends on your workload. Here is how to tune it.

vLLM's continuous batching is the throughput superpower of modern LLM serving. It also adds latency variance. The trade-off matters in production.

TL;DR

For high-concurrency workloads: raise --max-num-seqs and --max-num-batched-tokens. For latency-sensitive workloads: lower --max-num-seqs and prioritise per-request latency over aggregate throughput. The right answer depends on your traffic profile.

The trade-off

  • Larger batches → higher aggregate throughput → higher per-request latency variance
  • Smaller batches → lower aggregate throughput, more consistent per-request latency
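
To see where your deployment sits on this curve, measure per-request latency at a few concurrency levels against your own endpoint. Below is a minimal sketch, assuming a vLLM OpenAI-compatible server already listening on http://localhost:8000; the model name, prompt, and concurrency sweep are placeholders to swap for your own traffic.

import asyncio
import statistics
import time

import httpx  # assumption: any async HTTP client works; httpx is used here for brevity

URL = "http://localhost:8000/v1/completions"   # assumed vLLM OpenAI-compatible endpoint
MODEL = "your-served-model"                    # placeholder model name
PROMPT = "Explain continuous batching in one paragraph."

async def one_request(client: httpx.AsyncClient) -> float:
    # Send a single completion request and return end-to-end latency in seconds.
    start = time.perf_counter()
    resp = await client.post(
        URL,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 128},
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

async def sweep(concurrency: int) -> None:
    # Fire `concurrency` requests at once and report latency percentiles.
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(
            *(one_request(client) for _ in range(concurrency))
        ))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"concurrency={concurrency:4d}  p50={p50:.2f}s  p95={p95:.2f}s")

if __name__ == "__main__":
    for c in (1, 8, 32, 128):  # placeholder sweep; match it to your real traffic
        asyncio.run(sweep(c))

Watch how p95 grows as concurrency rises: that growth is the latency cost you pay for the extra aggregate throughput.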

Tuning knobs

  • --max-num-seqs: max concurrent sequences (lower = better latency, higher = better throughput)
  • --max-num-batched-tokens: per-step token budget (smaller = lower max latency)
  • --enable-chunked-prefill: split long prompts to reduce prefill latency spikes
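
These flags map onto vLLM's Python engine arguments, so you can also experiment offline before changing your serving command. A minimal sketch, assuming vLLM is installed; the model name and the specific values are placeholders to tune, not recommendations:

from vllm import LLM, SamplingParams

# Latency-leaning engine configuration using the same knobs as the CLI flags above.
llm = LLM(
    model="your-served-model",      # placeholder: the model you actually serve
    max_num_seqs=32,                # --max-num-seqs: cap on concurrent sequences per step
    max_num_batched_tokens=4096,    # --max-num-batched-tokens: per-step token budget
    enable_chunked_prefill=True,    # --enable-chunked-prefill: split long prompts across steps
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)

For a throughput-oriented batch job you would raise max_num_seqs and the token budget instead, as the verdict below suggests.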

Verdict

Tune by traffic profile. Latency-sensitive chatbots: --max-num-seqs ~32. High-throughput batch jobs: --max-num-seqs 128+.

Bottom line

The default isn't universal. Tune for your workload. See our batch size tuning guide.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
