vLLM's continuous batching is the throughput superpower of modern LLM serving. It also adds latency variance. The trade-off matters in production.
For high-concurrency traffic: set --max-num-seqs and --max-num-batched-tokens high. For latency-sensitive traffic: lower --max-num-seqs and favor single-stream (per-request) speed. The right answer depends on your traffic profile.
The trade-off
- Larger batches → higher aggregate throughput → higher per-request latency variance
- Smaller batches → lower aggregate throughput, more consistent per-request latency
Tuning knobs
- --max-num-seqs: max concurrent sequences (lower = better latency, higher = better throughput)
- --max-num-batched-tokens: per-step token budget (smaller = lower max latency)
- --enable-chunked-prefill: split long prompts across steps to reduce prefill latency spikes
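As a rough sketch of how these knobs fit together for a latency-sensitive deployment (the model name and numbers are illustrative, not recommendations):

```bash
# Latency-sensitive sketch (illustrative values, not tuned defaults).
# Lower --max-num-seqs caps how many requests share each step (tighter latency);
# a smaller --max-num-batched-tokens caps per-step work (lower worst-case step time);
# --enable-chunked-prefill splits long prompts so one prefill can't stall a step.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 32 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill
```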
Verdict
Tune by traffic profile. Latency-sensitive chatbots: --max-num-seqs around 32. High-throughput batch jobs: --max-num-seqs 128+.
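For the batch-job end of the spectrum, a hedged counterpart to the latency-sensitive command above might look like this (again, model and numbers are illustrative):

```bash
# Throughput-oriented sketch: admit many sequences per step and give the
# scheduler a large per-step token budget; per-request latency variance rises.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192
```

Either way, validate against your own traffic: compare p50/p99 time-to-first-token and aggregate tokens per second before and after the change.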
Bottom line
The default isn't universal. Tune. See batch size tuning.