Continuous batching is the feature that makes vLLM's serving economics work. It is also the feature most teams leave poorly tuned. On dedicated GPU servers, a few hours of tuning typically returns 30-50% more throughput from the same hardware. Here is the recipe.
What Continuous Batching Does
Without continuous batching, a serving engine picks a batch of requests, runs them to completion, then picks the next batch. If requests have very different output lengths, the short ones wait for the long ones. Continuous batching lets finished sequences drop out and new sequences join in mid-flight. Utilisation goes way up.
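The difference is easy to see in a toy simulation. This is an illustrative sketch only, not how vLLM's scheduler actually works: each request needs `length` decode steps, the GPU has four batch slots, and the workload mixes long and short outputs.

```python
def static_batching_steps(lengths, batch_size=4):
    """Static batching: the whole batch waits for its longest member."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size=4):
    """Continuous batching: a finished sequence frees its slot immediately."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        # New sequences join mid-flight as soon as a slot opens.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Sequences that reach zero remaining tokens drop out.
        active = [n - 1 for n in active if n > 1]
    return steps

mixed = [512, 8, 8, 8, 512, 8, 8, 8]   # two long requests among short ones
print(static_batching_steps(mixed))      # → 1024: short requests wait on long
print(continuous_batching_steps(mixed))  # → 520: slots recycle as work finishes
```

With this mix, continuous batching roughly halves the total step count; the more skewed the output-length distribution, the larger the gap.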
The Knobs
--max-num-seqs 128
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.92
--max-num-seqs: maximum number of concurrent sequences. Higher means more concurrency but more KV cache pressure. The default of 128 is a reasonable start; the right value usually lands between 32 and 256 depending on model and GPU.
--max-num-batched-tokens: cap on prefill tokens processed per iteration. Large prompts get chunked across steps. Higher values speed up long-prompt workloads but risk starving decode.
--gpu-memory-utilization: fraction of VRAM vLLM will claim. Default is 0.9. Raise to 0.93-0.95 only if you know the exact VRAM footprint of everything else on the GPU.
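Put together, a starting command looks like the following. The model name and port are placeholders; substitute your own.

```shell
# Illustrative launch only; swap in your model path and port.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.92 \
  --port 8000
```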
How to Measure
Run a load test with realistic prompt and completion distributions. Record four metrics:
- Requests per second (RPS) at saturation
- Tokens per second aggregate
- p50 and p99 time-to-first-token
- p50 and p99 inter-token latency
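The percentile bookkeeping for these metrics takes only a few lines. A minimal sketch, with invented sample values; the field names here are assumptions, not a vLLM API:

```python
import statistics

def percentiles(samples):
    """Return (p50, p99) of a list of timings using interpolated quantiles."""
    qs = statistics.quantiles(samples, n=100, method="inclusive")
    return qs[49], qs[98]  # cut points 50 and 99 of 99

# Example per-request time-to-first-token measurements, in milliseconds.
ttft_ms = [38, 41, 45, 52, 60, 75, 90, 120, 400, 950]
p50, p99 = percentiles(ttft_ms)
print(p50, p99)  # → 67.5 900.5
```

Note how two slow outliers barely move the p50 but dominate the p99; that gap is exactly what the tuning heuristics below key off.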
If aggregate tokens/sec is low and p99 TTFT is high, you are prefill-bound: raise max-num-batched-tokens. If aggregate is low, TTFT is fine, but inter-token latency drifts up, raise max-num-seqs. If you hit OOM, lower one or the other.
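That decision tree can be written down directly. A hedged sketch; the thresholds and the `oom` flag are placeholders you would set from your own baseline run, not values from vLLM:

```python
def suggest_adjustment(tokens_per_sec, p99_ttft_ms, itl_drifting, oom,
                       tps_target=5000, ttft_budget_ms=500):
    """Encode the tuning heuristic: thresholds are illustrative placeholders."""
    if oom:
        return "lower max-num-seqs or max-num-batched-tokens"
    if tokens_per_sec < tps_target and p99_ttft_ms > ttft_budget_ms:
        return "raise max-num-batched-tokens (prefill-bound)"
    if tokens_per_sec < tps_target and itl_drifting:
        return "raise max-num-seqs (decode concurrency too low)"
    return "keep current settings; re-measure"

# Low aggregate throughput with high p99 TTFT → prefill-bound.
print(suggest_adjustment(3200, 900, itl_drifting=False, oom=False))
```

One branch fires per run; retune after each change rather than adjusting several knobs at once.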
Recipe
For a 5090 serving Mistral 7B INT8, start here:
--max-num-seqs 128
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.92
--enable-prefix-caching
For a 6000 Pro serving Llama 3 70B INT4, start with:
--max-num-seqs 64
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.93
--enable-prefix-caching
Measure, then adjust one knob at a time. From there, look into prefix caching gains, chunked prefill, and batch sizing on multi-GPU setups.