
vLLM Continuous Batching Tuning Guide

The three knobs that actually move vLLM throughput, how to measure their effect, and a tuning recipe for common workloads.

vLLM's continuous batching is the feature that makes serving economics viable. It is also the feature most teams leave poorly tuned. On dedicated GPU servers, a few hours of tuning typically returns 30-50% more throughput from the same hardware. Here is the recipe.


What Continuous Batching Does

Without continuous batching, a serving engine picks a batch of requests, runs them to completion, then picks the next batch. If requests have very different output lengths, the short ones wait for the long ones. Continuous batching lets finished sequences drop out and new sequences join in mid-flight. Utilisation goes way up.
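The effect is easy to see in a toy model. The sketch below is illustrative only, not vLLM internals: it assumes one decoded token per step per sequence and counts how many GPU steps a mixed workload needs under each policy. The function names are ours.

```python
import heapq

def static_batch_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batch_steps(output_lens, max_seqs):
    """Continuous batching: a finished sequence frees its slot immediately,
    and a waiting sequence joins mid-flight."""
    pending = list(output_lens)
    running = []  # min-heap of finish times for in-flight sequences
    t = 0
    while pending or running:
        while pending and len(running) < max_seqs:
            heapq.heappush(running, t + pending.pop(0))
        t = heapq.heappop(running)  # advance to the next completion
    return t
```

With four requests of output lengths [10, 200, 10, 200] and concurrency 2, static batching needs 400 steps while continuous batching finishes in 220, because the short sequences hand their slots over without waiting for the long ones.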

The Knobs

--max-num-seqs 128
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.92

--max-num-seqs: maximum number of concurrent sequences. Higher values mean more concurrency but more KV cache pressure. The default is 128; the right value is usually in the 32-256 range, depending on model and GPU.

--max-num-batched-tokens: cap on the number of prefill tokens processed per iteration. Large prompts get chunked across steps. Higher values speed up long-prompt workloads but risk starving decode.

--gpu-memory-utilization: the fraction of VRAM vLLM will claim. The default is 0.9. Raise it to 0.93-0.95 only if you know the exact VRAM footprint of everything else on the GPU.
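To see why --max-num-seqs and --gpu-memory-utilization interact, it helps to estimate KV cache cost per token. A back-of-envelope sketch, assuming a Mistral-7B-like configuration (32 layers, 8 KV heads, head dim 128, fp16 cache) — check your model's config.json for the real numbers:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores a K and a V vector of n_kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                         # 131072 bytes, i.e. 128 KiB per token
print(128 * 2048 * per_token / 2**30)    # 32.0 GiB for 128 seqs of 2048 tokens
```

At 128 concurrent sequences with 2048-token contexts, that is 32 GiB of KV cache alone, which is why raising one knob usually means lowering another.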

How to Measure

Run a load test with realistic prompt and completion distributions. Record four metrics:

  • Requests per second (RPS) at saturation
  • Tokens per second aggregate
  • p50 and p99 time-to-first-token
  • p50 and p99 inter-token latency
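A minimal probe, assuming a local vLLM server exposing the OpenAI-compatible /v1/completions endpoint (the URL, model name, and prompt below are placeholders). It records TTFT and inter-token gaps for a single streamed request; a real load test fires many of these concurrently and aggregates the percentiles:

```python
import json
import time
import urllib.request

def percentile(samples, p):
    """Nearest-rank percentile, e.g. percentile(latencies, 99)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def probe(url="http://localhost:8000/v1/completions", model="my-model"):
    """Stream one completion; return (ttft_seconds, list_of_inter_token_gaps)."""
    body = json.dumps({"model": model, "prompt": "Hello", "max_tokens": 64,
                       "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    ttft, gaps, last = None, [], start
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # server-sent events, one "data: {...}" per line
            if not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start
            else:
                gaps.append(now - last)
            last = now
    return ttft, gaps
```

Run probe() in a thread pool at increasing concurrency, collect the TTFT and gap samples, and feed them to percentile() for the p50/p99 figures above.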

If aggregate tokens/sec is low but p99 TTFT is high, you are prefill-bound: raise --max-num-batched-tokens. If aggregate throughput is low, TTFT is fine, but inter-token latency drifts upward, raise --max-num-seqs. If you hit out-of-memory errors, lower one or the other.
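The same decision logic, written down as a tiny helper (the function name, argument names, and the fallback message are ours, purely for illustration):

```python
def suggest_adjustment(agg_tps_low, ttft_p99_high, itl_drifting_up, oom):
    """Map load-test symptoms to the knob to try first."""
    if oom:
        return "lower --max-num-seqs or --max-num-batched-tokens"
    if agg_tps_low and ttft_p99_high:
        return "raise --max-num-batched-tokens"  # prefill-bound
    if agg_tps_low and itl_drifting_up:
        return "raise --max-num-seqs"
    return "leave as-is; re-test under heavier load"
```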


Recipe

For an RTX 5090 serving Mistral 7B at INT8, start here:

--max-num-seqs 128
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.92
--enable-prefix-caching

For an RTX 6000 Pro serving Llama 3 70B at INT4, start here:

--max-num-seqs 64
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.93
--enable-prefix-caching

Measure, adjust one knob at a time, and re-test against your own traffic. From there, look into prefix caching gains, chunked prefill behaviour, and batch sizing on multi-GPU setups.
