
vLLM Continuous Batching Tuning Guide

The three knobs that actually move vLLM throughput, how to measure their effect, and a tuning recipe for common workloads.

vLLM's continuous batching is the feature that makes serving economics viable. It is also the feature most teams leave poorly tuned. On dedicated GPU servers, a few hours of tuning typically returns 30-50% more throughput from the same hardware. Here is the recipe.


What Continuous Batching Does

Without continuous batching, a serving engine picks a batch of requests, runs them to completion, then picks the next batch. If requests have very different output lengths, the short ones wait for the long ones. Continuous batching lets finished sequences drop out and new sequences join in mid-flight. Utilisation goes way up.
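The effect is easy to see in a toy model. The sketch below is illustrative only, not vLLM internals: it assumes one decoded token per step per sequence and counts how many GPU steps a mixed workload needs under each policy. The function names are ours.

```python
import heapq

def static_batch_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batch_steps(output_lens, max_seqs):
    """Continuous batching: a finished sequence frees its slot immediately,
    and a waiting sequence joins mid-flight."""
    pending = list(output_lens)
    running = []  # min-heap of finish times for in-flight sequences
    t = 0
    while pending or running:
        while pending and len(running) < max_seqs:
            heapq.heappush(running, t + pending.pop(0))
        t = heapq.heappop(running)  # advance to the next completion
    return t
```

With four requests of output lengths [10, 200, 10, 200] and concurrency 2, static batching needs 400 steps while continuous batching finishes in 220, because the short sequences hand their slots over without waiting for the long ones.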

The Knobs

--max-num-seqs 128
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.92

--max-num-seqs: maximum number of concurrent sequences. Higher values mean more concurrency but more KV cache pressure. The default is 128; the right value is usually in the 32-256 range, depending on model and GPU.

--max-num-batched-tokens: cap on the number of prefill tokens processed per iteration. Large prompts get chunked across steps. Higher values speed up long-prompt workloads but risk starving decode.

--gpu-memory-utilization: the fraction of VRAM vLLM will claim. The default is 0.9. Raise it to 0.93-0.95 only if you know the exact VRAM footprint of everything else on the GPU.
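To see why --max-num-seqs and --gpu-memory-utilization interact, it helps to estimate KV cache cost per token. A back-of-envelope sketch, assuming a Mistral-7B-like configuration (32 layers, 8 KV heads, head dim 128, fp16 cache) — check your model's config.json for the real numbers:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores a K and a V vector of n_kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                         # 131072 bytes, i.e. 128 KiB per token
print(128 * 2048 * per_token / 2**30)    # 32.0 GiB for 128 seqs of 2048 tokens
```

At 128 concurrent sequences with 2048-token contexts, that is 32 GiB of KV cache alone, which is why raising one knob usually means lowering another.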

How to Measure

Run a load test with realistic prompt and completion distributions. Record four metrics:

  • Requests per second (RPS) at saturation
  • Tokens per second aggregate
  • p50 and p99 time-to-first-token
  • p50 and p99 inter-token latency
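A minimal probe, assuming a local vLLM server exposing the OpenAI-compatible /v1/completions endpoint (the URL, model name, and prompt below are placeholders). It records TTFT and inter-token gaps for a single streamed request; a real load test fires many of these concurrently and aggregates the percentiles:

```python
import json
import time
import urllib.request

def percentile(samples, p):
    """Nearest-rank percentile, e.g. percentile(latencies, 99)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def probe(url="http://localhost:8000/v1/completions", model="my-model"):
    """Stream one completion; return (ttft_seconds, list_of_inter_token_gaps)."""
    body = json.dumps({"model": model, "prompt": "Hello", "max_tokens": 64,
                       "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    ttft, gaps, last = None, [], start
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # server-sent events, one "data: {...}" per line
            if not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start
            else:
                gaps.append(now - last)
            last = now
    return ttft, gaps
```

Run probe() in a thread pool at increasing concurrency, collect the TTFT and gap samples, and feed them to percentile() for the p50/p99 figures above.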

If aggregate tokens/sec is low but p99 TTFT is high, you are prefill-bound: raise --max-num-batched-tokens. If aggregate throughput is low, TTFT is fine, but inter-token latency drifts upward, raise --max-num-seqs. If you hit out-of-memory errors, lower one or the other.
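The same decision logic, written down as a tiny helper (the function name, argument names, and the fallback message are ours, purely for illustration):

```python
def suggest_adjustment(agg_tps_low, ttft_p99_high, itl_drifting_up, oom):
    """Map load-test symptoms to the knob to try first."""
    if oom:
        return "lower --max-num-seqs or --max-num-batched-tokens"
    if agg_tps_low and ttft_p99_high:
        return "raise --max-num-batched-tokens"  # prefill-bound
    if agg_tps_low and itl_drifting_up:
        return "raise --max-num-seqs"
    return "leave as-is; re-test under heavier load"
```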


Recipe

For an RTX 5090 serving Mistral 7B at INT8, start here:

--max-num-seqs 128
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.92
--enable-prefix-caching

For an RTX 6000 Pro serving Llama 3 70B at INT4, start here:

--max-num-seqs 64
--max-num-batched-tokens 16384
--gpu-memory-utilization 0.93
--enable-prefix-caching

Measure, adjust one knob at a time, and re-test against your own traffic. From there, look into prefix caching gains, chunked prefill behaviour, and batch sizing on multi-GPU setups.
