
vLLM Slow Throughput: Optimization Checklist

Diagnose and fix slow vLLM throughput. Covers KV cache sizing, batch configuration, quantization, tensor parallelism tuning, and benchmark verification for production inference servers.

Symptom: vLLM Is Not Hitting Expected Throughput

Your vLLM server is running, requests complete, but throughput is disappointingly low. An 8B model on an RTX 6000 Pro should process hundreds of tokens per second, but you are seeing a fraction of that. Or worse, throughput degrades as concurrent users increase.

vLLM’s PagedAttention architecture is designed for high throughput, but only when configured correctly. Out-of-the-box settings are conservative. This checklist walks through every tunable parameter that affects tokens-per-second on a dedicated GPU server.

Step 1: Measure Your Baseline

Before optimising, establish baseline numbers:

# Install vLLM (ships both the server and the benchmark CLI)
pip install vllm

# Start the server in the background
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct &

# Wait for "Application startup complete" in the server log, then
# benchmark with concurrent requests. (On older releases without the
# `vllm bench` CLI, run benchmarks/benchmark_serving.py from the vLLM
# source tree with the same flags.)
vllm bench serve \
  --backend openai \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 100 \
  --request-rate 10

Record three metrics: tokens per second (throughput), time to first token (TTFT), and inter-token latency (ITL).
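Concretely, all three numbers fall out of per-token arrival times, which the benchmark reports for you. A minimal sketch of the definitions, using hypothetical timestamps for a single streamed request:

```python
# Deriving the three benchmark metrics from per-token arrival times.
# Timestamps below are hypothetical, for illustration only.

request_start = 0.00
token_times = [0.45, 0.48, 0.52, 0.55, 0.58]  # seconds at which each output token arrived

ttft = token_times[0] - request_start                # time to first token
itls = [b - a for a, b in zip(token_times, token_times[1:])]
itl = sum(itls) / len(itls)                          # mean inter-token latency
throughput = len(token_times) / (token_times[-1] - request_start)  # output tok/s

print(f"TTFT {ttft:.2f}s, mean ITL {itl * 1000:.1f}ms, throughput {throughput:.1f} tok/s")
```

TTFT is dominated by prefill, ITL by decode speed, and throughput by how well the scheduler keeps batches full; the optimisations below move them in different directions.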

Step 2: Check GPU Utilisation

watch -n 1 nvidia-smi

If GPU utilisation is below 80 percent during active inference, the GPU is being starved. Common reasons: insufficient batch size, CPU bottleneck in preprocessing, or network I/O blocking the request queue. Our GPU monitoring guide covers detailed profiling.
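If you would rather capture utilisation programmatically than eyeball `watch`, nvidia-smi's `--query-gpu` interface is scriptable. A minimal sketch (the 80 percent threshold mirrors the rule of thumb above):

```python
import subprocess

def parse_utilisation(sample: str) -> list[int]:
    """Parse output of: nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits"""
    return [int(line.strip()) for line in sample.strip().splitlines()]

def poll_gpus() -> list[int]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilisation(out)

if __name__ == "__main__":
    try:
        gpus = poll_gpus()
    except FileNotFoundError:
        gpus = []  # nvidia-smi not present on this machine
    for idx, util in enumerate(gpus):
        if util < 80:
            print(f"GPU {idx}: {util}% utilisation - likely starved")
```

Sample this during active load, not at idle; a single low reading between batches is normal.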

Step 3: Maximise KV Cache

More KV cache blocks means more concurrent sequences, which means better batching:

--gpu-memory-utilization 0.95 \
--max-model-len 4096

If your workload does not need long contexts, reducing max-model-len dramatically increases the number of requests that can be processed simultaneously. See our vLLM memory and throughput guide for detailed KV cache calculations.
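The trade-off is simple arithmetic. A back-of-envelope sketch for Llama 3.1 8B (32 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache); the 20 GB free-for-cache figure is an assumption, not a measurement:

```python
# KV cache sizing for Llama 3.1 8B: bytes per cached token, then how many
# full-length sequences fit in the VRAM left over after weights.

layers, kv_heads, head_dim, bytes_per_el = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # 2 = K and V

free_for_cache_gb = 20  # hypothetical VRAM left after weights at 0.95 utilisation
cache_tokens = free_for_cache_gb * 1024**3 // kv_bytes_per_token

max_model_len = 4096
concurrent_seqs = cache_tokens // max_model_len
print(f"{kv_bytes_per_token} B/token -> {cache_tokens} cacheable tokens "
      f"-> ~{concurrent_seqs} full-length sequences")
```

Halving max-model-len to 2048 doubles the worst-case concurrency from the same cache, which is why this flag has such an outsized effect on batch occupancy.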

Step 4: Enable Prefix Caching

If many requests share a common system prompt:

--enable-prefix-caching

Prefix caching stores shared prompt KV values once and reuses them across requests. For chatbot applications where every request includes the same system instructions, this can halve the per-request compute cost.
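The saving is easy to estimate: the shared prefix is prefilled once instead of once per request. An illustrative sketch with assumed token counts:

```python
# Prefill work with and without prefix caching. Token counts are
# illustrative; the bigger the shared prefix relative to the unique
# suffix, the bigger the win.

system_prompt_tokens = 700   # shared prefix, identical across requests
user_tokens = 100            # unique per request
requests = 1000

without_cache = requests * (system_prompt_tokens + user_tokens)
with_cache = system_prompt_tokens + requests * user_tokens

saving = 1 - with_cache / without_cache
print(f"prefill tokens: {without_cache} -> {with_cache} ({saving:.0%} saved)")
```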

Step 5: Tune Batching Parameters

--max-num-seqs 256 \
--max-num-batched-tokens 32768

max-num-seqs controls maximum concurrent sequences. max-num-batched-tokens caps the total tokens processed per iteration. Higher values improve throughput at the cost of latency. For throughput-first workloads on your GPU server, push these up until VRAM limits become the constraint.
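Roughly speaking (and ignoring chunked prefill, which blends the two phases), decode steps are capped by max-num-seqs, since each running sequence contributes one token per step, while prefill steps are capped by the token budget. A sketch with assumed figures showing which knob binds in each phase:

```python
# Which limit binds? Decode: one token per running sequence per step.
# Prefill: whole prompts are batched until the token budget is exhausted.
# avg_prompt_len is an assumption about the workload.

max_num_seqs = 256
max_num_batched_tokens = 32768
avg_prompt_len = 512

decode_tokens_per_step = max_num_seqs  # 1 token per sequence
prefill_prompts_per_step = min(max_num_seqs,
                               max_num_batched_tokens // avg_prompt_len)
print(decode_tokens_per_step, prefill_prompts_per_step)
```

With these figures the prompt budget (64 prompts per prefill step) binds long before the sequence cap, so long-prompt workloads benefit more from raising max-num-batched-tokens than max-num-seqs.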

Step 6: Consider Quantization for Throughput

Counterintuitively, quantized models can be faster than FP16 ones, because token generation is memory-bandwidth-bound: fewer bytes per parameter means fewer bytes to stream per generated token:

--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq

AWQ 4-bit models use one quarter of the memory bandwidth per parameter, which directly improves throughput on bandwidth-limited GPUs. The quality trade-off is minimal for most inference tasks.
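A back-of-envelope upper bound makes the point: every decoded token has to stream the full weights through VRAM. The sketch below assumes an RTX 5090 at roughly 1792 GB/s and ignores KV cache traffic and kernel overheads, so treat the numbers as ceilings, not predictions:

```python
# Single-sequence decode ceiling: memory bandwidth / model size in bytes.
# 1792 GB/s is the quoted RTX 5090 figure; real throughput lands below
# this because KV cache reads and overheads also consume bandwidth.

params = 8e9          # 8B model
bandwidth = 1792e9    # bytes/s

for label, bytes_per_param in [("FP16", 2.0), ("AWQ 4-bit", 0.5)]:
    model_bytes = params * bytes_per_param
    print(f"{label}: ~{bandwidth / model_bytes:.0f} tok/s per sequence ceiling")
```

Batching multiplies aggregate throughput well past the per-sequence ceiling, since one pass over the weights serves every sequence in the batch; the 4x byte reduction still shifts the whole curve up.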

Step 7: Multi-GPU Scaling

Tensor parallelism across GPUs increases throughput for large models:

--tensor-parallel-size 2

However, for smaller models that fit on a single GPU, running two separate vLLM instances (one per GPU) with a load balancer gives better aggregate throughput than tensor parallelism. Tensor parallelism adds inter-GPU communication overhead that is only worth it when the model does not fit on one card.
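A minimal sketch of the two-instance pattern: round-robin dispatch over two single-GPU servers. The ports are assumptions; in production, an HTTP load balancer such as nginx or HAProxy does this job in front of the OpenAI-compatible endpoints:

```python
# Round-robin over two independent vLLM instances (one per GPU) instead of
# one tensor-parallel server. Ports 8000/8001 are assumed launch settings.
from itertools import cycle

backends = cycle(["http://localhost:8000", "http://localhost:8001"])

def next_backend() -> str:
    """Pick the instance to send the next /v1/completions request to."""
    return next(backends)

print(next_backend(), next_backend(), next_backend())
```

Because the instances share nothing, each keeps its own KV cache and prefix cache; sticky routing by conversation ID improves prefix-cache hit rates if your requests share long prompts.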

Step 8: Verify Improvements

Re-run the benchmark after each change:

vllm bench serve \
  --backend openai \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 500 \
  --request-rate 50

Compare throughput, TTFT, and ITL against your baseline. Optimisations should show clear, measurable gains. If throughput plateaus, you have hit a hardware limit — either GPU compute, memory bandwidth, or PCIe throughput. Consider upgrading to a higher-tier GPU server.

Production-Optimised Configuration

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --dtype half \
  --port 8000

Protect this endpoint with API security, set up production service management, and configure continuous monitoring for sustained throughput tracking. For PyTorch workloads that need custom inference logic beyond vLLM, similar batching and memory principles apply.

High-Throughput GPU Servers

GigaGPU servers feature enterprise GPUs with the memory bandwidth that vLLM needs for maximum token throughput.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
