Symptom: vLLM Is Not Hitting Expected Throughput
Your vLLM server is running, requests complete, but throughput is disappointingly low. An 8B model on an RTX 6000 Pro should process hundreds of tokens per second, but you are seeing a fraction of that. Or worse, throughput degrades as concurrent users increase.
vLLM’s PagedAttention architecture is designed for high throughput, but only when configured correctly. Out-of-the-box settings are conservative. This checklist walks through every tunable parameter that affects tokens-per-second on a dedicated GPU server.
Step 1: Measure Your Baseline
Before optimising, establish baseline numbers:
# Install vLLM (the benchmark tooling ships with it)
pip install vllm
# Start the server in the background
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct &
# Benchmark with concurrent requests
python -m vllm.benchmarks.benchmark_serving \
--backend openai \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--num-prompts 100 \
--request-rate 10
Record three metrics: tokens per second (throughput), time to first token (TTFT), and inter-token latency (ITL).
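If you are collecting your own timing data rather than relying on the benchmark script's summary, the three metrics reduce to simple arithmetic over per-request timestamps. A minimal sketch (the RequestTrace fields are illustrative, not vLLM's own output format):

```python
# Hypothetical sketch: computing throughput, TTFT, and ITL from raw
# per-request timing records. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float          # request sent (seconds)
    first_token: float    # first token received
    end: float            # last token received
    output_tokens: int

def summarise(traces: list[RequestTrace]) -> dict[str, float]:
    total_tokens = sum(t.output_tokens for t in traces)
    wall_time = max(t.end for t in traces) - min(t.start for t in traces)
    ttft = sum(t.first_token - t.start for t in traces) / len(traces)
    # Inter-token latency: average gap between tokens after the first one.
    itl = sum(
        (t.end - t.first_token) / max(t.output_tokens - 1, 1) for t in traces
    ) / len(traces)
    return {
        "throughput_tok_s": total_tokens / wall_time,
        "mean_ttft_s": ttft,
        "mean_itl_s": itl,
    }

traces = [
    RequestTrace(start=0.0, first_token=0.2, end=2.2, output_tokens=100),
    RequestTrace(start=0.5, first_token=0.8, end=2.5, output_tokens=80),
]
metrics = summarise(traces)
```

Note that throughput is measured over the whole wall-clock window, not per request: with concurrent requests, aggregate tokens per second can be high even while each individual stream is slow.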
Step 2: Check GPU Utilisation
watch -n 1 nvidia-smi
If GPU utilisation is below 80 percent during active inference, the GPU is being starved. Common reasons: insufficient batch size, CPU bottleneck in preprocessing, or network I/O blocking the request queue. Our GPU monitoring guide covers detailed profiling.
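For scripted checks, nvidia-smi's query mode emits machine-readable CSV (`nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits`). A small sketch of parsing that output and flagging starved GPUs, shown against a canned sample so the logic is clear without a live GPU:

```python
# Parse `nvidia-smi --query-gpu=utilization.gpu,memory.used
#        --format=csv,noheader,nounits` output: one line per GPU,
# "utilisation %, memory MiB".
def parse_smi_csv(text: str) -> list[dict[str, int]]:
    gpus = []
    for line in text.strip().splitlines():
        util, mem = (int(x.strip()) for x in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": mem})
    return gpus

sample = "62, 41230\n97, 40988\n"   # canned output for two GPUs
starved = [i for i, g in enumerate(parse_smi_csv(sample))
           if g["util_pct"] < 80]   # GPU 0 is below the 80% threshold
```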
Step 3: Maximise KV Cache
More KV cache blocks means more concurrent sequences, which means better batching:
--gpu-memory-utilization 0.95 \
--max-model-len 4096
If your workload does not need long contexts, reducing --max-model-len dramatically increases the number of requests that can be processed simultaneously, because each sequence's worst-case KV cache reservation shrinks. See our vLLM memory and throughput guide for detailed KV cache calculations.
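The back-of-envelope version of that calculation is straightforward. A sketch, assuming roughly 24 GB left for KV cache after weights (the real figure depends on your card, and vLLM allocates cache in fixed-size pages, so actual counts differ slightly):

```python
# Back-of-envelope KV cache sizing. Assumes ~24 GB free for cache after
# model weights; vLLM's paged allocator changes the exact numbers.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_full_length_seqs(cache_bytes: float, max_model_len: int,
                         per_token: int) -> int:
    return int(cache_bytes // (per_token * max_model_len))

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
per_tok = kv_bytes_per_token(32, 8, 128)      # 131072 bytes = 128 KiB/token
cache = 24e9                                   # assumed free cache bytes
seqs_4k = max_full_length_seqs(cache, 4096, per_tok)    # 44 full sequences
seqs_32k = max_full_length_seqs(cache, 32768, per_tok)  # only 5
```

Dropping max-model-len from 32768 to 4096 raises the worst-case concurrent sequence count roughly eightfold, which is exactly where the batching gains come from.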
Step 4: Enable Prefix Caching
If many requests share a common system prompt:
--enable-prefix-caching
Prefix caching stores the shared prompt's KV values once and reuses them across requests. For chatbot applications where every request includes the same system instructions, this can eliminate the bulk of each request's prefill compute.
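The size of the win is just the ratio of shared to total prompt tokens. A quick illustrative estimate (all token counts are made up for the example):

```python
# Rough estimate of prefill compute saved by prefix caching, assuming every
# request repeats the same system prompt. Token counts are illustrative.
def prefill_savings(shared_prefix_tokens: int, unique_tokens: int) -> float:
    total = shared_prefix_tokens + unique_tokens
    # After the first request, the shared prefix's KV values are reused.
    return shared_prefix_tokens / total

# A 600-token system prompt plus ~200 tokens of user input per request:
saving = prefill_savings(600, 200)   # 75% of prefill work skipped
```

The larger your system prompt relative to user input, the closer the flag gets you to decode-only cost per request.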
Step 5: Tune Batching Parameters
--max-num-seqs 256 \
--max-num-batched-tokens 32768
max-num-seqs controls maximum concurrent sequences. max-num-batched-tokens caps the total tokens processed per iteration. Higher values improve throughput at the cost of latency. For throughput-first workloads on your GPU server, push these up until VRAM limits become the constraint.
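To see how the two limits interact, consider one scheduling step: each sequence in the decode phase consumes one token of the batch budget per iteration, and leftover budget can go to prefilling newly arrived requests. A simplified sketch (vLLM's actual scheduler is considerably more sophisticated):

```python
# Simplified view of one continuous-batching step: decoding sequences cost
# 1 token each; remaining budget is available for prefill. Illustrative only.
def step_budget(max_num_seqs: int, max_batched_tokens: int,
                decoding_seqs: int) -> dict[str, int]:
    decoding = min(decoding_seqs, max_num_seqs)
    return {
        "decode_tokens": decoding,
        "prefill_token_budget": max(max_batched_tokens - decoding, 0),
    }

b = step_budget(max_num_seqs=256, max_batched_tokens=32768, decoding_seqs=200)
# 200 tokens spent on decodes, 32568 left for prefilling waiting requests
```

This is why raising max-num-batched-tokens helps throughput but hurts ITL: a large prefill sharing the step with your decodes stretches that iteration for every active stream.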
Step 6: Consider Quantisation for Throughput
Counterintuitively, quantised models can be faster than FP16, because decoding is memory-bandwidth-bound: every generated token requires streaming the model weights, so smaller weights mean fewer bytes read per token:
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq
AWQ 4-bit models use one quarter of the memory bandwidth per parameter, which directly improves throughput on bandwidth-limited GPUs. The quality trade-off is minimal for most inference tasks.
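A rough roofline model makes the bandwidth argument concrete. The sketch below assumes a card with 960 GB/s of memory bandwidth (an illustrative figure) and ignores KV cache traffic and compute, so it is an upper bound, not a prediction:

```python
# Roofline-style upper bound on single-sequence decode speed: each token
# must stream the full weight set from VRAM. Bandwidth figure is assumed.
def decode_tok_s_upper_bound(params_b: float, bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

bw = 960.0                                    # GB/s, assumed
fp16 = decode_tok_s_upper_bound(8, 2.0, bw)   # FP16: 2 bytes/param
awq4 = decode_tok_s_upper_bound(8, 0.5, bw)   # AWQ 4-bit: 0.5 bytes/param
```

The 4-bit ceiling is four times the FP16 one; in practice dequantisation overhead and KV cache traffic eat into that, but the direction of the effect holds.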
Step 7: Multi-GPU Scaling
Tensor parallelism across GPUs increases throughput for large models:
--tensor-parallel-size 2
However, for smaller models that fit on a single GPU, running two separate vLLM instances (one per GPU) with a load balancer gives better aggregate throughput than tensor parallelism. Tensor parallelism adds inter-GPU communication overhead that is only worth it when the model does not fit on one card.
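The two-instance setup needs nothing more exotic than round-robin dispatch in front of the two ports. A minimal sketch (the localhost URLs are placeholders; any HTTP load balancer such as nginx does the same job):

```python
# Minimal round-robin selector for two single-GPU vLLM instances.
# Endpoint URLs are placeholders for wherever your instances listen.
import itertools

class RoundRobin:
    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        # Each call returns the next backend, wrapping around forever.
        return next(self._cycle)

lb = RoundRobin(["http://localhost:8000", "http://localhost:8001"])
picks = [lb.pick() for _ in range(4)]
```

Because the instances share nothing, aggregate throughput scales almost linearly with GPU count, with none of tensor parallelism's all-reduce traffic.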
Step 8: Verify Improvements
Re-run the benchmark after each change:
python -m vllm.benchmarks.benchmark_serving \
--backend openai \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--num-prompts 500 \
--request-rate 50
Compare throughput, TTFT, and ITL against your baseline. Optimisations should show clear, measurable gains. If throughput plateaus, you have hit a hardware limit — either GPU compute, memory bandwidth, or PCIe throughput. Consider upgrading to a higher-tier GPU server.
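When comparing runs, normalise each metric against the baseline so a single table tells you whether a change helped. A small sketch with made-up numbers:

```python
# Compare a tuned run against the baseline. Positive percentages are
# improvements: throughput up, latencies down. All values illustrative.
def improvement(baseline: dict[str, float],
                tuned: dict[str, float]) -> dict[str, float]:
    return {
        "throughput_pct": 100 * (tuned["tok_s"] / baseline["tok_s"] - 1),
        "ttft_pct": 100 * (1 - tuned["ttft_s"] / baseline["ttft_s"]),
        "itl_pct": 100 * (1 - tuned["itl_s"] / baseline["itl_s"]),
    }

delta = improvement(
    {"tok_s": 850.0, "ttft_s": 0.40, "itl_s": 0.050},   # baseline
    {"tok_s": 1700.0, "ttft_s": 0.30, "itl_s": 0.025},  # after tuning
)
```

Watch for trade-offs as well as wins: a change that doubles throughput while tripling TTFT may still be the wrong choice for an interactive workload.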
Production-Optimised Configuration
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-model-len 4096 \
--max-num-seqs 256 \
--enable-prefix-caching \
--dtype half \
--port 8000
Protect this endpoint with API security, set up production service management, and configure continuous monitoring for sustained throughput tracking. For PyTorch workloads that need custom inference logic beyond vLLM, similar batching and memory principles apply.
High-Throughput GPU Servers
GigaGPU servers feature enterprise GPUs with the memory bandwidth that vLLM needs for maximum token throughput.
Browse GPU Servers