vLLM exposes dozens of flags, and most users copy-paste a launch command without knowing what half of them do. This reference groups the flags that matter on dedicated GPU servers by what they affect, so you can tune by intent rather than by forum consensus.
## Flag Groups

### Model Loading
| Flag | Purpose |
|---|---|
| --model | Hugging Face path or local directory |
| --dtype | half, bfloat16, float16, auto |
| --quantization | awq, gptq, fp8, bitsandbytes, none |
| --trust-remote-code | Needed for some non-standard architectures |
| --revision | Pin a specific HF commit/branch |
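The loading flags combine into a single launch command. A minimal sketch using the OpenAI-compatible entrypoint — the model name and revision here are placeholders, not recommendations:

```shell
# Hedged example: substitute your own model path; only pass --trust-remote-code
# if the architecture actually requires it.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --revision main \
  --dtype bfloat16
```

Pinning `--revision` makes the deployment reproducible even if the upstream Hugging Face repo changes.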
### Memory
| Flag | Purpose |
|---|---|
| --gpu-memory-utilization | Fraction of VRAM to use. Default 0.9 |
| --max-model-len | Max total sequence length |
| --swap-space | CPU swap in GB (rarely useful) |
| --block-size | KV cache block granule (8, 16, 32) |
| --kv-cache-dtype | auto (follow model dtype) or an fp8 variant for KV cache compression |
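These flags trade VRAM between model weights and KV cache, so it helps to know what a token of cache actually costs. A back-of-envelope sketch in shell arithmetic — the layer, head, and dimension figures assume a Llama-3-8B-style config in fp16, not your model:

```shell
# Rough KV-cache cost per token: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# Assumed figures: Llama-3-8B-style config, fp16 cache. Adjust for your model.
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; DTYPE_BYTES=2
BYTES_PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES ))
echo "${BYTES_PER_TOKEN} bytes per token"   # 131072 bytes = 128 KiB per token
```

At 128 KiB per token, a 32k-token context needs about 4 GiB of KV cache on its own — which is why lowering --max-model-len or switching --kv-cache-dtype to fp8 frees meaningful headroom on memory-constrained cards.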
### Batching
| Flag | Purpose |
|---|---|
| --max-num-seqs | Max concurrent sequences |
| --max-num-batched-tokens | Prefill token budget per iteration |
| --enable-chunked-prefill | Split long prefills across steps |
| --enable-prefix-caching | Reuse KV for repeated prefixes |
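A throughput-oriented combination of the batching flags might look like the sketch below. The numeric values are illustrative starting points, not tuned defaults — the right numbers depend on your GPU and traffic mix:

```shell
# Illustrative batching config: raise concurrency, cap per-step prefill work,
# and reuse KV cache for shared prompt prefixes (e.g. a common system prompt).
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching
```

Chunked prefill keeps long prompts from stalling decode steps for other requests, while prefix caching pays off whenever many requests share an identical leading prompt.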
### Performance
| Flag | Purpose |
|---|---|
| --tensor-parallel-size | Split model across N GPUs |
| --speculative-model | Draft model for speculative decoding |
| --num-speculative-tokens | Tokens per speculation step |
| --enforce-eager | Disable CUDA graphs (use sparingly) |
| --use-v2-block-manager | Required for some features |
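For a large model on a multi-GPU box, tensor parallelism and speculative decoding often go together. A sketch assuming a 4-GPU node — the target/draft model pairing is an example, and the draft must share the target's tokenizer vocabulary:

```shell
# Assumed setup: 4 GPUs, 70B target model, smaller same-family draft model.
# --num-speculative-tokens controls how many draft tokens are proposed per step.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
  --num-speculative-tokens 5
```

More speculative tokens per step raise the potential speedup but also the cost of a rejected draft, so tune against your actual acceptance rate.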
### Serving
| Flag | Purpose |
|---|---|
| --host, --port | Network binding |
| --api-key | Bearer token for authentication |
| --served-model-name | Alias the model name in API responses |
| --chat-template | Override tokenizer chat template |
| --disable-log-requests | Stop logging every request (use in production) |
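Putting the serving flags together, a production-leaning sketch — the bind address, alias, and environment variable name are illustrative choices, not requirements:

```shell
# Illustrative serving config: bind on all interfaces, require a bearer token,
# expose a stable alias instead of the full HF path, and quiet the request log.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --served-model-name my-model \
  --disable-log-requests

# Clients then authenticate with the same token:
# curl -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/v1/models
```

Using `--served-model-name` lets you swap the underlying checkpoint later without breaking clients that hard-code the model name.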
## vLLM Preconfigured for Your Workload
We hand off UK dedicated GPU servers with vLLM flags tuned and tested.
Deeper dives on individual flags: continuous batching, chunked prefill, prefix caching, speculative decoding.