
vLLM Engine Args Reference – What Each Flag Actually Does

A compressed reference to the vLLM engine flags that matter in production, grouped by what they actually affect.

vLLM exposes dozens of engine flags, and most users copy-paste a launch command without knowing what half of them do. This reference groups the flags that matter on dedicated GPU servers by what they affect, so you can tune by intent rather than by forum consensus.

Flag Groups

Model Loading

| Flag | Purpose |
| --- | --- |
| `--model` | Hugging Face model ID or local directory |
| `--dtype` | `auto`, `half`/`float16`, `bfloat16`, `float32` (`half` and `float16` are the same) |
| `--quantization` | `awq`, `gptq`, `fp8`, `bitsandbytes`, or unset for none |
| `--trust-remote-code` | Needed for some non-standard architectures |
| `--revision` | Pin a specific HF commit/branch |
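A minimal launch sketch combining the loading flags above. The model name and revision are illustrative, not a recommendation:

```shell
# Load an AWQ-quantized model pinned to a specific branch.
# AWQ weights are already low-precision, so half precision activations fit well.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --dtype half \
  --revision main
```

Only add `--trust-remote-code` when the model card says it is required; it executes Python from the model repository.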

Memory

| Flag | Purpose |
| --- | --- |
| `--gpu-memory-utilization` | Fraction of VRAM to use. Default 0.9 |
| `--max-model-len` | Max total sequence length (prompt + generation) |
| `--swap-space` | CPU swap in GB (rarely useful) |
| `--block-size` | KV cache block granule (8, 16, 32) |
| `--kv-cache-dtype` | `auto` (model dtype) or an `fp8` variant for KV cache compression |
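A sketch of a memory-constrained setup, for example fitting a model onto a single 24 GB card. The values are illustrative starting points, not tested recommendations:

```shell
# Cap the context window and give vLLM most of the VRAM; compress the
# KV cache to fp8 to roughly double how many tokens it can hold.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-model-len 16384 \
  --kv-cache-dtype fp8
```

If the server OOMs at startup, lower `--gpu-memory-utilization` or `--max-model-len` first; `--swap-space` is almost never the right fix.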

Batching

| Flag | Purpose |
| --- | --- |
| `--max-num-seqs` | Max concurrent sequences |
| `--max-num-batched-tokens` | Prefill token budget per iteration |
| `--enable-chunked-prefill` | Split long prefills across steps |
| `--enable-prefix-caching` | Reuse KV for repeated prefixes |
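A throughput-oriented sketch for many short chat requests sharing a long system prompt. The numbers are illustrative and should be tuned against your own traffic:

```shell
# High concurrency with a bounded prefill budget per step, so long
# prompts (split by chunked prefill) can't starve ongoing decodes.
# Prefix caching reuses the KV cache for the shared system prompt.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching
```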

Performance

| Flag | Purpose |
| --- | --- |
| `--tensor-parallel-size` | Split model across N GPUs |
| `--speculative-model` | Draft model for speculative decoding |
| `--num-speculative-tokens` | Tokens per speculation step |
| `--enforce-eager` | Disable CUDA graphs (use sparingly) |
| `--use-v2-block-manager` | Required for some features |
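A multi-GPU sketch with speculative decoding. The model pairing and token count are illustrative; the draft model must share the target model's tokenizer family:

```shell
# Split a large model across two GPUs and speculate 5 tokens per step
# with a small draft model from the same family.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5
```

Speculative decoding trades extra VRAM (the draft model) for lower per-token latency; benchmark it on your workload before committing.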

Serving

| Flag | Purpose |
| --- | --- |
| `--host`, `--port` | Network binding |
| `--api-key` | Bearer token for authentication |
| `--served-model-name` | Alias the model name in API responses |
| `--chat-template` | Override tokenizer chat template |
| `--disable-log-requests` | Stop logging every request (use in production) |
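Putting the serving flags together, a production-leaning sketch (the alias `my-model` and the `VLLM_API_KEY` variable are illustrative):

```shell
# Bind publicly, require a bearer token, alias the model name, and
# stop per-request logging.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --api-key "$VLLM_API_KEY" \
  --served-model-name my-model \
  --disable-log-requests
```

Clients then call the OpenAI-compatible endpoint with the alias, e.g. `curl -H "Authorization: Bearer $VLLM_API_KEY" http://host:8000/v1/models`, which reports `my-model` rather than the Hugging Face path.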

vLLM Preconfigured for Your Workload

We hand off UK dedicated GPU servers with vLLM flags tuned and tested.

Browse GPU Servers

Deeper dives on individual flags: continuous batching, chunked prefill, prefix caching, speculative decoding.

