The RTX 4090 (Ada Lovelace) lacks native FP8 hardware. vLLM runs well on it for FP16 and AWQ-INT4, but you need different flags than you would use on a Blackwell card. This is the production config we ship to 4090 customers.
Optimal RTX 4090 vLLM config: FP16 weights, FP16 KV cache, max-num-seqs=64, max-model-len=16384, gpu-memory-utilization=0.92, prefix caching enabled. ~1,100 tok/s aggregate on Mistral 7B. AWQ-INT4 for larger 13B-class models.
Install
sudo apt install -y python3.10-venv
python3.10 -m venv ~/vllm && source ~/vllm/bin/activate
pip install vllm==0.6.3
Driver requirement: NVIDIA 535+ for Ada. CUDA toolkit 12.1+.
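Before launching, a quick sanity check (a minimal sketch; run it inside the venv from above) confirms the driver version and the CUDA build that vLLM's PyTorch will actually see:

# Driver and VRAM as reported by the NVIDIA driver
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# CUDA version compiled into the installed PyTorch, and whether the GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"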
Optimal launch config
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
--host 0.0.0.0 --port 8000 \
--max-model-len 16384 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--served-model-name mistral-7b
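Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint looks like this (the model name matches --served-model-name above; the prompt is just an example):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'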
Why each flag
- No --quantization: the 4090 has no FP8 hardware, so FP16 is the right default
- --max-model-len 16384: 16K context is plenty for most chatbot workloads; 32K eats into KV cache
- --max-num-seqs 64: 24 GB of VRAM fits more concurrent sequences than 16 GB cards
- --gpu-memory-utilization 0.92: standard headroom
- --enable-prefix-caching: roughly a free 30-50% throughput gain on chat workloads (see the metrics check after this list)
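To confirm prefix caching is actually paying off on your traffic, scrape vLLM's Prometheus endpoint and look at the cache counters. Treat this as a sketch: the exact metric names vary by vLLM version, so grep broadly first.

# List whatever prefix-cache metrics this vLLM build exposes
curl -s http://localhost:8000/metrics | grep -i prefix_cache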
For 13B-class models on the 4090:
vllm serve hugging-quants/Qwen2.5-14B-Instruct-AWQ-INT4 \
--quantization awq_marlin \
--max-model-len 16384 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.92
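While the AWQ server loads, it is worth watching VRAM to confirm the 0.92 utilization target holds with a 14B-class model on 24 GB. This is plain nvidia-smi polling, nothing vLLM-specific:

# Poll used vs. total VRAM once per second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv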
Verdict
The 4090 is the right home for FP16 7B-class chatbots and AWQ-INT4 13B-14B. For FP8 paths use a Blackwell card.
Bottom line
This is the config we ship on customer 4090 deployments. See the RTX 4090 spec breakdown.