
vLLM Setup on the RTX 4090 24 GB: The Production Config

The vLLM launch flags that actually matter on a 24 GB Ada Lovelace card, tuned for the workloads the 4090 is good at, with FP16 doing the work that FP8 handles on newer silicon.

The RTX 4090 (Ada Lovelace) has FP8 tensor cores on paper, but FP8 serving on it is a second-class citizen next to newer datacenter cards. In practice, vLLM works great on this card for FP16 and AWQ-INT4, and you need different flags than a Blackwell card. This is the production config we ship to 4090 customers.

TL;DR

Optimal RTX 4090 vLLM config: FP16 weights, FP16 KV cache, max-num-seqs=64, max-model-len=16384, gpu-memory-utilization=0.92, prefix caching enabled. ~1,100 tok/s aggregate on Mistral 7B. AWQ-INT4 for larger 13B-class models.

Install

sudo apt install -y python3.10-venv
python3.10 -m venv ~/vllm && source ~/vllm/bin/activate
pip install vllm==0.6.3

Driver requirement: NVIDIA 535+ for Ada. CUDA toolkit 12.1+.
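
Before launching, it's worth a quick sanity check that the venv sees the card and the driver you expect (torch is pulled in as a vLLM dependency, so both commands work inside the venv):

# Confirm the GPU, driver version, and VRAM are what you think they are
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# Confirm the CUDA build torch was compiled against
python -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"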

Optimal launch config

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --served-model-name mistral-7b
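
Once the server reports it's ready, a one-line smoke test against the OpenAI-compatible endpoint confirms the whole path end to end (the model field matches --served-model-name):

# Send a single chat completion to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'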

Why each flag

  • No --quantization: FP16 is the right default; FP8 isn't a production path on this card, and AWQ only earns its keep when the weights won't fit
  • --max-model-len 16384: 16K context covers most chatbot workloads; 32K doubles the per-sequence KV-cache ceiling for little gain
  • --max-num-seqs 64: 24 GB of VRAM fits roughly twice the concurrent batch of a 16 GB card (see the back-of-envelope below)
  • --gpu-memory-utilization 0.92: standard headroom for the CUDA context and allocator overhead
  • --enable-prefix-caching: a free 30-50% throughput gain on chat workloads that reuse a system prompt
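
The 64-sequence figure isn't arbitrary. Here's a rough sketch of the KV-cache budget, assuming the standard Mistral 7B v0.3 shape (32 layers, 8 KV heads, head dim 128) and roughly 13.5 GiB of FP16 weights:

python3 - <<'EOF'
# Back-of-envelope KV-cache budget on a 24 GB card at 0.92 utilization.
# Assumed model shape: Mistral 7B v0.3 (32 layers, 8 KV heads, head_dim 128), FP16 cache.
kv_per_token = 2 * 32 * 8 * 128 * 2     # K+V bytes per cached token = 131,072 (~0.125 MiB)
budget_gib = 24 * 0.92 - 13.5           # memory cap minus ~13.5 GiB of FP16 weights
print(f"~{budget_gib * 2**30 / kv_per_token:,.0f} cacheable tokens")  # ~70K, before activation overhead
EOF

Call it ~70K cacheable tokens before activation and CUDA-graph overhead: comfortable for 64 concurrent chats of typical length, nowhere near 64 full 16K contexts at once, which is fine because the scheduler preempts rather than OOMs. vLLM logs the authoritative figure at startup as the GPU block count (16 tokens per block by default).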

For 13B-class models on the 4090:

vllm serve hugging-quants/Qwen2.5-14B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92
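
After the AWQ server comes up, two checks are worth the ten seconds: the engine-init log line should report quantization=awq_marlin (exact wording varies by vLLM version), and nvidia-smi should show vLLM holding roughly 92% of the card with no other compute processes on it:

# Verify ~92% of VRAM is claimed and vLLM is the only compute process
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv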

Verdict

The 4090 is the right home for FP16 7B-class chatbots and AWQ-INT4 13B-14B. For FP8 paths use a Blackwell card.

Bottom line

This is the config we ship on customer 4090 deployments. For the full hardware picture, see our RTX 4090 spec breakdown.
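
If you're running this unattended, a minimal systemd unit sketch works well. The user, venv path, and service name below are placeholders to adapt, assuming the venv from the install step sits at /home/vllm/vllm:

# /etc/systemd/system/vllm.service -- minimal sketch; User= and paths are placeholders
[Unit]
Description=vLLM OpenAI-compatible server (Mistral 7B)
After=network-online.target
Wants=network-online.target

[Service]
User=vllm
ExecStart=/home/vllm/vllm/bin/vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --served-model-name mistral-7b
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Then enable it the usual way:

sudo systemctl daemon-reload
sudo systemctl enable --now vllm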
