The RTX 4090 (Ada Lovelace) lacks native FP8 hardware. vLLM runs well on it for FP16 and AWQ-INT4, but you need different flags than you would use on a Blackwell card. This is the production config we ship to 4090 customers.
Optimal RTX 4090 vLLM config: FP16 weights, FP16 KV cache, max-num-seqs=64, max-model-len=16384, gpu-memory-utilization=0.92, prefix caching enabled. ~1,100 tok/s aggregate on Mistral 7B. AWQ-INT4 for larger 13B-class models.
Install
sudo apt install -y python3.10-venv
python3.10 -m venv ~/vllm && source ~/vllm/bin/activate
pip install vllm==0.6.3
Driver requirement: NVIDIA 535+ for Ada. CUDA toolkit 12.1+.
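Before launching, a quick sanity check (a minimal sketch; run it inside the venv from above) confirms the driver version and the CUDA build that vLLM's PyTorch will actually see:

# Driver and VRAM as reported by the NVIDIA driver
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# CUDA version compiled into the installed PyTorch, and whether the GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"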
Optimal launch config
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
--host 0.0.0.0 --port 8000 \
--max-model-len 16384 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--served-model-name mistral-7b
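Once the server is up, a minimal smoke test against vLLM's OpenAI-compatible endpoint looks like this (the model name matches --served-model-name above; the prompt is just an example):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'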
Why each flag
- No --quantization: the 4090 has no FP8 hardware, so FP16 is the right default
- --max-model-len 16384: 16K context is plenty for most chatbot workloads; 32K eats into KV cache
- --max-num-seqs 64: 24 GB of VRAM fits more concurrent sequences than 16 GB cards
- --gpu-memory-utilization 0.92: standard headroom
- --enable-prefix-caching: roughly a free 30-50% throughput gain on chat workloads (see the metrics check after this list)
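To confirm prefix caching is actually paying off on your traffic, scrape vLLM's Prometheus endpoint and look at the cache counters. Treat this as a sketch: the exact metric names vary by vLLM version, so grep broadly first.

# List whatever prefix-cache metrics this vLLM build exposes
curl -s http://localhost:8000/metrics | grep -i prefix_cache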
For 13B-class models on the 4090:
vllm serve hugging-quants/Qwen2.5-14B-Instruct-AWQ-INT4 \
--quantization awq_marlin \
--max-model-len 16384 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.92
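While the AWQ server loads, it is worth watching VRAM to confirm the 0.92 utilization target holds with a 14B-class model on 24 GB. This is plain nvidia-smi polling, nothing vLLM-specific:

# Poll used vs. total VRAM once per second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv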
Verdict
The 4090 is the right home for FP16 7B-class chatbots and AWQ-INT4 13B-14B. For FP8 paths use a Blackwell card.
Bottom line
This is the config we ship on customer 4090 deployments. See the RTX 4090 spec breakdown.