vLLM Deployment on the RTX 3090 24 GB: Production Recipe

The vLLM launch flags that work on Ampere — no FP8 hardware path, but 24 GB VRAM lets you run FP16 models comfortably.

Table of Contents

  1. Install
  2. Config
  3. Verdict

The RTX 3090 (Ampere) is older but still a credible production AI host. Its 24 GB of VRAM matters more than the architecture's age.

TL;DR

RTX 3090 vLLM config: FP16 weights, max-num-seqs=64, max-model-len=16384, gpu-memory-utilization=0.92, prefix caching. ~720 tok/s on Mistral 7B. No FP8 hardware path, so use AWQ-INT4 for 13B-class models.

Install

pip install vllm==0.6.3
# RTX 3090 needs NVIDIA driver 535+ (Ampere baseline)
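Before starting the server, it is worth confirming that the driver, the CUDA build of PyTorch, and vLLM itself all see the card. A minimal sanity check, assuming a standard Linux install with the Python environment already active:

# Driver version and VRAM as reported by the GPU
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# CUDA-enabled PyTorch and the installed vLLM version
python -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"
python -c "import vllm; print(vllm.__version__)"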

Config

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
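Once the server is up, it exposes an OpenAI-compatible API on port 8000 by default. A quick smoke test with curl (the prompt and max_tokens here are placeholders, adjust as needed):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'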

For 13B/14B-class models, switch to an AWQ-INT4 build:

vllm serve hugging-quants/Qwen2.5-14B-Instruct-AWQ-INT4 \
  --quantization awq_marlin \
  --max-model-len 16384
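A 14B AWQ build plus a 16K context leaves limited headroom on 24 GB, so it helps to keep an eye on VRAM while you load-test. One simple way, refreshing every second:

# Live VRAM use and GPU load while the server handles requests
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv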

Verdict

The RTX 3090 is the cheapest 24 GB GPU for FP16 production. Skip it if you need FP8 hardware or 32 GB+ of VRAM.

Bottom line

The cheapest route to 24 GB of VRAM. For retrieval workloads on the same card, see the RTX 3090 RAG guide.
