The RTX 3090 (Ampere) is older but still a credible production AI host. The 24 GB VRAM matters more than the architecture age.
3090 vLLM config: FP16 weights, max-num-seqs=64, max-model-len=16384, gpu-memory-utilization=0.92, prefix caching enabled. Roughly ~720 tok/s on Mistral 7B. Ampere has no FP8 tensor cores, so use AWQ-INT4 for 13B-class models.
Install
pip install vllm==0.6.3
# RTX 3090 (Ampere, compute capability 8.6); use NVIDIA driver 535+
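A quick sanity check before serving confirms the driver and the install both landed. This is a rough sketch: it assumes your nvidia-smi is recent enough to report compute_cap (the 3090 reports 8.6) and that the vllm wheel installed cleanly.
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
python -c "import torch, vllm; print(vllm.__version__, torch.cuda.get_device_name(0))"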
Config
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
--max-model-len 16384 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching
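Once the server is up, a smoke test against the OpenAI-compatible endpoint confirms it is answering. This assumes the default port 8000 and that the model name in the request matches the Hugging Face path passed to vllm serve.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32
  }'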
For 13B–14B-class models, switch to AWQ-INT4:
vllm serve hugging-quants/Qwen2.5-14B-Instruct-AWQ-INT4 \
--quantization awq_marlin \
--max-model-len 16384
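Back-of-envelope VRAM math shows why INT4 is the only way a 14B model fits alongside a useful KV cache on 24 GB. These are rough figures (≈2 bytes/param for FP16, ≈0.5 bytes/param for INT4, ignoring activation and CUDA-graph overhead), not measurements.
# ~7.2B  params x 2   bytes ≈ 14.5 GB -> 7B FP16 fits; ~7 GB left for KV cache at 0.92 x 24 GB
# ~14.8B params x 2   bytes ≈ 29.5 GB -> 14B FP16 does not fit in 24 GB
# ~14.8B params x 0.5 bytes ≈  7.4 GB -> 14B INT4 fits; ~13 GB left for KV cache
# Watch the actual allocation while the server warms up:
watch -n 2 nvidia-smi --query-gpu=memory.used,memory.total --format=csv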
Verdict
The 3090 is the cheapest 24 GB GPU for FP16 production serving. Skip it if you need FP8 or 32+ GB of VRAM.
Bottom line
The cheapest 24 GB option. See the RTX 3090 RAG guide.