
Can RTX 3090 Run Qwen 72B?


No, the RTX 3090 cannot run Qwen 72B. Even in aggressive INT4 quantisation, Qwen 72B requires approximately 40GB of VRAM for weights alone, far exceeding the RTX 3090’s 24GB capacity. For Qwen hosting at the 72B scale, you need multi-GPU configurations or high-VRAM datacenter cards.

The Short Answer

No. Qwen 72B exceeds 24GB in every practical quantisation format. The RTX 3090 can run the Qwen 7B, 14B, and (in INT4) 32B variants instead.

Qwen 2.5 72B has 72.7 billion parameters. In FP16, this translates to roughly 145GB of VRAM. INT8 brings it to about 75GB, and INT4 reduces it to approximately 40GB. None of these fit within the RTX 3090’s 24GB. Even with aggressive 2-bit quantisation (which severely degrades quality), the model still needs around 22GB for weights with zero room for KV cache.
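The VRAM figures above follow from simple arithmetic: parameter count times bytes per parameter. A minimal sketch (weights only; real deployments add a few GB for KV cache, activations, and framework overhead, which is why quoted figures run slightly higher):

```python
def weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate VRAM for model weights alone, in decimal GB.

    Ignores KV cache, activations, and runtime overhead.
    """
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param  # 1e9 params x bytes / 1e9 bytes-per-GB

for bits in (16, 8, 4):
    print(f"Qwen 2.5 72B @ {bits}-bit: ~{weight_vram_gb(72.7, bits):.0f}GB weights")
```

At 16 bits this gives ~145GB, at 8 bits ~73GB, and at 4 bits ~36GB, all well beyond a single 24GB card.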

However, the RTX 3090 handles the Qwen 2.5 7B and 14B variants very well, and can run the 32B model in INT4 with partial offloading.

VRAM Analysis

| Qwen Model | FP16 VRAM | INT8 VRAM | INT4 VRAM | RTX 3090 (24GB) |
|---|---|---|---|---|
| Qwen 2.5 7B | ~15GB | ~8GB | ~5GB | FP16 fits |
| Qwen 2.5 14B | ~29GB | ~15GB | ~9GB | INT8 or INT4 |
| Qwen 2.5 32B | ~65GB | ~34GB | ~18GB | INT4 fits |
| Qwen 2.5 72B | ~145GB | ~75GB | ~40GB | No |
| Qwen 2.5 72B (2-bit Q2_K) | — | — | ~22GB | No (no room for KV cache) |

The Qwen 2.5 32B in INT4 is the largest Qwen model that fits on the RTX 3090, using about 18GB for weights and leaving 6GB for KV cache (roughly 4096 tokens of context). The 14B variant in INT8 is a better balanced option with more context headroom. Check our Qwen VRAM requirements guide for all configurations.
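How much context fits in the leftover VRAM depends on the model's KV cache footprint, which scales with layer count, KV head count, and context length. A minimal sketch of the standard formula, using illustrative dimensions (not Qwen's actual config):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, one vector of
    head_dim elements per KV head, per token, at FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Hypothetical 64-layer model with grouped-query attention (8 KV heads):
gqa = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, n_tokens=4096)
# Same model with full multi-head attention (64 KV heads):
mha = kv_cache_bytes(n_layers=64, n_kv_heads=64, head_dim=128, n_tokens=4096)
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB at 4096 tokens")
```

The gap between the two cases shows why GQA models stretch much further on a 24GB card: fewer KV heads means an 8x smaller cache here.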

Performance Benchmarks

| Model | GPU | Quantisation | Tokens/sec | Context |
|---|---|---|---|---|
| Qwen 2.5 7B | RTX 3090 (24GB) | FP16 | ~40 tok/s | 8192 |
| Qwen 2.5 14B | RTX 3090 (24GB) | INT8 | ~25 tok/s | 8192 |
| Qwen 2.5 32B | RTX 3090 (24GB) | INT4 | ~12 tok/s | 4096 |
| Qwen 2.5 72B | 2x RTX 3090 | INT4 | ~10 tok/s | 4096 |
| Qwen 2.5 72B | RTX 5090 (32GB) | N/A | Still too large | — |

The Qwen 7B at 40 tok/s and 14B at 25 tok/s both deliver excellent interactive performance on the RTX 3090. The 32B model at 12 tok/s is usable but slower. For the 72B, even dual 3090s only achieve about 10 tok/s due to inter-GPU communication overhead. Full data is available on our benchmarks page.

Setup Guide

For the largest Qwen model the RTX 3090 can handle (32B in INT4):

# Ollama: Qwen 2.5 32B in Q4_K_M
ollama run qwen2.5:32b-instruct-q4_K_M

# vLLM: Qwen 2.5 14B in AWQ (better balanced)
pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000

The 14B variant in AWQ is the sweet spot for the RTX 3090: good quality, fast inference, and generous context. For the 32B model, stick with Ollama or llama.cpp, which handle the tight memory budget more gracefully than vLLM.
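Once the vLLM server above is up, it exposes an OpenAI-compatible API on the host and port you passed it. A minimal client sketch using only the standard library (the endpoint URL assumes the `serve` command above; adjust host/port to match your setup):

```python
import json
import urllib.request

# Endpoint assumes the vllm serve command above (port 8000 on localhost).
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Summarise quantisation in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way by pointing its base URL at the local server.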

Running Qwen 72B requires multi-GPU setups. Two RTX 3090s with tensor parallelism via vLLM can handle it in INT4, or you can use our dedicated GPU servers with higher-VRAM datacenter GPUs. Even the RTX 5090 with 32GB cannot run the 72B solo.
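For the dual-3090 route, vLLM's `--tensor-parallel-size` flag splits the model across both cards. A sketch of the launch command; the AWQ repo name is illustrative, so check the exact model ID on Hugging Face before deploying:

```shell
# Qwen 72B in INT4 (AWQ) across two RTX 3090s via tensor parallelism.
# Model ID below is an assumption -- verify it on Hugging Face first.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
```

Expect the ~10 tok/s from the benchmarks table above; PCIe bandwidth between the two cards is the bottleneck.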

If the 14B or 32B Qwen models meet your needs, the RTX 3090 is a great choice. For other models on this card, check whether it can run LLaMA 3 8B in FP16, run CodeLlama 34B, or run Mixtral 8x7B. For image generation combined with LLMs, see the SDXL plus LLM analysis. Our best GPU for LLM inference guide covers the full spectrum.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

