No, the RTX 3090 cannot run Qwen 72B. Even in aggressive INT4 quantisation, Qwen 72B requires approximately 40GB of VRAM for weights alone, far exceeding the RTX 3090’s 24GB capacity. For Qwen hosting at the 72B scale, you need multi-GPU configurations or high-VRAM datacenter cards.
The Short Answer
NO. Qwen 72B exceeds 24GB in every usable quantisation format. The RTX 3090 can run Qwen 7B, 14B, and (in INT4) 32B instead.
Qwen 2.5 72B has 72.7 billion parameters. In FP16, this translates to roughly 145GB of VRAM. INT8 brings it to about 75GB, and INT4 reduces it to approximately 40GB. None of these fit within the RTX 3090’s 24GB. Even with aggressive 2-bit quantisation (which severely degrades quality), the model still needs around 22GB for weights with zero room for KV cache.
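These totals follow from a simple weights-only calculation: parameters times bytes per parameter. A minimal sketch (the function name is ours; published quantised files land a few GB higher than the raw figure because of mixed-precision layers and format overhead):

```python
# Weights-only VRAM: parameters × bytes per parameter, in decimal GB.
# KV cache, activations, and CUDA context all come on top of this figure.
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"Qwen 2.5 72B {label}: ~{weight_vram_gb(72.7, bits):.0f} GB for weights")
```

This prints roughly 145, 73, and 36 GB respectively, matching the ~145GB / ~75GB / ~40GB figures above once real-world overhead is added.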
However, the RTX 3090 handles the Qwen 2.5 7B and 14B variants very well, and can run the 32B model entirely on-GPU in INT4, albeit with limited context.
VRAM Analysis
| Qwen Model | FP16 VRAM | INT8 VRAM | INT4 VRAM | RTX 3090 (24GB) |
|---|---|---|---|---|
| Qwen 2.5 7B | ~15GB | ~8GB | ~5GB | FP16 fits |
| Qwen 2.5 14B | ~29GB | ~15GB | ~9GB | INT8 or INT4 |
| Qwen 2.5 32B | ~65GB | ~34GB | ~18GB | INT4 fits |
| Qwen 2.5 72B | ~145GB | ~75GB | ~40GB | No |
| Qwen 2.5 72B (Q2_K, 2-bit) | – | – | ~22GB | No (no room for KV cache) |
The Qwen 2.5 32B in INT4 is the largest Qwen model that fits on the RTX 3090, using about 18GB for weights and leaving roughly 6GB of headroom for the KV cache, activations, and runtime overhead (enough for around 4096 tokens of context in practice). The 14B variant in INT8 is a better-balanced option with more context headroom. Check our Qwen VRAM requirements guide for all configurations.
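The KV-cache share of that headroom can be estimated from the model's attention configuration. A sketch, assuming Qwen 2.5 32B's published grouped-query attention settings (64 layers, 8 KV heads, head dimension 128, FP16 cache):

```python
# Per-token KV cache: 2 (K and V) × layers × kv_heads × head_dim × dtype bytes.
# Default values are Qwen 2.5 32B's published GQA config; treat them as
# assumptions for this sketch.
def kv_cache_gb(tokens: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return tokens * per_token_bytes / 1e9

print(f"4096-token KV cache: ~{kv_cache_gb(4096):.2f} GB")
```

Thanks to GQA the cache itself stays small (about 1GB at 4096 tokens here); the rest of the budget absorbs activations, CUDA context, and allocator fragmentation.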
Performance Benchmarks
| Model | GPU | Quantisation | Tokens/sec | Context |
|---|---|---|---|---|
| Qwen 2.5 7B | RTX 3090 (24GB) | FP16 | ~40 tok/s | 8192 |
| Qwen 2.5 14B | RTX 3090 (24GB) | INT8 | ~25 tok/s | 8192 |
| Qwen 2.5 32B | RTX 3090 (24GB) | INT4 | ~12 tok/s | 4096 |
| Qwen 2.5 72B | 2x RTX 3090 | INT4 | ~10 tok/s | 4096 |
| Qwen 2.5 72B | RTX 5090 (32GB) | INT4 | N/A (does not fit) | – |
The Qwen 7B at 40 tok/s and 14B at 25 tok/s both deliver excellent interactive performance on the RTX 3090. The 32B model at 12 tok/s is usable but slower. For the 72B, even dual 3090s only achieve about 10 tok/s due to inter-GPU communication overhead. Full data on our benchmarks page.
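To turn these rates into wall-clock expectations, divide response length by throughput. A quick sketch (the 500-token reply length is an illustrative assumption):

```python
# Rough generation time for a reply: output tokens ÷ decode throughput.
# Ignores prompt-processing time, which is usually much faster per token.
def response_time_s(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

for model, speed in [("7B FP16", 40), ("14B INT8", 25), ("32B INT4", 12)]:
    print(f"{model}: 500-token reply in ~{response_time_s(500, speed):.0f}s")
```

At 12 tok/s the 32B model takes about 42 seconds for a 500-token reply, which is why we call it usable rather than snappy.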
Setup Guide
For the largest Qwen model the RTX 3090 can handle (32B in INT4):
```bash
# Ollama: Qwen 2.5 32B in Q4_K_M
ollama run qwen2.5:32b-instruct-q4_K_M
```

```bash
# vLLM: Qwen 2.5 14B in AWQ (better balanced)
pip install vllm
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
The 14B variant in AWQ is the sweet spot for the RTX 3090: good quality, fast inference, and generous context. For the 32B model, stick with Ollama or llama.cpp, which handle the tight memory budget more gracefully than vLLM.
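Once the server is up, vLLM exposes an OpenAI-compatible API at `/v1/chat/completions`. A minimal sketch of the request body (the prompt is a placeholder; the model name must match the served model):

```python
import json

# Build a chat-completions request for the vLLM server started above.
payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [
        {"role": "user", "content": "Summarise the KV cache in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(body)
# POST this to http://localhost:8000/v1/chat/completions with curl, or point
# the openai Python client at base_url http://localhost:8000/v1.
```

Any OpenAI-style client works unchanged against this endpoint, which makes it easy to swap the local server in behind existing tooling.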
Recommended Alternative
Running Qwen 72B requires a multi-GPU setup. Two RTX 3090s with tensor parallelism via vLLM (`--tensor-parallel-size 2`) can handle it in INT4, or you can use our dedicated GPU servers with higher-VRAM datacenter GPUs. Even the RTX 5090 with 32GB cannot run the 72B solo.
If the 14B or 32B Qwen models meet your needs, the RTX 3090 is a great choice. For other models on this card, check whether it can run LLaMA 3 8B in FP16, run CodeLlama 34B, or run Mixtral 8x7B. For image generation combined with LLMs, see the SDXL plus LLM analysis. Our best GPU for LLM inference guide covers the full spectrum.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers