Yes, the RTX 3090 can run Mixtral 8x7B with INT4 quantisation. With 24GB of GDDR6X VRAM, the RTX 3090 fits this Mixture-of-Experts model when aggressively quantised (fully at 3-bit, or at 4-bit with a few layers offloaded to CPU), delivering usable inference for Mistral/Mixtral hosting. FP16 requires roughly 90GB and is out of reach for a single card, but INT4 brings the model within the 3090's capabilities.
The Short Answer
YES in INT4 quantisation with moderate context. NO in FP16 or INT8.
Mixtral 8x7B is a Mixture-of-Experts model with approximately 46.7B total parameters, though only ~13B are active per token thanks to expert routing. In FP16 the full model needs roughly 90GB of VRAM. INT8 drops that to about 47GB, still well beyond the RTX 3090's 24GB. INT4 (GPTQ/AWQ) compresses the weights to approximately 24-26GB, at or just beyond the card's 24GB limit, which is why partial offloading matters.
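The headline numbers follow from simple arithmetic: weight memory ≈ total parameters × bits per weight ÷ 8. A quick sketch of that estimate (the 46.7B figure is the parameter count quoted above; real quantised files add a few GB of overhead for embeddings, norms, and mixed-precision layers):

```bash
# Weight-only VRAM estimate: params * bits / 8, reported in GB.
# 46.7e9 = Mixtral 8x7B total parameter count across all 8 experts.
for bits in 16 8 4; do
  awk -v b="$bits" 'BEGIN { printf "%d-bit weights: %.1f GB\n", b, 46.7e9 * b / 8 / 1e9 }'
done
```

The 16-bit line lands at ~93GB and the 4-bit line at ~23GB, matching the FP16 and INT4 figures above once per-format overhead is added.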
With GGUF Q4_K_M quantisation and partial CPU offloading, Mixtral 8x7B becomes practical on the 3090 with context windows up to 4096 tokens. The MoE architecture actually helps here since only 2 of 8 experts are active per token, keeping compute efficient even with the large parameter count.
VRAM Analysis
| Quantisation | Model VRAM | KV Cache (4K ctx) | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| FP16 | ~90GB | ~2.5GB | ~92.5GB | No |
| INT8 | ~47GB | ~2.5GB | ~49.5GB | No |
| INT4 (GPTQ) | ~26GB | ~2.5GB | ~28.5GB | Needs offloading |
| Q4_K_M (GGUF) | ~24GB | ~2.5GB | ~26.5GB | Partial offload |
| Q3_K_M (GGUF) | ~20GB | ~2.5GB | ~22.5GB | Fits |
The Q3_K_M quantisation fits entirely in VRAM with room for context, but quality degrades noticeably at 3-bit. Q4_K_M is the better quality option with some layers offloaded to system RAM. With fast DDR5 system memory, the offloading penalty is tolerable. See the Mixtral VRAM requirements guide for full details.
Performance Benchmarks
| GPU | Quantisation | Tokens/sec (output) | Context |
|---|---|---|---|
| RTX 3090 (24GB) | Q4_K_M (partial offload) | ~15 tok/s | 4096 |
| RTX 3090 (24GB) | Q3_K_M (full GPU) | ~20 tok/s | 4096 |
| RTX 5090 (32GB) | Q4_K_M (full GPU) | ~35 tok/s | 8192 |
| 2x RTX 3090 | INT8 | ~28 tok/s | 8192 |
At 15-20 tok/s, Mixtral 8x7B on the RTX 3090 is usable for interactive chat. The MoE routing means compute scales with active parameters (13B), not total parameters (47B), so throughput is better than you might expect from the model size. Full speed data is on our benchmarks page.
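To put those throughput figures in user-facing terms, here is a back-of-envelope conversion to response latency (the 500-token reply length is an assumption, not a benchmark):

```bash
# Wall-clock time to generate a 500-token reply at each measured speed.
for tps in 15 20 35; do
  awk -v t="$tps" 'BEGIN { printf "%d tok/s -> %.0f s for a 500-token reply\n", t, 500 / t }'
done
```

At 15 tok/s a long answer takes about half a minute, which is the practical floor for interactive chat.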
Setup Guide
llama.cpp via Ollama handles the partial offloading transparently:
```bash
# Ollama: automatic quantisation and memory management
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M

# For more control with llama.cpp directly
./llama-server \
  -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -ngl 28 \
  -c 4096 \
  --host 0.0.0.0 --port 8080
```
The `-ngl 28` flag offloads 28 of the model's 32 layers to the GPU, keeping 4 on the CPU to stay within 24GB. Adjust this number while watching `nvidia-smi`: when reported usage sits around 22-23GB, you are at the optimal balance.
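Once the server is running, you can sanity-check it against llama-server's native HTTP completion endpoint. A minimal probe, assuming the server command above is listening locally (the `|| echo` fallback keeps it harmless if nothing is running):

```bash
# Probe llama-server's /completion endpoint on localhost:8080.
# Falls back to a stub JSON if no server is listening.
resp=$(curl -s --max-time 10 http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "n_predict": 16}' \
  || echo '{"error": "server not reachable"}')
echo "$resp"
```

A healthy server returns JSON containing the generated text; a connection failure prints the stub instead.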
For vLLM with a pre-quantised GPTQ model:
```bash
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
```
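vLLM exposes an OpenAI-compatible API, so any OpenAI client library can talk to it. A minimal curl check, assuming the serve command above is already running on port 8000 (the fallback makes the probe safe to run when it is not):

```bash
# Query vLLM's OpenAI-compatible completions endpoint.
# Stub fallback keeps this harmless when no server is up.
resp=$(curl -s --max-time 10 http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
        "prompt": "Write a haiku about VRAM:",
        "max_tokens": 32
      }' \
  || echo '{"error": "server not reachable"}')
echo "$resp"
```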
Recommended Alternative
If you want Mixtral 8x7B with full VRAM residency and longer context, the RTX 5090 with 32GB fits the Q4_K_M model entirely in VRAM with room for 8K+ context. For FP16 or INT8 precision, dual-GPU setups through our dedicated GPU servers are the path forward.
If Mixtral is too large for your use case, consider running Mistral 7B on this card instead, which fits in FP16 with generous context. See whether the RTX 3090 can run LLaMA 3 8B in FP16 or check the RTX 3090 CodeLlama 34B guide for coding workloads. Our best GPU for LLM inference guide covers all options across the GPU range.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers