Yes, the RTX 5090 can run Mixtral 8x7B in INT4 quantisation. With 32GB of GDDR7 VRAM, the RTX 5090 fits the Mixtral 8x7B mixture-of-experts model when quantised to 4-bit. The full FP16 model requires approximately 93GB and will not fit, but INT4 variants run well on this card at 32-38 tokens per second.
## The Short Answer
YES in INT4 (~26GB). NO in FP16 (~93GB) or INT8 (~48GB).
Mixtral 8x7B is a sparse mixture-of-experts model with 46.7B total parameters, of which roughly 12.9B are active per token (2 of 8 experts). Despite activating only a fraction of its parameters per token, ALL weights must be loaded into VRAM, because the router selects experts dynamically and any of the 8 may be needed for the next token. In FP16 that means ~93GB of weights. INT4 quantisation brings this down to roughly 24-26GB, which the RTX 5090 handles with enough room left for the KV cache. For a detailed overview, see our best GPU for LLM inference guide.
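The VRAM requirement follows directly from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic (the ~46.7B figure is from the Mixtral release; quantised formats add a few percent on top for scale metadata):

```python
# Back-of-the-envelope weight memory: parameter count x bytes per parameter.
# Mixtral 8x7B has ~46.7B total parameters, ALL of which must sit in VRAM.
TOTAL_PARAMS = 46.7e9

for precision, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")

# FP16: ~93 GB   INT8: ~47 GB   INT4: ~23 GB
# Real INT4 files land at ~24-26GB once quantisation scales and
# zero-points are included.
```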
## VRAM Analysis
| Configuration | Weights | KV Cache (4K ctx) | Total | RTX 5090 (32GB) |
|---|---|---|---|---|
| Mixtral 8x7B FP16 | ~93GB | ~2GB | ~95GB | No |
| Mixtral 8x7B INT8 | ~46GB | ~2GB | ~48GB | No |
| Mixtral 8x7B INT4 (AWQ) | ~24GB | ~2GB | ~26GB | Fits |
| Mixtral 8x7B INT4 (GPTQ) | ~25GB | ~2GB | ~27GB | Fits |
| Mixtral 8x7B INT4 (8K ctx) | ~24GB | ~4GB | ~28GB | Fits |
The AWQ format is recommended as it is slightly more compact and typically faster on consumer GPUs. Even at 8K context length, the model fits within 32GB with 4GB of headroom. The RTX 5090 is one of the few single-GPU options that makes Mixtral 8x7B practical.
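KV cache growth can be estimated from the model's attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128, per the published Mixtral config). The sketch below shows the raw tensor maths; serving engines such as vLLM preallocate a paged KV pool and need activation workspace on top of this, which is why the table budgets ~2GB rather than the raw ~0.5GB:

```python
# Raw KV cache size: 2 (K and V) x layers x KV heads x head_dim x bytes/elem.
# Config values are from the published Mixtral 8x7B architecture (GQA).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_gb(context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 128 KiB/token
    return context_tokens * per_token / 1e9

print(f"4K context: ~{kv_cache_gb(4096):.2f} GB raw K/V")  # ~0.54 GB
print(f"8K context: ~{kv_cache_gb(8192):.2f} GB raw K/V")  # ~1.07 GB
# Budget extra for paged-cache preallocation, activations, and CUDA context.
```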
## Performance Benchmarks
| GPU | Mixtral 8x7B INT4 (tok/s) | Notes |
|---|---|---|
| RTX 3090 (24GB) | N/A | Insufficient VRAM |
| RTX 5080 (16GB) | N/A | Insufficient VRAM |
| RTX 5090 (32GB) | ~32-38 | Single GPU |
| 2x RTX 3090 (48GB) | ~25-30 | Tensor parallel |
The RTX 5090 delivers 32-38 tokens per second with Mixtral 8x7B INT4, fast enough for real-time chat. MoE inference is inherently memory-bandwidth-bound during decoding, and the 5090's ~1.8TB/s of GDDR7 bandwidth lets a single card outpace a pair of RTX 3090s in tensor-parallel configuration, which lose throughput to inter-GPU synchronisation. For detailed comparisons, visit our tokens per second benchmark page.
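To reproduce a tokens-per-second figure on your own hardware, time a completion against an OpenAI-compatible endpoint and divide generated tokens by wall-clock time. A minimal sketch, assuming the vLLM server from the setup guide below is running on localhost:8000 (the prompt and token counts are arbitrary, and prefill time is included, so this slightly understates pure decode speed):

```python
import time
import requests  # pip install requests

URL = "http://localhost:8000/v1/completions"  # vLLM's OpenAI-compatible API
payload = {
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    "prompt": "Explain mixture-of-experts routing in two paragraphs.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.perf_counter() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```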
## Setup Guide
Deploy Mixtral 8x7B INT4 with vLLM for production serving:
```bash
# vLLM with AWQ quantised Mixtral
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
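The server exposes an OpenAI-compatible API, so any OpenAI client can talk to it; for example, with the official openai Python package (the api_key value is a placeholder, as vLLM does not require one by default):

```python
from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    messages=[{"role": "user", "content": "What is a mixture-of-experts model?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```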
For quick local testing:
```bash
# Ollama with INT4 Mixtral
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
```
Start with a 4096-token context length and increase to 8192 if your use case requires it. Monitor VRAM usage with `nvidia-smi`, as KV cache consumption grows linearly with context length.
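One way to watch that growth is to poll `nvidia-smi` in a loop; a small sketch (note that vLLM preallocates most of its pool up front via `--gpu-memory-utilization`, so usage appears flat there, while growth is clearly visible under Ollama):

```python
import subprocess
import time

# Query GPU memory via nvidia-smi's machine-readable interface.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    # One output line per GPU; take the first (assumes a single-GPU box).
    line = subprocess.check_output(QUERY, text=True).strip().splitlines()[0]
    used, total = line.split(", ")
    print(f"VRAM: {used} MiB / {total} MiB")
    time.sleep(5)
```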
## Recommended Alternatives
If Mixtral 8x7B in INT4 does not meet your quality bar, consider Mistral 7B in FP16 on the 5090: a weaker model overall, but it runs unquantised at much higher speed (~95 tok/s). LLaMA 3 70B offers stronger dense-model reasoning without the MoE architecture, but at INT4 its weights alone come to roughly 35GB, so it does not fit on a single 5090; it needs a second GPU or a more aggressive ~3-bit quantisation with a noticeable quality cost.
For running multiple models simultaneously, see the RTX 5090 multi-LLM guide. For voice AI pipelines, check DeepSeek + Whisper on the 5090. For image generation, see the Flux.1 FP16 on 5090 analysis. Browse all options on our dedicated GPU hosting page or in the GPU Comparisons category.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers