
Can RTX 5090 Run Mixtral 8x7B?


Yes, the RTX 5090 can run Mixtral 8x7B in INT4 quantisation. With 32GB GDDR7 VRAM, the RTX 5090 fits the Mixtral 8x7B mixture-of-experts model when quantised to 4-bit. The full FP16 model requires approximately 90GB and will not fit, but INT4 variants run well on this card with good inference speed.

The Short Answer

YES in INT4 (~26GB). NO in FP16 (~90GB) or INT8 (~48GB).

Mixtral 8x7B is a sparse mixture-of-experts model with 46.7B total parameters, of which roughly 12.9B are active per token (the router selects 2 of 8 experts per layer). Despite only activating a fraction of parameters during inference, ALL weights must be loaded into VRAM because the router dynamically selects which experts to use for each token. In FP16, that means ~90GB for the weights alone. INT4 quantisation brings this to roughly 24-26GB, which the RTX 5090 handles with enough room left for the KV cache. For a detailed overview, see our best GPU for LLM inference guide.
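As a sanity check, the weight figures follow directly from parameter count times bytes per parameter. A minimal sketch (quantisation metadata such as scales and group indices is ignored, so real checkpoint files run slightly larger than these raw estimates):

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameter count x bytes per parameter."""
    # 1e9 params * (bits / 8) bytes per param / 1e9 bytes per GB
    return params_billions * bits_per_weight / 8

# Mixtral 8x7B: ~46.7B total parameters, all resident regardless of routing.
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{label}: ~{weight_vram_gb(46.7, bits):.0f} GB")
```

This reproduces the ballpark figures used throughout this guide: ~93GB at FP16, ~47GB at INT8, ~23GB at INT4 before quantisation overhead.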

VRAM Analysis

| Configuration | Weights | KV Cache (4K ctx) | Total | RTX 5090 (32GB) |
|---|---|---|---|---|
| Mixtral 8x7B FP16 | ~90GB | ~2GB | ~92GB | No |
| Mixtral 8x7B INT8 | ~46GB | ~2GB | ~48GB | No |
| Mixtral 8x7B INT4 (AWQ) | ~24GB | ~2GB | ~26GB | Fits |
| Mixtral 8x7B INT4 (GPTQ) | ~25GB | ~2GB | ~27GB | Fits |
| Mixtral 8x7B INT4 (8K ctx) | ~24GB | ~4GB | ~28GB | Fits |

The AWQ format is recommended as it is slightly more compact and typically faster on consumer GPUs. Even at 8K context length, the model fits within 32GB with 4GB of headroom. The RTX 5090 is one of the few single-GPU options that makes Mixtral 8x7B practical.
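The KV cache figures can likewise be estimated from the model's attention configuration. A sketch assuming FP16 cache entries and Mixtral's published config (32 layers, head dimension 128, 8 KV heads under grouped-query attention); note the ~2GB used in the table is deliberately conservative relative to this GQA estimate, leaving slack for activation buffers and allocator overhead:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x bytes."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Mixtral 8x7B attention config: 32 layers, 8 KV heads, head_dim 128.
print(f"4K ctx: ~{kv_cache_gb(32, 8, 128, 4096):.2f} GB")
print(f"8K ctx: ~{kv_cache_gb(32, 8, 128, 8192):.2f} GB")
```

The formula also makes the linear scaling explicit: doubling the context length doubles the cache.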

Performance Benchmarks

| GPU | Mixtral 8x7B INT4 (tok/s) | Notes |
|---|---|---|
| RTX 3090 (24GB) | N/A | Insufficient VRAM |
| RTX 5080 (16GB) | N/A | Insufficient VRAM |
| RTX 5090 (32GB) | ~32-38 | Single GPU |
| 2x RTX 3090 (48GB) | ~25-30 | Tensor parallel |

The RTX 5090 delivers 32-38 tokens per second with Mixtral 8x7B INT4, which is fast enough for real-time chat. MoE inference is inherently memory-bandwidth-bound, and the 5090's GDDR7 bandwidth lets a single card outpace even dual RTX 3090s running in tensor parallel, which lose throughput to inter-GPU communication overhead. For detailed comparisons, visit our tokens per second benchmark page.
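To put the throughput range in context, here is a quick sketch of what it means for interactive latency (time-to-first-token and prompt processing are ignored; the 150-token reply length is an illustrative assumption):

```python
def generation_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a reply of a given length."""
    return reply_tokens / tokens_per_second

# A typical ~150-token chat reply at the 5090's measured range:
for tps in (32, 38):
    print(f"{tps} tok/s -> {generation_seconds(150, tps):.1f} s per reply")
```

At roughly 4-5 seconds per full reply, with tokens streaming as they are generated, the experience is comfortably real-time.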

Setup Guide

Deploy Mixtral 8x7B INT4 with vLLM for production serving:

```bash
# vLLM with AWQ quantised Mixtral
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
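Once the server is up, it exposes an OpenAI-compatible API on the configured port. A minimal client sketch using only the Python standard library (the model name must match what was passed to `vllm serve`; the base URL assumes the flags above):

```python
import json
import urllib.request

def build_chat_payload(prompt: str,
                       model: str = "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to the vLLM OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works the same way by pointing its base URL at the server.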

For quick local testing:

```bash
# Ollama with INT4 Mixtral
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
```

Start with 4096 context length and increase to 8192 if your use case requires it. Monitor VRAM usage with nvidia-smi as longer contexts increase KV cache consumption linearly.

If Mixtral 8x7B in INT4 does not meet your quality bar, consider running the smaller Mistral 7B in FP16 on the 5090 for better per-token quality at faster speeds (~95 tok/s). Llama 3 70B is another alternative with comparable reasoning quality and no MoE architecture, but note that its INT4 weights alone come to roughly 40GB: on a single 32GB card it requires more aggressive ~3-bit quantisation or partial CPU offloading.

For running multiple models simultaneously, see the RTX 5090 multi-LLM guide. For voice AI pipelines, check DeepSeek + Whisper on the 5090. For image generation, see the Flux.1 FP16 on 5090 analysis. Browse all options on our dedicated GPU hosting page or in the GPU Comparisons category.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
