Yes, the RTX 5090 can run Mixtral 8x7B in INT4 quantisation. With 32GB of GDDR7 VRAM, the RTX 5090 fits the Mixtral 8x7B mixture-of-experts model when quantised to 4-bit. The full FP16 model requires approximately 93GB and will not fit, but INT4 variants run well on this card at 32-38 tokens per second.
## The Short Answer
YES in INT4 (~26GB). NO in FP16 (~93GB) or INT8 (~48GB).
Mixtral 8x7B is a sparse mixture-of-experts model with 46.7B total parameters, of which roughly 12.9B are active per token (2 of 8 experts). Despite activating only a fraction of its parameters per token, ALL weights must be loaded into VRAM, because the router selects experts dynamically and any of the 8 may be needed for the next token. In FP16 that means ~93GB of weights. INT4 quantisation brings this down to roughly 24-26GB, which the RTX 5090 handles with enough room left for the KV cache. For a detailed overview, see our best GPU for LLM inference guide.
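The VRAM requirement follows directly from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic (the ~46.7B figure is from the Mixtral release; quantised formats add a few percent on top for scale metadata):

```python
# Back-of-the-envelope weight memory: parameter count x bytes per parameter.
# Mixtral 8x7B has ~46.7B total parameters, ALL of which must sit in VRAM.
TOTAL_PARAMS = 46.7e9

for precision, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")

# FP16: ~93 GB   INT8: ~47 GB   INT4: ~23 GB
# Real INT4 files land at ~24-26GB once quantisation scales and
# zero-points are included.
```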
## VRAM Analysis
| Configuration | Weights | KV Cache (4K ctx) | Total | RTX 5090 (32GB) |
|---|---|---|---|---|
| Mixtral 8x7B FP16 | ~93GB | ~2GB | ~95GB | No |
| Mixtral 8x7B INT8 | ~46GB | ~2GB | ~48GB | No |
| Mixtral 8x7B INT4 (AWQ) | ~24GB | ~2GB | ~26GB | Fits |
| Mixtral 8x7B INT4 (GPTQ) | ~25GB | ~2GB | ~27GB | Fits |
| Mixtral 8x7B INT4 (8K ctx) | ~24GB | ~4GB | ~28GB | Fits |
The AWQ format is recommended as it is slightly more compact and typically faster on consumer GPUs. Even at 8K context length, the model fits within 32GB with 4GB of headroom. The RTX 5090 is one of the few single-GPU options that makes Mixtral 8x7B practical.
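KV cache growth can be estimated from the model's attention configuration (32 layers, 8 grouped-query KV heads, head dimension 128, per the published Mixtral config). The sketch below shows the raw tensor maths; serving engines such as vLLM preallocate a paged KV pool and need activation workspace on top of this, which is why the table budgets ~2GB rather than the raw ~0.5GB:

```python
# Raw KV cache size: 2 (K and V) x layers x KV heads x head_dim x bytes/elem.
# Config values are from the published Mixtral 8x7B architecture (GQA).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_gb(context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 128 KiB/token
    return context_tokens * per_token / 1e9

print(f"4K context: ~{kv_cache_gb(4096):.2f} GB raw K/V")  # ~0.54 GB
print(f"8K context: ~{kv_cache_gb(8192):.2f} GB raw K/V")  # ~1.07 GB
# Budget extra for paged-cache preallocation, activations, and CUDA context.
```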
## Performance Benchmarks
| GPU | Mixtral 8x7B INT4 (tok/s) | Notes |
|---|---|---|
| RTX 3090 (24GB) | N/A | Insufficient VRAM |
| RTX 5080 (16GB) | N/A | Insufficient VRAM |
| RTX 5090 (32GB) | ~32-38 | Single GPU |
| 2x RTX 3090 (48GB) | ~25-30 | Tensor parallel |
The RTX 5090 delivers 32-38 tokens per second with Mixtral 8x7B INT4, fast enough for real-time chat. MoE inference is inherently memory-bandwidth-bound during decoding, and the 5090's ~1.8TB/s of GDDR7 bandwidth lets a single card outpace a pair of RTX 3090s in tensor-parallel configuration, which lose throughput to inter-GPU synchronisation. For detailed comparisons, visit our tokens per second benchmark page.
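To reproduce a tokens-per-second figure on your own hardware, time a completion against an OpenAI-compatible endpoint and divide generated tokens by wall-clock time. A minimal sketch, assuming the vLLM server from the setup guide below is running on localhost:8000 (the prompt and token counts are arbitrary, and prefill time is included, so this slightly understates pure decode speed):

```python
import time
import requests  # pip install requests

URL = "http://localhost:8000/v1/completions"  # vLLM's OpenAI-compatible API
payload = {
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    "prompt": "Explain mixture-of-experts routing in two paragraphs.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.perf_counter() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```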
## Setup Guide
Deploy Mixtral 8x7B INT4 with vLLM for production serving:
```bash
# vLLM with AWQ quantised Mixtral
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
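The server exposes an OpenAI-compatible API, so any OpenAI client can talk to it; for example, with the official openai Python package (the api_key value is a placeholder, as vLLM does not require one by default):

```python
from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    messages=[{"role": "user", "content": "What is a mixture-of-experts model?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```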
For quick local testing:
```bash
# Ollama with INT4 Mixtral
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
```
Start with a 4096-token context length and increase to 8192 if your use case requires it. Monitor VRAM usage with `nvidia-smi`, as KV cache consumption grows linearly with context length.
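One way to watch that growth is to poll `nvidia-smi` in a loop; a small sketch (note that vLLM preallocates most of its pool up front via `--gpu-memory-utilization`, so usage appears flat there, while growth is clearly visible under Ollama):

```python
import subprocess
import time

# Query GPU memory via nvidia-smi's machine-readable interface.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

while True:
    # One output line per GPU; take the first (assumes a single-GPU box).
    line = subprocess.check_output(QUERY, text=True).strip().splitlines()[0]
    used, total = line.split(", ")
    print(f"VRAM: {used} MiB / {total} MiB")
    time.sleep(5)
```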
## Recommended Alternatives
If Mixtral 8x7B in INT4 does not meet your quality bar, consider Mistral 7B in FP16 on the 5090: a weaker model overall, but it runs unquantised at much higher speed (~95 tok/s). LLaMA 3 70B offers stronger dense-model reasoning without the MoE architecture, but at INT4 its weights alone come to roughly 35GB, so it does not fit on a single 5090; it needs a second GPU or a more aggressive ~3-bit quantisation with a noticeable quality cost.
For running multiple models simultaneously, see the RTX 5090 multi-LLM guide. For voice AI pipelines, check DeepSeek + Whisper on the 5090. For image generation, see the Flux.1 FP16 on 5090 analysis. Browse all options on our dedicated GPU hosting page or in the GPU Comparisons category.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers