
Can RTX 3090 Run Mixtral 8x7B?

The RTX 3090 can run Mixtral 8x7B in INT4 quantisation with its 24GB VRAM. FP16 requires multi-GPU. Full VRAM breakdown and benchmarks inside.

Yes, the RTX 3090 can run Mixtral 8x7B in INT4 quantisation. With 24GB GDDR6X VRAM, the RTX 3090 fits this Mixture-of-Experts model when aggressively quantised, delivering usable inference for Mistral/Mixtral hosting. FP16 requires roughly 90GB and is out of reach for a single card, but INT4 brings it within the 3090’s capabilities.

The Short Answer

YES in INT4 quantisation with moderate context. NO in FP16 or INT8.

Mixtral 8x7B is a Mixture-of-Experts model with approximately 46.7B total parameters (though only ~13B are active per token due to the expert routing). In FP16, the full model needs roughly 90GB of VRAM. In INT8, that drops to about 47GB, still well beyond the RTX 3090's 24GB. In INT4 (GPTQ/AWQ), the weights compress to approximately 24-26GB, which is just over the card's limit on their own — hence the need for partial offloading or a tighter quantisation.
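These weight-memory figures follow directly from the parameter count. A back-of-the-envelope estimator (the ~10% overhead factor for quantisation scales and metadata is an assumption, not a published number):

```python
# Rough weight-memory estimate for Mixtral 8x7B at different precisions.
TOTAL_PARAMS = 46.7e9  # published total parameter count

def weight_vram_gb(bits_per_param: float, overhead: float = 1.0) -> float:
    """Weights only; KV cache and activations come on top."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9 * overhead

print(f"FP16: {weight_vram_gb(16):.1f} GB")      # ~93 GB
print(f"INT8: {weight_vram_gb(8):.1f} GB")       # ~47 GB
print(f"INT4: {weight_vram_gb(4, 1.1):.1f} GB")  # ~26 GB incl. assumed overhead
```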

With GGUF Q4_K_M quantisation and partial CPU offloading, Mixtral 8x7B becomes practical on the 3090 with context windows up to 4096 tokens. The MoE architecture actually helps here since only 2 of 8 experts are active per token, keeping compute efficient even with the large parameter count.
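To make the top-2 routing concrete, here is a toy gating function in Python. The gate logits are made up for illustration; in the real model they come from a learned linear layer over each token's hidden state:

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalise their gates."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]

# 8 experts, but each token only pays the compute cost of 2 of them
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
chosen = top2_route(gate_logits)
# chosen -> experts 1 and 4, with gate weights summing to 1
```

This is why throughput tracks the ~13B active parameters rather than the full 47B: the other six expert FFNs are simply never touched for that token.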

VRAM Analysis

| Quantisation | Model VRAM | KV Cache (4K ctx) | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| FP16 | ~90GB | ~2.5GB | ~92.5GB | No |
| INT8 | ~47GB | ~2.5GB | ~49.5GB | No |
| INT4 (GPTQ) | ~26GB | ~2.5GB | ~28.5GB | Needs offloading |
| Q4_K_M (GGUF) | ~24GB | ~2.5GB | ~26.5GB | Partial offload |
| Q3_K_M (GGUF) | ~20GB | ~2.5GB | ~22.5GB | Fits |

The Q3_K_M quantisation fits entirely in VRAM with room for context, but quality degrades noticeably at 3-bit. Q4_K_M is the better quality option with some layers offloaded to system RAM. With fast DDR5 system memory, the offloading penalty is tolerable. See the Mixtral VRAM requirements guide for full details.
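The KV cache grows linearly with context length. A generic estimator, using Mixtral's published attention geometry (32 layers, 128-dim heads, 8 KV heads via grouped-query attention) — note that the exact footprint depends on the runtime's cache layout and dtype, so treat the table's ~2.5GB column as a conservative budget rather than an exact figure:

```python
def kv_cache_gb(ctx, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """FP16 KV cache size in GB; the leading 2x covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * ctx / 1e9

print(f"{kv_cache_gb(4096):.2f} GB")                  # GQA layout: ~0.5 GB
print(f"{kv_cache_gb(4096, n_kv_heads=32):.2f} GB")   # full 32-head layout: ~2.1 GB
```

Either way, doubling the context window doubles this term, which is why longer contexts push the 3090 toward more aggressive offloading.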

Performance Benchmarks

| GPU | Quantisation | Tokens/sec (output) | Context |
|---|---|---|---|
| RTX 3090 (24GB) | Q4_K_M (partial offload) | ~15 tok/s | 4096 |
| RTX 3090 (24GB) | Q3_K_M (full GPU) | ~20 tok/s | 4096 |
| RTX 5090 (32GB) | Q4_K_M (full GPU) | ~35 tok/s | 8192 |
| 2x RTX 3090 | INT8 | ~28 tok/s | 8192 |

At 15-20 tok/s, Mixtral 8x7B on the RTX 3090 is usable for interactive chat. The MoE routing means compute scales with active parameters (13B), not total parameters (47B), so throughput is better than you might expect from the model size. Full speed data is on our benchmarks page.
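To put those rates in user-facing terms, a quick latency estimate for a typical chat reply (the 500-token reply length is an illustrative assumption):

```python
def response_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Time to stream a reply of n_tokens at a given output rate."""
    return n_tokens / tok_per_s

# A ~500-token reply at the 3090's measured rates:
print(f"{response_seconds(500, 15):.0f}s at 15 tok/s")  # ~33s
print(f"{response_seconds(500, 20):.0f}s at 20 tok/s")  # 25s
```

Since tokens stream as they are generated, the perceived wait is the time to first token plus reading speed, which is why 15-20 tok/s feels responsive in practice.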

Setup Guide

llama.cpp via Ollama handles the partial offloading transparently:

# Ollama: Automatic quantisation and memory management
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M

# For more control with llama.cpp directly
./llama-server \
  -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -ngl 28 \
  -c 4096 \
  --host 0.0.0.0 --port 8080

The -ngl 28 flag offloads 28 of the model's 32 layers to the GPU, keeping 4 on the CPU to stay within 24GB. Adjust this number while monitoring VRAM: if nvidia-smi shows usage around 22-23GB, you are at a good balance between speed and headroom.
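You can estimate a starting -ngl value instead of trial-and-error. A hedged sketch that assumes layers are roughly equal in size (they are not exactly, so verify with nvidia-smi afterwards); the 2.5GB reservation for KV cache and overhead is an assumption from the table above:

```python
def max_gpu_layers(model_gb: float, n_layers: int,
                   vram_gb: float, reserved_gb: float) -> int:
    """How many transformer layers fit on the GPU, assuming equal layer sizes."""
    per_layer = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserved_gb) // per_layer))

# Q4_K_M Mixtral (~24GB of weights), 32 layers, 24GB card, ~2.5GB reserved
print(max_gpu_layers(24.0, 32, 24.0, 2.5))  # -> 28, matching -ngl 28 above
```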

For vLLM with a pre-quantised GPTQ model:

vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
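Once serving, vLLM exposes an OpenAI-compatible API on that port. A minimal stdlib-only client sketch — the model name and endpoint match the command above, while the prompt and max_tokens are illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Build an OpenAI-compatible chat completion request for the vLLM server."""
    payload = {
        "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain mixture-of-experts in one paragraph.")
# With the server running, send it like so:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```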

If you want Mixtral 8x7B with full VRAM residency and longer context, the RTX 5090 with 32GB fits the Q4_K_M model entirely in VRAM with room for 8K+ context. For FP16 or INT8 precision, dual-GPU setups through our dedicated GPU servers are the path forward.

If Mixtral is too large for your use case, consider running Mistral 7B on this card instead, which fits in FP16 with generous context. See whether the RTX 3090 can run LLaMA 3 8B in FP16 or check the RTX 3090 CodeLlama 34B guide for coding workloads. Our best GPU for LLM inference guide covers all options across the GPU range.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
