
Mixtral 8x22B on a Dedicated GPU

Mistral's Mixtral 8x22B is a 141B-total / 39B-active mixture-of-experts model that needs serious VRAM, but quantised it fits a 96GB card with a useful speed advantage.

Mixtral 8x22B is Mistral’s largest open MoE – 141B total parameters, 39B active per token. On dedicated GPU hosting it needs a 6000 Pro 96GB or a multi-GPU setup. The payoff is MoE’s speed profile: much faster than a dense 141B would be.


VRAM

| Precision | Weights |
|---|---|
| FP16 | ~282 GB (multi-GPU only) |
| FP8 | ~141 GB |
| AWQ INT4 | ~75 GB |

AWQ INT4 just fits on 96 GB, leaving only limited room for KV cache. For serious concurrency, step up to dual 6000 Pros or accept reduced context.
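The weight footprints above follow directly from parameter count × bits per parameter. A quick sketch (the 4.25 bits/param figure for AWQ is an assumption that folds in group scales; real deployments also need headroom for KV cache and activations):

```python
# Rough weight-memory estimate for Mixtral 8x22B.
# All experts must be resident, so total params (141B) matter for VRAM,
# even though only 39B are active per token.
TOTAL_PARAMS_B = 141  # total parameters, in billions

def weight_vram_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (ignores KV cache and runtime overhead)."""
    return params_b * bits_per_param / 8

print(f"FP16:     ~{weight_vram_gb(TOTAL_PARAMS_B, 16):.0f} GB")    # ~282 GB
print(f"FP8:      ~{weight_vram_gb(TOTAL_PARAMS_B, 8):.0f} GB")     # ~141 GB
print(f"AWQ INT4: ~{weight_vram_gb(TOTAL_PARAMS_B, 4.25):.0f} GB")  # ~75 GB
```

On a 96 GB card the INT4 figure leaves roughly 20 GB for KV cache, which is why the single-GPU deployment below caps context length and concurrent sequences.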

Deployment

AWQ on a 6000 Pro:

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --max-num-seqs 8

Dual 6000 Pros with FP8:

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768
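Either launch exposes the standard OpenAI-compatible API. A minimal stdlib-only client sketch, assuming the single-GPU AWQ deployment above is listening on localhost:8000 (the base URL and model name are assumptions; adjust to your server):

```python
# Minimal client for a vLLM OpenAI-compatible endpoint. No third-party
# dependencies; swaps the openai SDK for stdlib urllib.
import json
import urllib.request

def build_chat_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    # Model name must match the --model flag used to start the server.
    payload = build_chat_request(prompt, "mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI-compatible SDK or framework can point at the same endpoint by overriding its base URL.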

Speed

MoE shines on decode speed because only 2 of 8 experts activate per token. Effective compute per token is similar to a dense 39B model:

| Configuration | Batch 1 t/s | Batch 8 aggregate t/s |
|---|---|---|
| 6000 Pro AWQ INT4 | ~35 | ~180 |
| 2× 6000 Pro FP8 | ~42 | ~280 |

Compare with Llama 3.3 70B (dense), which is often slightly faster per token on the same hardware but scores lower on many benchmarks.
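The MoE advantage can be put in numbers with a simple memory-bound decode model: single-stream decode speed is roughly effective bandwidth divided by bytes of weights read per token, and an MoE only reads its active experts. The bandwidth and efficiency figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope decode-speed model. Single-stream decode is typically
# memory-bandwidth bound, so t/s ~ effective bandwidth / bytes per token.
def decode_tps(active_params_b: float, bytes_per_param: float,
               bandwidth_gbs: float, efficiency: float = 0.6) -> float:
    """Estimated decode tokens/sec for a memory-bound single stream."""
    bytes_per_token_gb = active_params_b * bytes_per_param
    return bandwidth_gbs * efficiency / bytes_per_token_gb

# Mixtral 8x22B reads ~39B active params per token; a hypothetical dense
# 141B model would read all 141B. Bandwidth and INT4 byte width are assumed.
moe = decode_tps(39, 0.53, 1800)
dense = decode_tps(141, 0.53, 1800)
print(f"MoE ~{moe:.0f} t/s vs dense 141B ~{dense:.0f} t/s")
```

The 141/39 ratio gives the MoE a ~3.6× single-stream edge over an equally sized dense model, which is the "speed profile" referred to above; at higher batch sizes the gap narrows as more experts are touched per step.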

Flagship MoE Hosting

Mixtral 8x22B on UK dedicated 96GB or dual-GPU hardware.

Browse GPU Servers

See Mixtral 8x7B for the smaller MoE variant and Llama 3.3 70B as the dense alternative.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
