Mixtral 8x22B is Mistral’s largest open MoE: 141B total parameters, with 39B active per token. On dedicated GPU hosting it needs a 96 GB 6000 Pro or a multi-GPU setup. The payoff is the MoE speed profile: decode runs much faster than a dense 141B model would.
VRAM
| Precision | Weights |
|---|---|
| FP16 | ~282 GB (multi-GPU only) |
| FP8 | ~141 GB |
| AWQ INT4 | ~75 GB |
AWQ INT4 just fits on 96 GB, though with limited room for KV cache. For serious concurrency, step up to dual 6000 Pros or accept reduced context.
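The table numbers follow from simple arithmetic. A minimal sketch of the weight and KV-cache math, assuming the publicly reported Mixtral 8x22B dimensions (56 layers, 8 KV heads, head dim 128 — verify against the actual checkpoint config); quantization metadata adds a few GB on top of the raw INT4 figure:

```python
# Back-of-envelope VRAM check for Mixtral 8x22B on a single 96 GB card.
TOTAL_PARAMS_B = 141

def weight_gb(bits_per_param: float) -> float:
    """Raw weight footprint, excluding quantization scales/metadata."""
    return TOTAL_PARAMS_B * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int = 56, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for the key and value tensors; FP16 cache assumed.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

weights = weight_gb(4)   # AWQ INT4: ~70.5 GB raw, ~75 GB with overhead
headroom = 96 - weights  # left over for KV cache, activations, CUDA context
print(f"INT4 weights: {weights:.1f} GB, headroom: ~{headroom:.1f} GB")
print(f"16k-token KV cache: {kv_cache_gb(16384):.1f} GB per sequence")
```

At roughly 3.8 GB of cache per 16k-token sequence, the headroom explains why the single-card config below caps `--max-num-seqs` at 8.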
Deployment
AWQ on a 6000 Pro:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.93 \
--max-num-seqs 8
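Once the server is up it speaks the standard OpenAI chat-completions protocol. A minimal stdlib client sketch, assuming vLLM's default bind of `http://localhost:8000` (adjust the base URL and model name to match your launch flags):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": "mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```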
Dual 6000 Pros with FP8:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x22B-Instruct-v0.1 \
--quantization fp8 \
--tensor-parallel-size 2 \
--max-model-len 32768
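Why FP8 fits here: tensor parallelism shards the weights (and, with the KV heads split across cards, the KV cache) across both GPUs. A rough sketch of the per-card budget, using the FP8 figure from the table above:

```python
# Per-GPU budget under tensor parallelism (weights sharded evenly).
FP8_WEIGHTS_GB = 141  # from the VRAM table
GPU_VRAM_GB = 96

def per_gpu_weights(total_gb: float, tp: int) -> float:
    return total_gb / tp

per_card = per_gpu_weights(FP8_WEIGHTS_GB, 2)  # ~70.5 GB on each card
print(f"{per_card:.1f} GB weights per GPU, "
      f"~{GPU_VRAM_GB - per_card:.1f} GB left for KV cache and activations")
```

That ~25 GB of per-card headroom is what makes the 32k context window workable.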
Speed
MoE shines on decode speed because only 2 of 8 experts activate per token. Effective compute per token is similar to a dense 39B model:
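The routing mechanism behind this can be sketched as a toy top-2-of-8 gate: per token, the router scores all eight experts, keeps the two best, and renormalizes their weights, so only those two expert FFNs (plus the shared layers) run. The logits here are illustrative, not from the real model:

```python
import math

def top2_route(logits: list[float]) -> dict[int, float]:
    """Pick the two highest-scoring experts; softmax over just those two."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    z = [math.exp(logits[i]) for i in top]
    s = sum(z)
    return {i: w / s for i, w in zip(top, z)}

# Example router scores for one token across 8 experts:
weights = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
print(weights)                 # only experts 1 and 4 carry this token
print(f"active fraction of parameters: {39 / 141:.0%}")
```

Per token, compute scales with the ~28% of parameters that are active, which is why decode throughput tracks a dense ~39B model rather than a dense 141B one.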
| Configuration | Batch 1 t/s | Batch 8 t/s agg |
|---|---|---|
| 6000 Pro AWQ INT4 | ~35 | ~180 |
| 2× 6000 Pro FP8 | ~42 | ~280 |
Compare to Llama 3.3 70B (dense), which is often slightly faster per token on the same hardware but scores lower on many benchmarks.
See Mixtral 8x7B for the smaller MoE variant and Llama 3.3 70B as the dense alternative.