Mixtral 8x22B is Mistral’s largest open MoE: 141B total parameters, with 39B active per token. On dedicated GPU hosting it needs a 96 GB 6000 Pro or a multi-GPU setup. The payoff is the MoE speed profile: decode runs much faster than a dense 141B model would.
VRAM
| Precision | Weights |
|---|---|
| FP16 | ~282 GB (multi-GPU only) |
| FP8 | ~141 GB |
| AWQ INT4 | ~75 GB |
AWQ INT4 just fits on 96 GB, though with limited room for KV cache. For serious concurrency, step up to dual 6000 Pros or accept reduced context.
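The table numbers follow from simple arithmetic. A minimal sketch of the weight and KV-cache math, assuming the publicly reported Mixtral 8x22B dimensions (56 layers, 8 KV heads, head dim 128 — verify against the actual checkpoint config); quantization metadata adds a few GB on top of the raw INT4 figure:

```python
# Back-of-envelope VRAM check for Mixtral 8x22B on a single 96 GB card.
TOTAL_PARAMS_B = 141

def weight_gb(bits_per_param: float) -> float:
    """Raw weight footprint, excluding quantization scales/metadata."""
    return TOTAL_PARAMS_B * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int = 56, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for the key and value tensors; FP16 cache assumed.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

weights = weight_gb(4)   # AWQ INT4: ~70.5 GB raw, ~75 GB with overhead
headroom = 96 - weights  # left over for KV cache, activations, CUDA context
print(f"INT4 weights: {weights:.1f} GB, headroom: ~{headroom:.1f} GB")
print(f"16k-token KV cache: {kv_cache_gb(16384):.1f} GB per sequence")
```

At roughly 3.8 GB of cache per 16k-token sequence, the headroom explains why the single-card config below caps `--max-num-seqs` at 8.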
Deployment
AWQ on a 6000 Pro:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.93 \
--max-num-seqs 8
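Once the server is up it speaks the standard OpenAI chat-completions protocol. A minimal stdlib client sketch, assuming vLLM's default bind of `http://localhost:8000` (adjust the base URL and model name to match your launch flags):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": "mistralai/Mixtral-8x22B-Instruct-v0.1-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```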
Dual 6000 Pros with FP8:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x22B-Instruct-v0.1 \
--quantization fp8 \
--tensor-parallel-size 2 \
--max-model-len 32768
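Why FP8 fits here: tensor parallelism shards the weights (and, with the KV heads split across cards, the KV cache) across both GPUs. A rough sketch of the per-card budget, using the FP8 figure from the table above:

```python
# Per-GPU budget under tensor parallelism (weights sharded evenly).
FP8_WEIGHTS_GB = 141  # from the VRAM table
GPU_VRAM_GB = 96

def per_gpu_weights(total_gb: float, tp: int) -> float:
    return total_gb / tp

per_card = per_gpu_weights(FP8_WEIGHTS_GB, 2)  # ~70.5 GB on each card
print(f"{per_card:.1f} GB weights per GPU, "
      f"~{GPU_VRAM_GB - per_card:.1f} GB left for KV cache and activations")
```

That ~25 GB of per-card headroom is what makes the 32k context window workable.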
Speed
MoE shines on decode speed because only 2 of 8 experts activate per token. Effective compute per token is similar to a dense 39B model:
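The routing mechanism behind this can be sketched as a toy top-2-of-8 gate: per token, the router scores all eight experts, keeps the two best, and renormalizes their weights, so only those two expert FFNs (plus the shared layers) run. The logits here are illustrative, not from the real model:

```python
import math

def top2_route(logits: list[float]) -> dict[int, float]:
    """Pick the two highest-scoring experts; softmax over just those two."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    z = [math.exp(logits[i]) for i in top]
    s = sum(z)
    return {i: w / s for i, w in zip(top, z)}

# Example router scores for one token across 8 experts:
weights = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
print(weights)                 # only experts 1 and 4 carry this token
print(f"active fraction of parameters: {39 / 141:.0%}")
```

Per token, compute scales with the ~28% of parameters that are active, which is why decode throughput tracks a dense ~39B model rather than a dense 141B one.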
| Configuration | Batch 1 t/s | Batch 8 t/s agg |
|---|---|---|
| 6000 Pro AWQ INT4 | ~35 | ~180 |
| 2× 6000 Pro FP8 | ~42 | ~280 |
Compare to Llama 3.3 70B (dense), which is often slightly faster per token on the same hardware but scores lower on many benchmarks.
See Mixtral 8x7B for the smaller MoE variant and Llama 3.3 70B as the dense alternative.