Mixtral 8x7B and Consumer GPUs
Mixtral 8x7B is a Mixture-of-Experts model with 46.7 billion total parameters but only ~12.9 billion active per forward pass. At FP16 the full model weighs in at roughly 93 GB — far too large for any single consumer GPU. Quantisation is the only way to fit Mixtral on a dedicated GPU server with consumer cards. This guide shows exactly which formats work on which hardware, with real speed numbers.
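The headline figures above are simple arithmetic: parameters times bytes per parameter. A minimal back-of-envelope sketch (real quantised files run a few GB larger than these raw figures because of quantisation metadata such as scales and zero points, plus layers kept at higher precision):

```python
# Back-of-envelope weight footprint for Mixtral 8x7B at different precisions.
# 46.7B total parameters; ~12.9B are active per token (2 of 8 experts + shared layers).
TOTAL_PARAMS = 46.7e9

def model_size_gb(params: float, bits_per_param: float) -> float:
    """Raw weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(TOTAL_PARAMS, bits):.1f} GB")
```

FP16 comes out at 93.4 GB, matching the ~93 GB quoted above; the INT8 and INT4 raw figures land slightly under the file sizes in the table below for the reasons noted.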
For a primer on quantisation formats see our GPTQ vs AWQ vs GGUF guide. For a comparison of how MoE quantisation differs from dense models, the DeepSeek quantisation guide covers similar principles at larger scale.
VRAM Requirements by Format
All expert weights must be resident in GPU memory even though only 2 of 8 experts are active per token, because the router can select any expert at every layer. The table below shows total VRAM required at 4K context (batch size 1).
| Format | Model Size | VRAM (4K context) | Fits On |
|---|---|---|---|
| FP16 | 93 GB | ~96 GB | 2x RTX 5090 or 4x RTX 3090 |
| INT8 (GPTQ 8-bit) | 48 GB | ~51 GB | 2x RTX 3090 or RTX 5090 + RTX 3090 |
| GPTQ 4-bit | 26 GB | ~29 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| AWQ 4-bit | 25 GB | ~28 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| GGUF Q4_K_M | 27 GB | ~30 GB | RTX 5090 (32 GB) |
| GGUF Q3_K_M | 21 GB | ~24 GB | RTX 3090 (24 GB) |
INT4 quantisation brings Mixtral within range of a single RTX 5090 (32 GB). For the RTX 3090, the more aggressive Q3_K_M GGUF variant squeezes in at 24 GB with minimal headroom. For extended contexts, plan for additional VRAM — see our context length VRAM guide.
Speed Benchmarks
Measured with 512 input / 256 output tokens on GigaGPU servers. Multi-GPU configs use vLLM with tensor parallelism.
| GPU Config | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) | GGUF Q4_K_M (tok/s) |
|---|---|---|---|---|
| RTX 3090 (24 GB) | N/A | N/A | N/A | 18 (Q3_K_M) |
| RTX 5090 (32 GB) | N/A | 38 | 36 | 30 |
| 2x RTX 3090 (48 GB) | N/A | 32 | 30 | 25 |
| 2x RTX 5090 (64 GB) | 45 | 58 | 55 | 46 |
| 4x RTX 3090 (96 GB) | 30 | 42 | 40 | 34 |
A single RTX 5090 with GPTQ 4-bit delivers 38 tok/s — fast enough for real-time chatbot applications. Because of the MoE architecture, inference speed is limited by memory bandwidth rather than by compute, which makes quantisation doubly beneficial: smaller weights transfer faster.
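The bandwidth-bound claim can be made concrete with a roofline-style estimate: at batch size 1, every generated token must stream the active weights from VRAM, so tokens/s cannot exceed bandwidth divided by active bytes per token. A sketch, assuming an approximate ~1792 GB/s spec for the RTX 5090:

```python
# Roofline-style decoding ceiling for batch-1 inference: each token streams
# the ~12.9B active parameters from VRAM, so tok/s <= bandwidth / active bytes.
ACTIVE_PARAMS = 12.9e9  # 2 of 8 experts + shared layers

def decode_ceiling(bandwidth_gbs: float, bits_per_param: float) -> float:
    """Upper bound on tok/s from memory bandwidth alone."""
    bytes_per_token = ACTIVE_PARAMS * bits_per_param / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# RTX 5090, assumed ~1792 GB/s memory bandwidth
print(f"FP16 ceiling: {decode_ceiling(1792, 16):.0f} tok/s")  # ~69
print(f"INT4 ceiling: {decode_ceiling(1792, 4):.0f} tok/s")   # ~278
```

Measured throughput (38 tok/s) sits well below the ceiling because of router, attention, dequantisation, and kernel-launch overheads; the point of the estimate is the 4x headroom that INT4 opens up over FP16.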
MoE-Specific Quantisation Tips
Quantising MoE models requires extra care compared to dense models:
- Router weights are sensitive: the gating network that selects experts should remain at higher precision (FP16 or INT8). Most quantisation tools handle this automatically, but verify your format preserves router precision.
- Expert quantisation tolerance: individual expert layers tolerate INT4 well. Since only 2 of 8 experts run per token, quantisation error in the experts not selected for a token cannot affect that token's output, so per-token error accumulates over fewer quantised matmuls than in a dense model of similar total size. This is why MoE models tend to quantise better than similarly sized dense models.
- Memory bandwidth bottleneck: with 8 expert weight matrices per MoE layer, Mixtral is severely memory-bandwidth limited. Quantising from FP16 to INT4 cuts the bytes moved per token by roughly 4x, which is why INT4 speed gains are larger than for typical dense models.
- KV cache is standard: the shared attention layers use normal GQA, so KV cache scales identically to a dense model. See our KV cache explainer for details.
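Because the attention layers are standard GQA, the KV cache follows the usual formula. A sketch using Mixtral 8x7B's published configuration (32 layers, 8 KV heads, head dimension 128) with an FP16 cache:

```python
# KV cache footprint for Mixtral 8x7B's GQA attention, FP16 cache.
# Config values from the published model card: 32 layers, 8 KV heads, head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2

def kv_cache_mib(tokens: int) -> float:
    """KV cache size in MiB for a given context length, batch size 1."""
    # K and V each store kv_heads * head_dim values per layer per token.
    per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return tokens * per_token_bytes / 2**20

print(f"{kv_cache_mib(4096):.0f} MiB at 4K context")  # 512 MiB
```

At 4K context the cache is only ~0.5 GB — most of the gap between model size and the VRAM column in the table above is activations and runtime buffers. The cache scales linearly with context length, so 32K context would need ~4 GB.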
Recommended GPU Configurations
| Budget | GPU Config | Best Format | Expected Speed |
|---|---|---|---|
| Entry | RTX 3090 (24 GB) | GGUF Q3_K_M | ~18 tok/s |
| Mid-range | RTX 5090 (32 GB) | GPTQ 4-bit | ~38 tok/s |
| Performance | 2x RTX 5090 (64 GB) | GPTQ 4-bit or FP16 | 45-58 tok/s |
| Maximum quality | 4x RTX 3090 (96 GB) | FP16 | ~30 tok/s |
For broader GPU comparisons, see our best GPU for LLM inference roundup. Browse all model guides in the Model Guides category.
Conclusion
Mixtral 8x7B is one of the best-value MoE models when quantised to INT4 — it delivers performance comparable to much larger dense models while fitting on a single 32 GB GPU. GPTQ 4-bit is the speed leader for GPU serving, while GGUF Q3_K_M is the only option for squeezing onto a 24 GB card. The MoE architecture actually quantises better than dense models, so you lose very little quality even at aggressive compression levels.
Run Mixtral 8x7B on Consumer GPUs
Dedicated GPU servers from a single RTX 3090 to multi-GPU clusters, configured for MoE model inference.
Browse GPU Servers