
Mixtral 8x7B Quantization: Fitting MoE on Consumer GPUs

How to fit Mixtral 8x7B on consumer GPUs using GPTQ, AWQ, and GGUF quantisation, with speed benchmarks, VRAM tables, and MoE-specific optimisation tips.

Mixtral 8x7B and Consumer GPUs

Mixtral 8x7B is a Mixture-of-Experts model with 46.7 billion total parameters but only ~12.9 billion active per forward pass. At FP16 the full model weighs in at roughly 93 GB — far too large for any single consumer GPU. Quantisation is the only way to fit Mixtral on a dedicated GPU server with consumer cards. This guide shows exactly which formats work on which hardware, with real speed numbers.
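As a quick sanity check, the FP16 footprint follows directly from the parameter count at 2 bytes per parameter. A back-of-envelope sketch:

```python
# Rough check of the FP16 figures quoted above.
TOTAL_PARAMS = 46.7e9   # all 8 experts plus shared attention/embedding layers
ACTIVE_PARAMS = 12.9e9  # ~2 of 8 experts active per forward pass
BYTES_FP16 = 2

total_gb = TOTAL_PARAMS * BYTES_FP16 / 1e9
active_gb = ACTIVE_PARAMS * BYTES_FP16 / 1e9

print(f"FP16 weights: {total_gb:.1f} GB")       # ~93.4 GB, matching the ~93 GB above
print(f"Active weights per token: {active_gb:.1f} GB")
```

Note that although only ~25.8 GB of weights are read per token, the full ~93 GB must still be resident in VRAM, because any expert can be selected for the next token.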

For a primer on quantisation formats see our GPTQ vs AWQ vs GGUF guide. For a comparison of how MoE quantisation differs from dense models, the DeepSeek quantisation guide covers similar principles at larger scale.

VRAM Requirements by Format

All expert weights must be loaded into GPU memory even though only 2 of 8 experts are active per token. The table below shows total VRAM required at 4K context (batch size 1).

| Format | Model Size | VRAM (4K context) | Fits On |
|---|---|---|---|
| FP16 | 93 GB | ~96 GB | 2x RTX 5090 or 4x RTX 3090 |
| INT8 (GPTQ 8-bit) | 48 GB | ~51 GB | 2x RTX 3090 or RTX 5090 + RTX 3090 |
| GPTQ 4-bit | 26 GB | ~29 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| AWQ 4-bit | 25 GB | ~28 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| GGUF Q4_K_M | 27 GB | ~30 GB | RTX 5090 (32 GB) |
| GGUF Q3_K_M | 21 GB | ~24 GB | RTX 3090 (24 GB) |

INT4 quantisation brings Mixtral within range of a single RTX 5090 (32 GB). For the RTX 3090, the more aggressive Q3_K_M GGUF variant squeezes in at 24 GB with minimal headroom. For extended contexts, plan for additional VRAM — see our context length VRAM guide.
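The model sizes in the table can be approximated by scaling the FP16 footprint by bits per weight. Real files deviate by a few GB because some tensors (router, embeddings) stay at higher precision and group-wise scales add metadata, so treat this as a rough estimator only (the effective bits-per-weight figures for the GGUF K-quants below are assumptions, not exact values):

```python
# Approximate model file size by scaling the FP16 footprint down to the
# quantised bit width. Real GPTQ/AWQ/GGUF files run a few GB larger because
# router and embedding tensors stay at higher precision and group-wise
# scale/zero-point metadata adds overhead.
FP16_GB = 93.4

def approx_size_gb(bits_per_weight: float) -> float:
    return FP16_GB * bits_per_weight / 16

# Effective bits-per-weight values for K-quants are approximate.
for name, bpw in [("INT8", 8.0), ("INT4 (GPTQ/AWQ)", 4.0),
                  ("GGUF Q4_K_M", 4.85), ("GGUF Q3_K_M", 3.9)]:
    print(f"{name}: ~{approx_size_gb(bpw):.0f} GB")
```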

Speed Benchmarks

Measured with 512 input / 256 output tokens on GigaGPU servers. Multi-GPU configs use vLLM with tensor parallelism.

| GPU Config | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) | GGUF Q4_K_M (tok/s) |
|---|---|---|---|---|
| RTX 3090 (24 GB) | N/A | N/A | N/A | 18 (Q3_K_M) |
| RTX 5090 (32 GB) | N/A | 38 | 36 | 30 |
| 2x RTX 3090 (48 GB) | N/A | 32 | 30 | 25 |
| 2x RTX 5090 (64 GB) | 45 | 58 | 55 | 46 |
| 4x RTX 3090 (96 GB) | 30 | 42 | 40 | 34 |

A single RTX 5090 with GPTQ 4-bit delivers 38 tok/s — fast enough for real-time chatbot applications. The MoE architecture means that inference speed is more limited by memory bandwidth than compute, making quantisation doubly beneficial: smaller weights transfer faster.
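The bandwidth point can be made concrete: at batch size 1, each decoded token must stream the weights of the two selected experts plus the shared layers from VRAM, so bytes moved per token shrink in direct proportion to bit width. A rough illustration, treating the ~12.9B active parameters as the per-token working set (actual speedups are smaller than the raw 4x ratio because dequantisation overhead, attention compute, and KV-cache reads are unaffected):

```python
# Bytes streamed from VRAM per decoded token for the active weights only.
ACTIVE_PARAMS = 12.9e9  # shared layers + 2 of 8 experts

def weight_bytes_per_token_gb(bits_per_weight: float) -> float:
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_bytes_per_token_gb(16)  # ~25.8 GB streamed per token
int4 = weight_bytes_per_token_gb(4)   # ~6.45 GB streamed per token
print(f"FP16: {fp16:.2f} GB/token, INT4: {int4:.2f} GB/token "
      f"({fp16 / int4:.0f}x less traffic)")
```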

MoE-Specific Quantisation Tips

Quantising MoE models requires extra care compared to dense models:

  • Router weights are sensitive: the gating network that selects experts should remain at higher precision (FP16 or INT8). Most quantisation tools handle this automatically, but verify your format preserves router precision.
  • Expert quantisation tolerance: individual expert layers tolerate INT4 well because only 2 of 8 experts run per token; quantisation error in an expert only affects the tokens actually routed to it. This makes MoE models quantise better than similarly sized dense models.
  • Memory bandwidth bottleneck: with 8 expert weight matrices resident in VRAM, Mixtral is severely memory-bandwidth limited. Quantising from FP16 to INT4 cuts the bytes streamed per token to a quarter, which is why INT4 speed gains are larger than on typical dense models.
  • KV cache is standard: the shared attention layers use normal GQA, so KV cache scales identically to a dense model. See our KV cache explainer for details.
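To see why unused experts tolerate quantisation error, consider the routing step itself: the gate scores all experts, picks the top-2 per token, and only those two contribute to the output. A minimal NumPy sketch of top-2 routing (illustrative shapes, not Mixtral's real dimensions; the gate is kept in float32 per the tip above):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 8, 2, 16

x = rng.standard_normal(D).astype(np.float32)  # one token's hidden state
gate_w = rng.standard_normal((NUM_EXPERTS, D)).astype(np.float32)  # router stays FP32
experts = [rng.standard_normal((D, D)).astype(np.float32) for _ in range(NUM_EXPERTS)]

logits = gate_w @ x
top = np.argsort(logits)[-TOP_K:]  # indices of the 2 selected experts
gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the top-2

# Only the selected experts' weights are ever read for this token;
# quantisation error in the other 6 experts cannot influence its output.
y = sum(g * (experts[i].T @ x) for g, i in zip(gates, top))
print(y.shape)
```

Because the gate's logits decide which weights get read at all, keeping the router at higher precision protects expert selection, while the expert matrices themselves can absorb INT4 error.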

Recommended GPU Configurations

| Budget | GPU Config | Best Format | Expected Speed |
|---|---|---|---|
| Entry | RTX 3090 (24 GB) | GGUF Q3_K_M | ~18 tok/s |
| Mid-range | RTX 5090 (32 GB) | GPTQ 4-bit | ~38 tok/s |
| Performance | 2x RTX 5090 (64 GB) | GPTQ 4-bit or FP16 | 45-58 tok/s |
| Maximum quality | 4x RTX 3090 (96 GB) | FP16 | ~30 tok/s |

For broader GPU comparisons, see our best GPU for LLM inference roundup. Browse all model guides in the Model Guides category.

Conclusion

Mixtral 8x7B is one of the best-value MoE models when quantised to INT4 — it delivers performance comparable to much larger dense models while fitting on a single 32 GB GPU. GPTQ 4-bit is the speed leader for GPU serving, while GGUF Q3_K_M is the only option for squeezing onto a 24 GB card. The MoE architecture actually quantises better than dense models, so you lose very little quality even at aggressive compression levels.

Run Mixtral 8x7B on Consumer GPUs

Dedicated GPU servers from a single RTX 3090 to multi-GPU clusters, configured for MoE model inference.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
