
Mixtral 8x7B Quantization: Fitting MoE on Consumer GPUs

How to fit Mixtral 8x7B on consumer GPUs using GPTQ, AWQ, and GGUF quantisation, with speed benchmarks, VRAM tables, and MoE-specific optimisation tips.

Mixtral 8x7B and Consumer GPUs

Mixtral 8x7B is a Mixture-of-Experts model with 46.7 billion total parameters but only ~12.9 billion active per forward pass. At FP16 the full model weighs in at roughly 93 GB — far too large for any single consumer GPU. Quantisation is the only way to fit Mixtral on a dedicated GPU server with consumer cards. This guide shows exactly which formats work on which hardware, with real speed numbers.
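As a quick sanity check, the FP16 footprint follows directly from the parameter count at 2 bytes per parameter. A back-of-envelope sketch:

```python
# Rough check of the FP16 figures quoted above.
TOTAL_PARAMS = 46.7e9   # all 8 experts plus shared attention/embedding layers
ACTIVE_PARAMS = 12.9e9  # ~2 of 8 experts active per forward pass
BYTES_FP16 = 2

total_gb = TOTAL_PARAMS * BYTES_FP16 / 1e9
active_gb = ACTIVE_PARAMS * BYTES_FP16 / 1e9

print(f"FP16 weights: {total_gb:.1f} GB")       # ~93.4 GB, matching the ~93 GB above
print(f"Active weights per token: {active_gb:.1f} GB")
```

Note that although only ~25.8 GB of weights are read per token, the full ~93 GB must still be resident in VRAM, because any expert can be selected for the next token.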

For a primer on quantisation formats see our GPTQ vs AWQ vs GGUF guide. For a comparison of how MoE quantisation differs from dense models, the DeepSeek quantisation guide covers similar principles at larger scale.

VRAM Requirements by Format

All expert weights must be loaded into GPU memory even though only 2 of 8 experts are active per token. The table below shows total VRAM required at 4K context (batch size 1).

| Format | Model Size | VRAM (4K context) | Fits On |
|---|---|---|---|
| FP16 | 93 GB | ~96 GB | 2x RTX 5090 or 4x RTX 3090 |
| INT8 (GPTQ 8-bit) | 48 GB | ~51 GB | 2x RTX 3090 or RTX 5090 + RTX 3090 |
| GPTQ 4-bit | 26 GB | ~29 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| AWQ 4-bit | 25 GB | ~28 GB | RTX 5090 (32 GB) or 2x RTX 4060 Ti |
| GGUF Q4_K_M | 27 GB | ~30 GB | RTX 5090 (32 GB) |
| GGUF Q3_K_M | 21 GB | ~24 GB | RTX 3090 (24 GB) |

INT4 quantisation brings Mixtral within range of a single RTX 5090 (32 GB). For the RTX 3090, the more aggressive Q3_K_M GGUF variant squeezes in at 24 GB with minimal headroom. For extended contexts, plan for additional VRAM — see our context length VRAM guide.
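The model sizes in the table can be approximated by scaling the FP16 footprint by bits per weight. Real files deviate by a few GB because some tensors (router, embeddings) stay at higher precision and group-wise scales add metadata, so treat this as a rough estimator only (the effective bits-per-weight figures for the GGUF K-quants below are assumptions, not exact values):

```python
# Approximate model file size by scaling the FP16 footprint down to the
# quantised bit width. Real GPTQ/AWQ/GGUF files run a few GB larger because
# router and embedding tensors stay at higher precision and group-wise
# scale/zero-point metadata adds overhead.
FP16_GB = 93.4

def approx_size_gb(bits_per_weight: float) -> float:
    return FP16_GB * bits_per_weight / 16

# Effective bits-per-weight values for K-quants are approximate.
for name, bpw in [("INT8", 8.0), ("INT4 (GPTQ/AWQ)", 4.0),
                  ("GGUF Q4_K_M", 4.85), ("GGUF Q3_K_M", 3.9)]:
    print(f"{name}: ~{approx_size_gb(bpw):.0f} GB")
```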

Speed Benchmarks

Measured with 512 input / 256 output tokens on GigaGPU servers. Multi-GPU configs use vLLM with tensor parallelism.

| GPU Config | FP16 (tok/s) | GPTQ 4-bit (tok/s) | AWQ 4-bit (tok/s) | GGUF Q4_K_M (tok/s) |
|---|---|---|---|---|
| RTX 3090 (24 GB) | N/A | N/A | N/A | 18 (Q3_K_M) |
| RTX 5090 (32 GB) | N/A | 38 | 36 | 30 |
| 2x RTX 3090 (48 GB) | N/A | 32 | 30 | 25 |
| 2x RTX 5090 (64 GB) | 45 | 58 | 55 | 46 |
| 4x RTX 3090 (96 GB) | 30 | 42 | 40 | 34 |

A single RTX 5090 with GPTQ 4-bit delivers 38 tok/s — fast enough for real-time chatbot applications. The MoE architecture means that inference speed is more limited by memory bandwidth than compute, making quantisation doubly beneficial: smaller weights transfer faster.
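The bandwidth point can be made concrete: at batch size 1, each decoded token must stream the weights of the two selected experts plus the shared layers from VRAM, so bytes moved per token shrink in direct proportion to bit width. A rough illustration, treating the ~12.9B active parameters as the per-token working set (actual speedups are smaller than the raw 4x ratio because dequantisation overhead, attention compute, and KV-cache reads are unaffected):

```python
# Bytes streamed from VRAM per decoded token for the active weights only.
ACTIVE_PARAMS = 12.9e9  # shared layers + 2 of 8 experts

def weight_bytes_per_token_gb(bits_per_weight: float) -> float:
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

fp16 = weight_bytes_per_token_gb(16)  # ~25.8 GB streamed per token
int4 = weight_bytes_per_token_gb(4)   # ~6.45 GB streamed per token
print(f"FP16: {fp16:.2f} GB/token, INT4: {int4:.2f} GB/token "
      f"({fp16 / int4:.0f}x less traffic)")
```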

MoE-Specific Quantisation Tips

Quantising MoE models requires extra care compared to dense models:

  • Router weights are sensitive: the gating network that selects experts should remain at higher precision (FP16 or INT8). Most quantisation tools handle this automatically, but verify your format preserves router precision.
  • Expert quantisation tolerance: individual expert layers tolerate INT4 well because only 2 of 8 experts run per token; quantisation error in an expert only affects the tokens actually routed to it. This makes MoE models quantise better than similarly sized dense models.
  • Memory bandwidth bottleneck: with 8 expert weight matrices resident in VRAM, Mixtral is severely memory-bandwidth limited. Quantising from FP16 to INT4 cuts the bytes streamed per token to a quarter, which is why INT4 speed gains are larger than on typical dense models.
  • KV cache is standard: the shared attention layers use normal GQA, so KV cache scales identically to a dense model. See our KV cache explainer for details.
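To see why unused experts tolerate quantisation error, consider the routing step itself: the gate scores all experts, picks the top-2 per token, and only those two contribute to the output. A minimal NumPy sketch of top-2 routing (illustrative shapes, not Mixtral's real dimensions; the gate is kept in float32 per the tip above):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 8, 2, 16

x = rng.standard_normal(D).astype(np.float32)  # one token's hidden state
gate_w = rng.standard_normal((NUM_EXPERTS, D)).astype(np.float32)  # router stays FP32
experts = [rng.standard_normal((D, D)).astype(np.float32) for _ in range(NUM_EXPERTS)]

logits = gate_w @ x
top = np.argsort(logits)[-TOP_K:]  # indices of the 2 selected experts
gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the top-2

# Only the selected experts' weights are ever read for this token;
# quantisation error in the other 6 experts cannot influence its output.
y = sum(g * (experts[i].T @ x) for g, i in zip(gates, top))
print(y.shape)
```

Because the gate's logits decide which weights get read at all, keeping the router at higher precision protects expert selection, while the expert matrices themselves can absorb INT4 error.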

Recommended GPU Configurations

| Budget | GPU Config | Best Format | Expected Speed |
|---|---|---|---|
| Entry | RTX 3090 (24 GB) | GGUF Q3_K_M | ~18 tok/s |
| Mid-range | RTX 5090 (32 GB) | GPTQ 4-bit | ~38 tok/s |
| Performance | 2x RTX 5090 (64 GB) | GPTQ 4-bit or FP16 | 45-58 tok/s |
| Maximum quality | 4x RTX 3090 (96 GB) | FP16 | ~30 tok/s |

For broader GPU comparisons, see our best GPU for LLM inference roundup. Browse all model guides in the Model Guides category.

Conclusion

Mixtral 8x7B is one of the best-value MoE models when quantised to INT4 — it delivers performance comparable to much larger dense models while fitting on a single 32 GB GPU. GPTQ 4-bit is the speed leader for GPU serving, while GGUF Q3_K_M is the only option for squeezing onto a 24 GB card. The MoE architecture actually quantises better than dense models, so you lose very little quality even at aggressive compression levels.

Run Mixtral 8x7B on Consumer GPUs

Dedicated GPU servers from a single RTX 3090 to multi-GPU clusters, configured for MoE model inference.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
