Mixtral 8x7B-Instruct-v0.1 is the original sparse Mixture-of-Experts open-weight model from Mistral (Dec 2023): 46.7B total parameters but only 12.9B active per token thanks to top-2-of-8 expert routing. AWQ INT4 brings the full weight set down to roughly 14 GB, fitting comfortably on the RTX 4090 24GB with room for the model’s native 32k context. Throughput follows the active-parameter count, not the total – making Mixtral feel much faster than its parameter sticker suggests. This guide on our UK dedicated GPU hosting covers the MoE architecture, the precise VRAM math (full expert set must fit), throughput, and – importantly – why in 2026 Mixtral 8x7B is mostly outclassed by Mistral Small 3 24B and Qwen 2.5 14B for new deployments.
Contents
- Why MoE matters here (and where it stops mattering)
- Architecture, expert routing and KV math
- VRAM accounting across precisions
- Throughput, latency and concurrency
- Quality benchmarks vs alternatives (the legacy story)
- vLLM deployment and code
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
Why MoE matters here (and where it stops mattering)
A Mixture-of-Experts transformer loads all 8 expert MLPs into VRAM but routes each token through only 2 of them at the routing gate. Two consequences:
- VRAM cost: 46.7B parameters total – all of them must fit in VRAM. The inactive experts cannot be offloaded to system RAM cleanly without killing latency, because routing is decided per token.
- Compute cost: 12.9B active parameters per token – decode bandwidth is bounded by the active set, not the total. This is why Mixtral feels faster than a 47B dense model.
The 4090 is bandwidth-bound on decode (1008 GB/s memory bandwidth), so Mixtral’s effective speed sits between a 7B and a 14B dense model on the same card – 85 t/s vs Llama 3 8B’s 195 t/s and Qwen 14B’s 110 t/s. The MoE story made sense in 2023 when nothing else combined this quality with this latency. By 2026, dense models like Mistral Small 3 24B (Apache 2.0, MMLU 81) and Qwen 2.5 14B (MMLU 79.7) deliver better quality per active parameter and per VRAM byte. Our Mixtral vs Llama 3 70B comparison covers the broader trajectory.
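As a rough sanity check on the bandwidth argument, the sketch below estimates batch-1 decode speed as memory bandwidth divided by the bytes streamed per token for the active weights. The efficiency factor is an assumption to cover kernel overhead and non-weight traffic, not a measured figure.

# Back-of-envelope decode estimate for a bandwidth-bound card (illustrative, not a benchmark)
BANDWIDTH_GBPS = 1008        # RTX 4090 memory bandwidth
ACTIVE_PARAMS = 12.9e9       # Mixtral active parameters per token
BYTES_PER_PARAM = 0.5        # AWQ INT4 weights
EFFICIENCY = 0.55            # assumed fraction of peak bandwidth actually achieved

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
est_tps = BANDWIDTH_GBPS * 1e9 * EFFICIENCY / bytes_per_token
print(f"~{est_tps:.0f} tokens/s at batch 1")   # lands in the same ballpark as the measured 85 t/s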
Architecture, expert routing and KV math
Mixtral 8x7B is a 32-layer transformer with 32 query heads, 8 KV heads (4:1 GQA), 128-dim heads, and 8 expert MLPs per layer (the “8x7B” name is misleading – it is 8 experts inside one model, not 8 separate models). Each expert is a SwiGLU MLP with a 14,336 intermediate dimension. The router is a linear layer that picks the top 2 experts per token; each token runs through both selected experts and their outputs are weighted-summed. Native context is 32,768 tokens. License is Apache 2.0.
KV cache per token at FP16: 8 KV heads x 128 dim x 32 layers x 2 (K+V) x 2 bytes = 131 KB. At FP8 KV: 65.5 KB. So 32k context = 2.1 GB FP8 per sequence – small relative to the 14 GB AWQ weights, leaving comfortable batch headroom.
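The same arithmetic in a few lines of Python, so you can plug in your own context lengths (constants taken from the architecture description above):

# KV cache maths for Mixtral 8x7B
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 32

def kv_bytes_per_token(bytes_per_elem):
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * bytes_per_elem   # x2 for K and V

fp16_tok = kv_bytes_per_token(2)        # 131,072 bytes (~131 KB)
fp8_tok = kv_bytes_per_token(1)         # 65,536 bytes (~65.5 KB)
print(fp8_tok * 32_768 / 1e9)           # ~2.1 GB for one sequence at full 32k context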
The expert routing has subtle latency implications. During prefill, all experts run in parallel across batched tokens, which is efficient. During decode, only 2 experts run per token, but the routing is dynamic – you cannot pre-fetch the right expert weights, and unfavourable expert distributions can cause GPU pipeline stalls. In practice on the 4090 this overhead is minor (5-8% of decode time), but it is real.
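For intuition, here is a minimal top-2 routing sketch in the spirit of the published Mixtral design (softmax over router logits, keep the top 2, renormalise, weighted sum of expert outputs). The router and experts here are generic stand-in modules, not the actual implementation:

import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    # x: (tokens, hidden); router: nn.Linear(hidden, n_experts); experts: list of MLP modules
    weights = F.softmax(router(x), dim=-1)                 # (tokens, n_experts)
    topk_w, topk_idx = torch.topk(weights, top_k, dim=-1)
    topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)     # renormalise over the chosen 2
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e                  # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out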
VRAM accounting across precisions
| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights (full expert set) | 87 GB | 47 GB | 14 GB |
| KV @ 32k FP8, 1 seq | 2.1 GB | 2.1 GB | 2.1 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM | OOM | 19.1 GB |
| Headroom under 24 GB (AWQ) | n/a | n/a | 4.9 GB |
| Concurrency budget (AWQ + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 2.1 GB | Comfortable, 5 GB free |
| 4 seqs | 8k | 2.1 GB | Comfortable |
| 8 seqs | 4k | 2.1 GB | Comfortable |
| 8 seqs | 8k | 4.2 GB | Comfortable |
| 16 seqs | 4k | 4.2 GB | Tight – cap at 12 |
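The concurrency table falls out of a simple budget: AWQ weights plus FP8 KV plus a roughly fixed overhead must stay under 24 GB. A rough planner using the figures from the tables above:

# Rough single-4090 budget planner for Mixtral AWQ + FP8 KV (figures from the tables above)
WEIGHTS_GB = 14.0
OVERHEAD_GB = 3.0                 # CUDA graphs + workspace + scheduler/activations
KV_TOKEN_GB = 65_536 / 1e9        # FP8 KV per token, from the KV maths above
CARD_GB = 24.0

def fits(num_seqs, avg_ctx):
    total = WEIGHTS_GB + OVERHEAD_GB + num_seqs * avg_ctx * KV_TOKEN_GB
    print(f"{num_seqs} seqs @ {avg_ctx} tokens: {total:.1f} GB of {CARD_GB} GB")
    return total < CARD_GB

fits(8, 8192)     # ~21.3 GB – comfortable
fits(16, 4096)    # ~21.3 GB – tight once allocator peaks and fragmentation are counted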
Throughput, latency and concurrency
Single 4090, vLLM 0.6.3, AWQ Marlin, FP8 KV, prompt 1024, generate 256:
| Metric | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Decode tokens/sec (per stream) | 85 | 72 | 52 | 34 |
| Aggregate tokens/sec | 85 | 288 | 416 | 540 |
| Time to first token (1k prompt, ms) | 180 | 240 | 320 | 440 |
| Memory used (GB) | 17.0 | 17.8 | 19.5 | 21.4 |
| Prefill 8k batch 1 (t/s) | 3,200 | n/a | n/a | n/a |
The high prefill speed (3,200 t/s) reflects expert parallelism during long prompts. See the dedicated Mixtral benchmark for full concurrency curves.
Quality benchmarks vs alternatives (the legacy story)
| Benchmark | Mixtral 8x7B | Mistral Small 3 24B | Qwen 2.5 14B | Llama 3.1 8B |
|---|---|---|---|---|
| MMLU | 70.6 | 81.0 | 79.7 | 69.4 |
| GSM8K | 61.1 | 91.0 | 90.2 | 84.5 |
| HumanEval | 40.2 | 84.8 | 86.6 | 72.6 |
| MATH | 28.4 | 54.0 | 55.6 | 30.6 |
| MT-Bench | 8.30 | 8.6 | 8.4 | 8.10 |
| VRAM AWQ INT4 | 14 GB | 13.4 GB | 9 GB | 4.5 GB |
| 4090 single-card t/s (AWQ) | 85 | 85 | 135 | 245 |
Mixtral is showing its age. It is now worse than Mistral Small 3 24B (same vendor, same hardware footprint, newer Apache 2.0 model) on every benchmark, and Qwen 2.5 14B is better and faster in less VRAM. The only reasons to deploy Mixtral 8x7B in 2026 are: (a) you have an existing fine-tune on it, (b) you specifically need its distinctive instruction-following style or multilingual coverage from the original training mix, or (c) compatibility with downstream tooling that expects the Mixtral architecture.
vLLM deployment and code
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
--quantization awq_marlin \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 8 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.92
Mixtral’s native context is 32k. KV growth is bounded – 4 GB at full 32k for 2 sequences – so the 24 GB envelope holds even at long context. The Mistral chat template is non-obvious:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
msgs = [
{"role":"user","content":"Explain MoE routing in 200 words."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <s>[INST] Explain MoE routing in 200 words. [/INST]
# Note the [INST]...[/INST] format - very different from ChatML
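In practice you rarely apply the template by hand: vLLM's OpenAI-compatible chat endpoint applies the model's chat template server-side. A minimal client call against the server launched above (port 8000 is vLLM's default; the API key is a placeholder, assuming no --api-key was set on the server):

# Query the vLLM OpenAI-compatible server started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumes no --api-key on the server
resp = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    messages=[{"role": "user", "content": "Explain MoE routing in 200 words."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)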
The vLLM setup guide covers driver requirements; the AWQ guide covers quantisation options.
Production gotchas (seven items)
- FP8 will not fit a 24 GB card: full expert set at FP8 is 47 GB. AWQ INT4 is the only single-card path on the 4090. For FP8 you need an A100 80GB or H100.
- Expert routing imbalance under low-batch load: at batch 1 with adversarial prompts, all 8 experts can be activated unevenly, causing minor pipeline stalls. Not a problem at batch 4+.
- [INST] template is not ChatML: mixing Mixtral with libraries that assume ChatML will silently produce degraded outputs. Always use AutoTokenizer.apply_chat_template.
- Mistral tokenizer uses BOS by default: vLLM's --add-bos-token default is False; for Mixtral you should set it to True, or your prompts encode subtly differently than they did during training.
- KV cache prefix caching has a lower hit rate on MoE: because expert routing is input-dependent, identical prefixes can sometimes route differently than during the cached run. Enable prefix caching but expect a ~30% lower hit rate than on dense models.
- FlashAttention version: Mixtral works with FlashAttention 2.4+. Newer versions are fine but offer no Mixtral-specific gains.
- Power and thermal at sustained batch 16: the 4090 draws 440-450 W. Cap it to 410 W with nvidia-smi -pl 410 for stable throughput; see the power draw analysis.
When to pick this over alternatives
Pick Mixtral 8x7B today only when you have an existing investment – LoRA adapters, fine-tunes, downstream tooling – that targets it. For a green-field deployment, pick Mistral Small 3 24B instead: same vendor, same hardware footprint, newer architecture, better on every benchmark. Pick Qwen 2.5 14B if you want strictly higher throughput at the same or better quality.
Step up to Llama 3.1 70B INT4 across two cards if you need MMLU 83+. Step laterally to DeepSeek Coder V2 Lite for MoE coding workloads. Step down to Mistral 7B or Mistral Nemo 12B for higher single-card throughput in the Mistral family.
Verdict
Mixtral 8x7B AWQ on the RTX 4090 24GB still works – 85 t/s decode, 32k context, Apache 2.0, fits cleanly on a single card. But in 2026 it is a legacy choice. Mistral’s own newer Mistral Small 3 24B beats it on every benchmark at the same VRAM cost, and Qwen 2.5 14B is faster at higher quality. The MoE-vs-dense throughput edge that made Mixtral compelling in 2023 has been eroded by GQA improvements and FP8 tensor cores in newer dense models. Keep this guide as reference for legacy deployments and existing fine-tunes; for new builds, follow the Mistral Small 3 guide instead. Compare with the RTX 5090 32GB only if you specifically need FP8 Mixtral with full headroom.
Run Mixtral 8x7B on a UK RTX 4090
AWQ INT4 MoE, 32k context, 85 t/s decode. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Mistral Small 3 24B, Mistral Nemo 12B, Qwen 2.5 14B, Mixtral benchmark, Mixtral vs Llama 3 70B, DeepSeek Coder V2 Lite (MoE), 4090 vs 5090.