
RTX 4090 24GB for Mixtral 8x7B: The Original Open MoE on a Single Card

Mixtral 8x7B AWQ on the RTX 4090 24GB - 14GB MoE weights, 12.9B active per token, 85 t/s decode, 32k context, Apache 2.0 - now mostly outclassed by Mistral Small 3 and Qwen 2.5 14B.

Mixtral 8x7B-Instruct-v0.1 is the original sparse Mixture-of-Experts open-weight model from Mistral (Dec 2023): 46.7B total parameters but only 12.9B active per token thanks to top-2-of-8 expert routing. AWQ INT4 brings the full weight set down to roughly 14 GB, fitting comfortably on the RTX 4090 24GB with room for the model’s native 32k context. Throughput follows the active-parameter count, not the total, so Mixtral feels much faster than its parameter sticker suggests. This guide to running it on our UK dedicated GPU hosting covers the MoE architecture, the precise VRAM math (the full expert set must fit), throughput, and – importantly – why in 2026 Mixtral 8x7B is mostly outclassed by Mistral Small 3 24B and Qwen 2.5 14B for new deployments.

Why MoE matters here (and where it stops mattering)

A Mixture-of-Experts transformer loads all 8 expert MLPs per layer into VRAM but routes each token through only 2 of them at the routing gate. Two consequences:

  • VRAM cost: 46.7B parameters total – all of them must fit in VRAM. The inactive experts cannot be cleanly offloaded to system RAM without killing latency.
  • Compute cost: 12.9B active parameters per token – decode bandwidth is bounded by the active set, not the total. This is why Mixtral feels faster than a 47B dense model.

The 4090 is bandwidth-bound on decode (1008 GB/s memory bandwidth), so Mixtral’s effective speed tracks its 12.9B active set rather than its 46.7B total – 85 t/s on the same card where Llama 3 8B does 195 t/s and Qwen 14B does 110 t/s, and roughly double what a dense 47B model would manage. The MoE story made sense in 2023 when nothing else combined this quality with this latency. By 2026, dense models like Mistral Small 3 24B (Apache 2.0, MMLU 81) and Qwen 2.5 14B (MMLU 79.7) deliver better quality per active parameter and per VRAM byte. Our Mixtral vs Llama 3 70B comparison covers the broader trajectory.
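
A back-of-envelope roofline makes the active-parameter point concrete. The sketch below is illustrative, assuming weight reads dominate decode traffic and ignoring KV reads and dequantisation overhead; measured throughput lands at roughly 55% of the ceiling.

# Hypothetical roofline: decode t/s ~ bandwidth / bytes moved per token.
BANDWIDTH = 1008e9        # RTX 4090 memory bandwidth, bytes/s
ACTIVE_PARAMS = 12.9e9    # Mixtral active parameters per token (top-2 routing)
TOTAL_PARAMS = 46.7e9     # full expert set
INT4_BYTES = 0.5          # AWQ INT4

moe_ceiling = BANDWIDTH / (ACTIVE_PARAMS * INT4_BYTES)    # ~156 t/s
dense_ceiling = BANDWIDTH / (TOTAL_PARAMS * INT4_BYTES)   # ~43 t/s if all 47B were active
print(f"MoE ceiling {moe_ceiling:.0f} t/s, dense-47B ceiling {dense_ceiling:.0f} t/s")
# Measured: 85 t/s - routing overhead, KV reads and kernel launches eat the rest.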

Architecture, expert routing and KV math

Mixtral 8x7B is a 32-layer transformer with 32 query heads, 8 KV heads (4:1 GQA), 128-dim heads, and 8 expert MLPs per layer (the “8x7B” is misleading – it is 8 experts inside one model, not 8 separate models; attention and embeddings are shared, which is why the total is 46.7B rather than 8 × 7 = 56B). Each expert is a SwiGLU MLP with a 14,336 intermediate dimension. The router is a linear layer that picks the top 2 experts per token; each token then runs through both experts and their outputs are weighted-summed, as sketched below. Native context is 32,768 tokens. License is Apache 2.0.
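
A minimal sketch of that routing step, in PyTorch with Mixtral’s real dimensions (hidden 4096, intermediate 14,336) but plain SiLU MLPs standing in for the actual SwiGLU experts – illustrative, not Mixtral’s implementation:

import torch
import torch.nn.functional as F

hidden, n_experts, top_k = 4096, 8, 2
gate = torch.nn.Linear(hidden, n_experts, bias=False)     # the router
experts = [                                               # stand-ins for the SwiGLU MLPs
    torch.nn.Sequential(
        torch.nn.Linear(hidden, 14336),
        torch.nn.SiLU(),
        torch.nn.Linear(14336, hidden),
    )
    for _ in range(n_experts)
]

def moe_forward(x):                                   # x: (tokens, hidden)
    logits = gate(x)                                  # (tokens, 8) routing scores
    weights, idx = logits.topk(top_k, dim=-1)         # pick top 2 experts per token
    weights = F.softmax(weights, dim=-1)              # renormalise over the chosen 2
    out = torch.zeros_like(x)
    for k in range(top_k):                            # each token visits both picks
        for e in range(n_experts):
            mask = idx[:, k] == e                     # tokens whose k-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, k:k+1] * experts[e](x[mask])
    return out                                        # weighted sum of expert outputs

# e.g. moe_forward(torch.randn(5, hidden)).shape -> (5, 4096)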

KV cache per token at FP16: 8 KV heads x 128 dim x 32 layers x 2 (K+V) x 2 bytes = 131 KB. At FP8 KV: 65.5 KB. So 32k context = 2.1 GB FP8 per sequence – small relative to the 14 GB AWQ weights, leaving comfortable batch headroom.
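
The same arithmetic as a quick sanity check:

# KV-cache arithmetic for Mixtral 8x7B, reproducing the figures above.
kv_heads, head_dim, layers = 8, 128, 32
per_token_fp16 = kv_heads * head_dim * layers * 2 * 2   # K+V, 2 bytes -> 131,072 B
per_token_fp8 = per_token_fp16 // 2                     # -> 65,536 B
print(per_token_fp16 / 1e3, "KB/token FP16")            # 131 KB
print(32_768 * per_token_fp8 / 1e9, "GB per seq @ 32k FP8")  # ~2.1 GB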

The expert routing has subtle latency implications. During prefill, all experts run in parallel across batched tokens, which is efficient. During decode, only 2 experts run per token, but the routing is dynamic – the right expert weights cannot be pre-fetched, and unfavourable expert distributions can cause GPU pipeline stalls. In practice on the 4090 this overhead is minor (5-8% of decode time), but it is real.

VRAM accounting across precisions

| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights (full expert set) | 87 GB | 47 GB | 14 GB |
| KV @ 32k FP8, 1 seq | 2.1 GB | 2.1 GB | 2.1 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM | OOM | 19.1 GB |
| Headroom under 24 GB (AWQ) | n/a | n/a | 4.9 GB |

| Concurrency budget (AWQ + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 2.1 GB | Comfortable, 5 GB free |
| 4 seqs | 8k | 2.1 GB | Comfortable |
| 8 seqs | 4k | 2.1 GB | Comfortable |
| 8 seqs | 8k | 4.2 GB | Comfortable |
| 16 seqs | 4k | 4.2 GB | Tight – cap at 12 |
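
A rough helper to reproduce the rows above. The 17 GB non-KV base is derived from the VRAM table (19.1 GB total minus 2.1 GB KV); real vLLM accounting differs slightly because activation buffers grow with batch size.

# Illustrative concurrency-budget check for AWQ Mixtral on a 24 GB card.
PER_TOKEN_KV_GB = 65_536 / 1e9   # FP8 KV bytes per token, from the KV math above
BASE_GB = 19.1 - 2.1             # weights + graphs + scheduler, no KV
CARD_GB = 24.0

def kv_budget(seqs, avg_ctx):
    kv = seqs * avg_ctx * PER_TOKEN_KV_GB
    return kv, CARD_GB - BASE_GB - kv

for seqs, ctx in [(1, 32_768), (8, 8_192), (16, 4_096)]:
    kv, free = kv_budget(seqs, ctx)
    print(f"{seqs:>2} seqs @ {ctx // 1024}k: KV {kv:.1f} GB, free {free:.1f} GB")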

Throughput, latency and concurrency

Single 4090, vLLM 0.6.3, AWQ Marlin, FP8 KV, prompt 1024, generate 256:

| Metric | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Decode tokens/sec (per stream) | 85 | 72 | 52 | 34 |
| Aggregate tokens/sec | 85 | 288 | 416 | 540 |
| Time to first token (1k prompt, ms) | 180 | 240 | 320 | 440 |
| Memory used (GB) | 17.0 | 17.8 | 19.5 | 21.4 |
| Prefill 8k batch 1 (t/s) | 3,200 | n/a | n/a | n/a |

The high prefill speed (3,200 t/s) reflects expert parallelism during long prompts. See the dedicated Mixtral benchmark for full concurrency curves.
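
To reproduce the per-stream numbers yourself, here is a minimal probe against an OpenAI-compatible vLLM endpoint (assumptions: the server from the deployment section below is listening on localhost:8000, the openai Python package is installed, and streamed chunks are counted as roughly one token each):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
t0, first, n = time.perf_counter(), None, 0
stream = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    messages=[{"role": "user", "content": "Write 300 words on MoE routing."}],
    max_tokens=256, stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()   # time to first token
        n += 1                            # ~1 token per streamed chunk
t1 = time.perf_counter()
print(f"TTFT {1000 * (first - t0):.0f} ms, decode {n / (t1 - first):.0f} t/s")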

Quality benchmarks vs alternatives (the legacy story)

| Benchmark | Mixtral 8x7B | Mistral Small 3 24B | Qwen 2.5 14B | Llama 3.1 8B |
|---|---|---|---|---|
| MMLU | 70.6 | 81.0 | 79.7 | 69.4 |
| GSM8K | 61.1 | 91.0 | 90.2 | 84.5 |
| HumanEval | 40.2 | 84.8 | 86.6 | 72.6 |
| MATH | 28.4 | 54.0 | 55.6 | 30.6 |
| MT-Bench | 8.30 | 8.6 | 8.4 | 8.10 |
| VRAM AWQ INT4 | 14 GB | 13.4 GB | 9 GB | 4.5 GB |
| 4090 single-card t/s (AWQ) | 85 | 85 | 135 | 245 |

Mixtral is showing its age. It is now worse than Mistral Small 3 24B (same hardware footprint, same vendor, newer Apache 2.0 model) on every benchmark. Qwen 2.5 14B is better and faster in less VRAM. The only reasons to deploy Mixtral 8x7B in 2026 are: (a) you have an existing fine-tune on it, (b) you specifically need its distinctive instruction-following style or the multilingual coverage of the original training mix, or (c) compatibility with downstream tooling that expects the Mixtral architecture.

vLLM deployment and code

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92

Mixtral’s native context is 32k. KV growth is bounded – 4 GB at full 32k for 2 sequences – so the 24 GB envelope holds even at long context. The Mistral chat template is non-obvious:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
msgs = [
  {"role":"user","content":"Explain MoE routing in 200 words."}
]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# <s>[INST] Explain MoE routing in 200 words. [/INST]
# Note the [INST]...[/INST] format - very different from ChatML
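
The rendered string can then go straight to vLLM’s raw /v1/completions endpoint, bypassing server-side chat templating entirely (sketch; assumes the server from the command above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    prompt=prompt,           # the [INST]-formatted string from apply_chat_template
    max_tokens=256, temperature=0.7,
)
print(resp.choices[0].text)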

The vLLM setup guide covers driver requirements; the AWQ guide covers quantisation options.

Production gotchas

  • FP8 will not fit a 24 GB card: full expert set at FP8 is 47 GB. AWQ INT4 is the only single-card path on the 4090. For FP8 you need an A100 80GB or H100.
  • Expert routing imbalance under low-batch load: at batch 1 with adversarial prompts, expert activation can be heavily skewed across tokens, causing minor pipeline stalls. Not a problem at batch 4+.
  • [INST] template is not ChatML: mixing Mixtral with libraries that assume ChatML will silently produce degraded outputs. Always use AutoTokenizer.apply_chat_template.
  • Mistral tokenizer uses BOS by default: vLLM’s --add-bos-token default is False; for Mixtral you should set it to True, or your prompts will encode subtly differently from how they did during training. A quick sanity check follows this list.
  • KV cache prefix caching has lower hit rate on MoE: because expert routing is input-dependent, identical prefixes can sometimes route differently than during the cached run. Enable prefix caching but expect ~30% lower hit rate than dense models.
  • FlashAttention version: Mixtral works with FlashAttention 2.4+. Newer versions are fine but offer no Mixtral-specific gains.
  • Power and thermal at sustained batch 16: the 4090 draws 440-450 W. Cap it to 410 W with nvidia-smi -pl 410 for stable throughput; see the power draw analysis.
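
For the BOS gotcha above, a quick sanity check with transformers (which adds BOS by default for this tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
ids = tok("[INST] hello [/INST]").input_ids
# Expect the first id to be BOS; if not, prompts encode off-distribution.
assert ids[0] == tok.bos_token_id, "BOS missing from encoded prompt"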

When to pick this over alternatives

Pick Mixtral 8x7B today only when you have an existing investment – LoRA adapters, fine-tunes, downstream tooling – that targets it. For a green-field deployment, pick Mistral Small 3 24B instead: same vendor, same hardware footprint, newer architecture, better on every benchmark. Pick Qwen 2.5 14B if you want strictly higher throughput at the same or better quality.

Step up to Llama 3.1 70B INT4 across two cards if you need MMLU 83+. Step laterally to DeepSeek Coder V2 Lite for MoE coding workloads. Step down to Mistral 7B or Mistral Nemo 12B for higher single-card throughput in the Mistral family.

Verdict

Mixtral 8x7B AWQ on the RTX 4090 24GB still works – 85 t/s decode, 32k context, Apache 2.0, fits cleanly on a single card. But in 2026 it is a legacy choice. Mistral’s own newer Mistral Small 3 24B beats it on every benchmark at the same VRAM cost, and Qwen 2.5 14B is faster at higher quality. The MoE-vs-dense throughput edge that made Mixtral compelling in 2023 has been eroded by GQA improvements and FP8 tensor cores in newer dense models. Keep this guide as a reference for legacy deployments and existing fine-tunes; for new builds, follow the Mistral Small 3 guide instead. Consider the RTX 5090 32GB only if you specifically need FP8 Mixtral with full headroom.

Run Mixtral 8x7B on a UK RTX 4090

AWQ INT4 MoE, 32k context, 85 t/s decode. UK dedicated hosting from Manchester.

Order the RTX 4090 24GB

See also: Mistral Small 3 24B, Mistral Nemo 12B, Qwen 2.5 14B, Mixtral benchmark, Mixtral vs Llama 3 70B, DeepSeek Coder V2 Lite (MoE), 4090 vs 5090.
