Mixtral 8x7B is the canonical sparse Mixture-of-Experts model: 46.7B total parameters, but only two of eight experts activate per token, so the per-token compute and bandwidth profile is closer to a 12.9B dense model. That mismatch between storage cost and compute cost is exactly what a 24GB consumer card has trouble with — you need the VRAM to hold all eight experts, but the compute budget is set by just two of them. On an RTX 4090 24GB from Gigagpu, the AWQ INT4 build occupies 14.0 GB of weights and decodes at 85 tokens per second at batch one. Mixtral 8x22B (141B total) does not fit on a single 24GB card under any current quantisation; AWQ alone is roughly 70 GB.
Contents
- Does it fit? Quantisation footprint
- Methodology and test rig
- Decode throughput per batch
- Prefill, TTFT and p50/p99 latency
- Expert routing physics
- Cross-card comparison
- Mixtral vs Llama 70B vs Qwen 14B
- Production gotchas
- Verdict: when to pick Mixtral
Does it fit? Quantisation footprint
Mixtral 8x7B holds 46.7 billion parameters across eight feed-forward expert blocks plus shared attention. At FP16, weights alone are 87 GB — well outside any 24GB envelope. The story changes dramatically with INT4 weight-only quantisation, which compresses each expert independently without needing to retrain the router. AWQ Marlin kernels keep dequantisation fused into the matmul, so the latency penalty is small relative to FP8.
| Model | Format | Weights on disk | Resident on 4090 | Fits 24GB? |
|---|---|---|---|---|
| Mixtral 8x7B | FP16 | 87.0 GB | — | No |
| Mixtral 8x7B | FP8 | 43.5 GB | — | No |
| Mixtral 8x7B | AWQ INT4 | 14.0 GB | 16.5 GB w/ KV | Yes |
| Mixtral 8x7B | GGUF Q4_K_M | 15.6 GB | 18.0 GB w/ KV | Yes (tight) |
| Mixtral 8x7B | GGUF Q5_K_M | 19.4 GB | 21.5 GB w/ KV | Yes, no batching headroom |
| Mixtral 8x22B | FP16 | 282 GB | — | No |
| Mixtral 8x22B | AWQ INT4 | ~70 GB | — | No |
| Mixtral 8x22B | GGUF Q2_K | ~52 GB | — | No |
The practical message: AWQ Marlin is the only sensible production path on a single 4090. GGUF works for hobby use through llama.cpp but does not benefit from the same fused kernels and tops out around 72 t/s decode. For 8x22B you will need at least two 4090s with tensor parallelism, or step up to an H100 80GB; see the 4090 vs 5090 comparison for the case where 32GB starts mattering.
Methodology and test rig
All numbers below are measured on a stock RTX 4090 24GB Founders Edition at 450W TDP, no overclock, with PCIe 4.0 x16 link. Host platform is a Ryzen 9 7950X with 64 GB DDR5-5600 and a Samsung 990 Pro 2TB NVMe; OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Inference stack is vLLM 0.6.4 with PyTorch 2.5 and FlashAttention 2.6 against the AWQ Marlin kernel path.
Each measurement runs for 60 seconds of warm traffic after a 30-second warm-up, with prompts drawn from a logged production conversation distribution (median 380 input tokens, p99 2,100, mean output 220 tokens). Latency percentiles are reported per-stream; throughput is aggregated across all in-flight sequences. Power draw was sampled via `nvidia-smi dmon` at 1 Hz; thermals stayed under 78 °C throughout.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq_marlin --kv-cache-dtype fp8 \
    --max-model-len 32768 --max-num-seqs 8 \
    --enable-chunked-prefill --enable-prefix-caching \
    --gpu-memory-utilization 0.93
```
FP8 KV cache halves the per-token KV footprint relative to FP16 with a measured perplexity delta under 0.02. The chunked-prefill flag breaks long prompts into 512-token chunks so they interleave with decode steps, dramatically improving p99 TTFT under mixed traffic. Without it, a 4,000-token prompt monopolises the forward pass for its entire prefill (over a second at 3,200 input tokens per second), stalling every in-flight decode stream for that long.
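For reference, here is a minimal per-stream probe in the spirit of the harness described above, not the exact rig used for the tables: it assumes the `openai` Python client pointed at the server started by the command above, and treats each streamed chunk as roughly one token.

```python
# Sketch: measure TTFT and decode rate for one streamed request against the
# local vLLM OpenAI-compatible endpoint. Assumes the `openai` package and that
# one streamed chunk is approximately one generated token.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def probe(prompt: str, max_tokens: int = 200) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_tps = (chunks - 1) / (end - first_token_at) if chunks > 1 else 0.0
    return ttft, decode_tps

if __name__ == "__main__":
    ttft, tps = probe("Summarise the trade-offs of INT4 weight-only quantisation.")
    print(f"TTFT {ttft * 1000:.0f} ms, decode {tps:.1f} t/s")
```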
Decode throughput per batch
Mixtral’s decode is bandwidth-bound. Streaming the full 14 GB of INT4 weights each step would cost roughly 14 GB / 1008 GB/s ≈ 14 ms per token (about 72 t/s); because each layer reads only its two routed experts plus the shared attention weights, the effective per-token traffic is lower, and the observed 85 t/s at batch 1 lands between that full-weight floor and the ideal two-expert case. As batch grows, the same expert weights are reused across multiple tokens, so aggregate throughput climbs steeply until KV cache pressure caps it.
| Batch | Per-stream t/s | Aggregate t/s | Speed-up vs b=1 | VRAM |
|---|---|---|---|---|
| 1 | 85 | 85 | 1.0x | 16.5 GB |
| 2 | 72 | 144 | 1.7x | 17.5 GB |
| 4 | 70 | 280 | 3.3x | 19.0 GB |
| 8 | 52 | 420 | 4.9x | 21.8 GB |
| 16 | 30 | 480 | 5.6x (memory cap) | 23.4 GB |
| 32 | — | OOM | — | KV pressure |
Aggregate throughput peaks at 480 t/s around batch 16 — beyond that, FP8 KV cache for thirty-two 4k-context conversations exceeds the remaining 6 GB of headroom. If you need batch 32 you must drop max_model_len to 8k or migrate to the 5090’s 32GB. For comparison, the same Mixtral build on llama.cpp with GGUF Q4_K_M tops out around 230 t/s aggregate at batch 8.
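A back-of-envelope check on that ceiling, using Mixtral 8x7B's published attention shape (32 layers, 8 KV heads, head dimension 128) and the roughly 6 GB of post-weights headroom implied by the table; treat it as a sizing sketch, not the allocator's exact accounting, since it ignores block granularity and activation scratch space.

```python
# Rough KV-cache sizing for FP8 KV at 4k context, per batch size.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # Mixtral 8x7B attention config
BYTES_PER_ELEM = 1                        # FP8 KV cache
HEADROOM_GB = 6.0                         # left over after 16.5 GB of weights

def kv_gb(batch: int, context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
    return batch * context_tokens * per_token / 1e9

for batch in (8, 16, 32):
    need = kv_gb(batch, 4096)
    verdict = "fits" if need <= HEADROOM_GB else "OOM"
    print(f"batch {batch:>2} @ 4k ctx: {need:4.1f} GB KV -> {verdict}")
# batch  8: ~2.1 GB, batch 16: ~4.3 GB, batch 32: ~8.6 GB -- matching the table.
```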
Prefill, TTFT and p50/p99 latency
Prefill on a 4090 is compute-bound, not memory-bound: the whole prompt is processed in one pass, so every weight read is amortised across hundreds of tokens while each token still runs through the attention QKV projections and its routed expert FFNs. Mixtral processes prefill at roughly 3,200 input tokens per second on this hardware, and that number is what sets your TTFT for chat-style traffic.
| Prompt length | p50 TTFT | p99 TTFT | p50 e2e (200 tok out) | p99 e2e |
|---|---|---|---|---|
| 256 tokens | 110 ms | 185 ms | 2.5 s | 3.1 s |
| 1,024 tokens | 340 ms | 520 ms | 2.7 s | 3.4 s |
| 2,048 tokens | 670 ms | 980 ms | 3.1 s | 4.0 s |
| 4,096 tokens (chunked) | 1.31 s | 1.84 s | 3.7 s | 4.9 s |
| 8,192 tokens (chunked) | 2.59 s | 3.41 s | 5.0 s | 6.7 s |
Prefix caching is the single biggest TTFT win for chatbot workloads where the system prompt is stable: a 1,500-token preamble that previously cost ~470 ms of prefill drops to ~15 ms cache lookup, freeing the prefill compute for the user’s actual message.
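A small sketch of the request-shaping side of that win: the preamble has to stay byte-identical across requests for the cached prefix to be reusable, so it lives in one constant and nothing per-request (timestamps, user IDs, session data) is templated into it. The file path and helper name here are placeholders.

```python
# Keep the stable ~1,500-token preamble as one constant so every request
# starts with the identical token sequence and hits the prefix cache.
from pathlib import Path

SYSTEM_PROMPT = Path("system_prompt.txt").read_text()  # never template per-request values into this

def build_messages(user_text: str) -> list[dict]:
    # Variable content goes after the shared prefix, so only the user turn pays prefill.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
```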
Expert routing physics
The Mixtral router selects the top-2 experts per token via a softmax over 8 logits. Empirically, across a diverse prompt mix, the load is roughly uniform — each expert is hit on 23-27% of tokens within any 64-token window. This means two things on a 24GB card.
First, expert offloading to CPU RAM, which works for some Llama 70B FFN-offload tricks where only a small slice of weights is touched per second, does not help here. Within four decode steps almost every expert has been activated; you would be PCIe-pumping the full 14 GB of expert weights roughly every 50 ms, about 280 GB/s of traffic, nearly ten times the ~30 GB/s a PCIe 4.0 x16 link sustains. Throughput collapses to 6-8 t/s.
Second, the per-token compute path is genuinely smaller than a dense 13B — only the attention matrices and the two selected expert FFNs participate. That is why Mixtral 8x7B decode runs faster than Llama 2 13B on the same hardware despite holding 3.5 times more weights. The trade-off is purely memory: you pay for storing the experts you do not use this token, but you might use them next token.
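A schematic of that top-2 gating arithmetic, matching the description above rather than any particular implementation: softmax over the eight router logits, keep the two largest, renormalise their weights, and mix only those two experts' outputs.

```python
# Schematic top-2 routing for a single token (illustration only).
import numpy as np

def top2_route(router_logits: np.ndarray, expert_outputs: np.ndarray) -> np.ndarray:
    """router_logits: (8,), expert_outputs: (8, hidden). Returns (hidden,)."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()                       # softmax over the 8 experts
    top2 = np.argsort(probs)[-2:]              # indices of the two largest gates
    gates = probs[top2] / probs[top2].sum()    # renormalise so the two gates sum to 1
    # Only these two experts' FFNs would actually be evaluated for this token.
    return gates[0] * expert_outputs[top2[0]] + gates[1] * expert_outputs[top2[1]]

# Example: 8 experts, hidden size 4096, random logits for one token.
rng = np.random.default_rng(0)
out = top2_route(rng.normal(size=8), rng.normal(size=(8, 4096)))
```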
Cross-card comparison
Mixtral 8x7B AWQ batch-1 decode across the realistic candidate stack, all using the same vLLM 0.6.4 + AWQ Marlin recipe (3090 falls back to the older AWQ kernel because Marlin needs Ada or newer):
| GPU | VRAM | Bandwidth | Mixtral 8x7B b=1 | Mixtral 8x7B b=8 agg | Notes |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 | 448 GB/s | does not fit | — | Need at least 16.5 GB for AWQ + KV |
| RTX 5080 16GB | 16 | 960 GB/s | does not fit | — | Same constraint |
| RTX 3090 24GB | 24 | 936 GB/s | 62 t/s | 290 t/s | No FP8 path, AWQ kernel only |
| RTX 4090 24GB | 24 | 1008 GB/s | 85 t/s | 420 t/s | Marlin + FP8 KV |
| RTX 5090 32GB | 32 | 1792 GB/s | 148 t/s | 730 t/s | Headroom for batch 32 |
| H100 80GB | 80 | 3350 GB/s | 175 t/s | 1100 t/s | Trivially serves 8x22B too |
The 4090 is the cheapest card on which Mixtral 8x7B is comfortably production-grade. The 3090 works but loses ~27% on decode and lacks FP8 KV. The 5090 doubles bandwidth and unlocks batch 32 with 8k context. For 8x22B you need an H100 or two 4090s with tensor parallelism — see Llama 70B INT4 on 4090 for the analogous fit story.
Mixtral vs Llama 70B vs Qwen 14B
The interesting question is rarely “which Mixtral?” but “should I pick Mixtral at all?” Compared to its peers on a single 4090:
| Metric | Mixtral 8x7B AWQ | Llama 3.1 70B AWQ | Qwen 2.5 14B AWQ |
|---|---|---|---|
| Resident VRAM | 16.5 GB | 21.0 GB | 9.5 GB |
| Decode b=1 | 85 t/s | 23 t/s | 135 t/s |
| Aggregate b=8 | 420 t/s | 72 t/s | 620 t/s |
| MMLU 5-shot | 70.6 | 86.0 | 79.7 |
| HumanEval | 40.2 | 80.5 | 79.4 |
| Multilingual MGSM | 71.3 | 89.2 | 76.8 |
| Active experts / params | 2 of 8 / 12.9B | dense 70B | dense 14B |
Qwen 2.5 14B beats Mixtral 8x7B on essentially every quality dimension at a smaller VRAM footprint and higher throughput. Llama 70B INT4 wins on MMLU and code at the cost of 4x slower decode. The honest verdict is that Mixtral 8x7B’s role today is narrow: legacy fine-tunes, router-conditioned downstream tasks, or the specific case where you need the MoE inductive bias for retrieval-augmented multilingual work. Most fresh deployments should default to Qwen 14B.
Production gotchas
- AWQ Marlin requires Ada or newer. If you mix 4090s and 3090s in a fleet, the 3090 falls back to the slower legacy AWQ kernel and your routing becomes unbalanced. Pin Mixtral traffic to Ada-class cards.
- FP8 KV cache breaks for some sampling configs. Setting `top_p` below 0.05 with FP8 KV occasionally produces NaN logits on Mixtral specifically; fall back to FP16 KV (and lose ~3 t/s) for sub-1% samplers.
- Expert imbalance under adversarial prompts. Repetitive prompts can route 70%+ of tokens through a single expert pair, dropping decode to 60 t/s. Real production traffic doesn’t trigger this, but synthetic benchmarks sometimes do.
- 32k context is a trap. Mixtral was trained at 32k but quality above 16k drifts noticeably. Cap `max_model_len` to 16384 unless you have measured your domain-specific recall above that.
- Tokeniser mismatch with HF Transformers. The Mistral-format tokeniser sometimes ships with subtly different special-token handling than the HF version; lock the tokeniser file alongside the weights.
- Cold-start is 12 seconds. AWQ dequantisation of all eight experts on first load takes longer than for dense models. Pre-warm before opening the load balancer (see the sketch after this list).
- Watch the L2 cache stat. The 4090’s 72 MB L2 helps Mixtral more than it helps dense models because attention reuses the same KV blocks; `nvidia-smi nvlink` is irrelevant on consumer cards, but the L2 hit rate on `dcgmi dmon` tells you whether your batching is degenerate.
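A minimal pre-warm sketch for the cold-start point above, assuming the `requests` library and the same localhost server as earlier: wait for vLLM's `/health` endpoint, then issue one tiny completion before the node is registered with the load balancer.

```python
# Sketch: block until the server is healthy, then send one short completion
# so the first real user never pays the remaining first-request cost.
import time
import requests

BASE = "http://localhost:8000"

def prewarm(timeout_s: int = 120) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
                break  # server is up and model weights are loaded
        except requests.RequestException:
            time.sleep(1)
    requests.post(
        f"{BASE}/v1/completions",
        json={"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
              "prompt": "warm-up", "max_tokens": 8},
        timeout=60,
    )

if __name__ == "__main__":
    prewarm()
    # Only now register this node with the load balancer.
```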
Verdict: when to pick Mixtral on a 4090
Pick Mixtral 8x7B on a 4090 when you have an existing investment in Mixtral fine-tunes, when your traffic is genuinely multilingual and you have measured Qwen lagging, or when a low-concurrency interactive workload benefits from its quick batch-1 decode (85 t/s is comfortable for streaming). Skip it for new builds: pick Qwen 14B for general SaaS, Llama 3 8B FP8 for raw throughput, or Llama 70B INT4 when answer fidelity wins over latency. The full deployment recipe lives in our vLLM setup and AWQ quantisation guide.
Mixtral 8x7B on a single 4090
85 t/s at batch one, 480 t/s aggregate, 14 GB AWQ weights. UK dedicated hosting, predictable monthly billing, no token-metered API surprises.
Order the RTX 4090 24GB
See also: Mixtral 8x7B model guide, Qwen 14B benchmark, Llama 70B INT4, AWQ guide, vLLM setup, Mixtral vs Llama 70B, 4090 spec breakdown.