
RTX 4090 24GB Mixtral Benchmark: 8x7B Fits, 8x22B Does Not

Mixtral 8x7B AWQ runs on a single RTX 4090 24GB at 85 t/s with 14GB of weights and 480 t/s aggregate at batch 16. Mixtral 8x22B does not fit even in INT4. Per-batch decode, prefill, expert routing, and full cross-card comparison.

Mixtral 8x7B is the canonical sparse Mixture-of-Experts model: 46.7B total parameters, but only two of eight experts activate per token, so the per-token compute and bandwidth profile is closer to a 12.9B dense model. That mismatch between storage cost and compute cost is exactly what a 24GB consumer card has trouble with — you need the VRAM to hold all eight experts, but the compute budget is set by just two of them. On an RTX 4090 24GB from Gigagpu, the AWQ INT4 build occupies 14.0 GB of weights and decodes at 85 tokens per second at batch one. Mixtral 8x22B (141B total) does not fit on a single 24GB card under any current quantisation; AWQ alone is roughly 70 GB.

Contents

Does it fit? Quantisation footprint

Mixtral 8x7B holds 46.7 billion parameters across eight feed-forward expert blocks plus shared attention. At FP16, weights alone are 87 GB, well outside any 24GB envelope. The story changes dramatically with INT4 weight-only quantisation, which compresses each expert independently without needing to retrain the router. AWQ Marlin kernels keep dequantisation fused into the matmul, so the INT4 path gives up very little decode latency compared to running the weights at higher precision.

| Model | Format | Weights on disk | Resident on 4090 | Fits 24GB? |
|---|---|---|---|---|
| Mixtral 8x7B | FP16 | 87.0 GB | — | No |
| Mixtral 8x7B | FP8 | 43.5 GB | — | No |
| Mixtral 8x7B | AWQ INT4 | 14.0 GB | 16.5 GB w/ KV | Yes |
| Mixtral 8x7B | GGUF Q4_K_M | 15.6 GB | 18.0 GB w/ KV | Yes (tight) |
| Mixtral 8x7B | GGUF Q5_K_M | 19.4 GB | 21.5 GB w/ KV | Yes, no batching headroom |
| Mixtral 8x22B | FP16 | 282 GB | — | No |
| Mixtral 8x22B | AWQ INT4 | ~70 GB | — | No |
| Mixtral 8x22B | GGUF Q2_K | ~52 GB | — | No |

The practical message: AWQ Marlin is the only sensible production path on a single 4090. GGUF works for hobby use through llama.cpp but does not benefit from the same fused kernels and tops out around 72 t/s decode. For 8x22B you will need at least two 4090s with tensor parallelism, or step up to an H100 80GB; see the 4090 vs 5090 comparison for the case where 32GB starts mattering.
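The fit question in the table reduces to weights plus KV cache plus runtime overhead against the 24GB envelope. A minimal rule-of-thumb sketch — the ~1 GB figure for CUDA context and activations is an assumption for illustration, not a measured value:

```python
def fits_24gb(weights_gb, kv_gb, overhead_gb=1.0, vram_gb=24.0):
    """Rule-of-thumb VRAM fit check: weights + KV cache + runtime overhead."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb

print(fits_24gb(14.0, 2.5))   # AWQ INT4 with modest KV: fits
print(fits_24gb(43.5, 2.5))   # FP8 weights alone already blow the budget
```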

Methodology and test rig

All numbers below are measured on a stock RTX 4090 24GB Founders Edition at 450W TDP, no overclock, with PCIe 4.0 x16 link. Host platform is a Ryzen 9 7950X with 64 GB DDR5-5600 and a Samsung 990 Pro 2TB NVMe; OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Inference stack is vLLM 0.6.4 with PyTorch 2.5 and FlashAttention 2.6 against the AWQ Marlin kernel path.

Each measurement runs for 60 seconds of warm traffic after a 30-second warm-up, with prompts drawn from a logged production conversation distribution (median 380 input tokens, p99 2,100, mean output 220 tokens). Latency percentiles are reported per-stream, throughput is aggregated across all in-flight sequences. Power draw was sampled via nvidia-smi dmon at 1 Hz; thermals stayed under 78 °C throughout.
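For reference, the percentile figures quoted throughout can be reproduced from raw per-request latency samples with a nearest-rank estimator. This is a sketch of one common definition, not necessarily the exact estimator the harness uses:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least
    p% of all samples are <= it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

ttft_ms = [110, 120, 135, 150, 185, 90, 105, 140, 130, 115]
print(percentile(ttft_ms, 50), percentile(ttft_ms, 99))  # 120 185
```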

```shell
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 8 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.93
```

FP8 KV cache halves the per-token KV footprint relative to FP16 with a measured perplexity delta under 0.02. The chunked-prefill flag breaks long prompts into 512-token chunks so they interleave with decode steps, dramatically improving p99 TTFT under mixed traffic. Without it, a 4,000-token prompt stalls every in-flight decode stream for over a second while its prefill runs.

Decode throughput per batch

Mixtral’s bandwidth-bound decode is set by the size of the two active experts plus the shared attention layers: roughly 6.5 GB of active INT4 weights read per token, which against 1008 GB/s of memory bandwidth puts the theoretical floor near 6.4 ms per token. The observed 85 t/s (11.8 ms per token) lands within a factor of two of that floor once attention, KV-cache reads, and kernel launch overheads are paid. As batch grows, the same expert weights are reused across multiple tokens, so aggregate throughput climbs steeply until KV cache pressure caps it.
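The back-of-envelope is worth writing down explicitly. A sketch, assuming ~12.9B active parameters per token at 4 bits each; attention compute and KV reads are ignored, so this is an optimistic ceiling:

```python
active_params = 12.9e9       # two experts + shared attention, per token
bytes_per_param = 0.5        # INT4 weight-only quantisation
bandwidth_bps = 1008e9       # RTX 4090 memory bandwidth, bytes/s

floor_s = active_params * bytes_per_param / bandwidth_bps
ceiling_tps = 1 / floor_s
print(f"{floor_s * 1000:.1f} ms/token floor, {ceiling_tps:.0f} t/s ceiling")
# ~6.4 ms/token and ~156 t/s; the measured 85 t/s sits at roughly half
# the ceiling once attention, KV reads, and kernel overheads are paid.
```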

| Batch | Per-stream t/s | Aggregate t/s | Speed-up vs b=1 | VRAM |
|---|---|---|---|---|
| 1 | 85 | 85 | 1.0x | 16.5 GB |
| 2 | 72 | 144 | 1.7x | 17.5 GB |
| 4 | 70 | 280 | 3.3x | 19.0 GB |
| 8 | 52 | 420 | 4.9x | 21.8 GB |
| 16 | 30 | 480 | 5.6x (memory cap) | 23.4 GB |
| 32 | OOM | — | — | KV pressure |

Aggregate throughput peaks at 480 t/s around batch 16 — beyond that, FP8 KV cache for thirty-two 4k-context conversations exceeds the remaining 6 GB of headroom. If you need batch 32 you must drop max_model_len to 8k or migrate to the 5090’s 32GB. For comparison, the same Mixtral build on llama.cpp with GGUF Q4_K_M tops out around 230 t/s aggregate at batch 8.
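The batch-32 OOM can be reproduced from Mixtral's published shape. A sketch, assuming 32 transformer layers, 8 KV heads of dimension 128, and 1 byte per value for FP8:

```python
def kv_cache_gib(seqs, ctx_tokens, layers=32, kv_heads=8, head_dim=128,
                 bytes_per_val=1):
    """KV cache size: K and V tensors for every layer, KV head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # 64 KiB at FP8
    return seqs * ctx_tokens * per_token / 2**30

print(kv_cache_gib(16, 4096))  # 4.0 GiB -- fits the ~6 GB of headroom
print(kv_cache_gib(32, 4096))  # 8.0 GiB -- exceeds it, hence OOM at batch 32
```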

Prefill, TTFT and p50/p99 latency

Prefill on a 4090 is compute-bound, not memory-bound, because every token sees both attention QKV projections and the routed FFN. Mixtral processes prefill at roughly 3,200 input tokens per second on this hardware. That is the number that sets your TTFT for chat-style traffic.

| Prompt length | p50 TTFT | p99 TTFT | p50 e2e (200 tok out) | p99 e2e |
|---|---|---|---|---|
| 256 tokens | 110 ms | 185 ms | 2.5 s | 3.1 s |
| 1,024 tokens | 340 ms | 520 ms | 2.7 s | 3.4 s |
| 2,048 tokens | 670 ms | 980 ms | 3.1 s | 4.0 s |
| 4,096 tokens (chunked) | 1.31 s | 1.84 s | 3.7 s | 4.9 s |
| 8,192 tokens (chunked) | 2.59 s | 3.41 s | 5.0 s | 6.7 s |
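For capacity planning, p50 TTFT is well approximated by dividing prompt length by the measured prefill rate; a first-order sketch that ignores queueing and scheduler overhead:

```python
PREFILL_TPS = 3200  # measured input tokens/s on this rig

def ttft_ms(prompt_tokens):
    """First-order TTFT estimate: prefill time only."""
    return prompt_tokens / PREFILL_TPS * 1000

print(ttft_ms(1024))  # 320 ms -- close to the measured 340 ms p50
print(ttft_ms(2048))  # 640 ms vs 670 ms measured
```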

Prefix caching is the single biggest TTFT win for chatbot workloads where the system prompt is stable: a 1,500-token preamble that previously cost ~470 ms of prefill drops to ~15 ms cache lookup, freeing the prefill compute for the user’s actual message.

Expert routing physics

The Mixtral router selects the top-2 experts per token via a softmax over 8 logits. Empirically, across a diverse prompt mix, the load is roughly uniform — each expert is hit on 23-27% of tokens within any 64-token window. This means two things on a 24GB card.

First, expert offloading to CPU RAM, which works for some Llama 70B FFN-offload tricks where only a small slice of weights is touched per second, does not help here. Within four decode steps almost every expert has been activated; you would be pumping 14 GB of weights over PCIe every 50 ms, around 280 GB/s, nearly ten times the ~30 GB/s a PCIe 4.0 x16 link sustains. Throughput collapses to 6-8 t/s.

Second, the per-token compute path is genuinely smaller than a dense 13B — only the attention matrices and the two selected expert FFNs participate. That is why Mixtral 8x7B decode runs faster than Llama 2 13B on the same hardware despite holding 3.5 times more weights. The trade-off is purely memory: you pay for storing the experts you do not use this token, but you might use them next token.
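The routing step itself is compact in code. A sketch in plain Python of top-2 selection with renormalised softmax weights; the gate projection that produces the logits, and all tensor shapes, are elided:

```python
import math

def top2_route(gate_logits):
    """Select the two highest-scoring experts and renormalise their
    softmax weights, as Mixtral's router does for each token."""
    top2 = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:2]
    exps = [math.exp(gate_logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

print(top2_route([0.1, 2.0, -1.0, 0.5, 1.9, 0.0, -0.5, 0.3]))
# experts 1 and 4 win, with weights of roughly 0.53 and 0.47
```

Because the two weights are renormalised over just the selected pair, the per-token FFN cost is fixed at two experts regardless of how confident the router is.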

Cross-card comparison

Mixtral 8x7B AWQ batch-1 decode across the realistic candidate stack, all using the same vLLM 0.6.4 + AWQ Marlin recipe (3090 falls back to the older AWQ kernel because Marlin needs Ada or newer):

| GPU | VRAM | Bandwidth | Mixtral 8x7B b=1 | Mixtral 8x7B b=8 agg | Notes |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | 448 GB/s | does not fit | — | Needs at least 16.5 GB for AWQ + KV |
| RTX 5080 16GB | 16 GB | 960 GB/s | does not fit | — | Same constraint |
| RTX 3090 24GB | 24 GB | 936 GB/s | 62 t/s | 290 t/s | No FP8 path, AWQ kernel only |
| RTX 4090 24GB | 24 GB | 1008 GB/s | 85 t/s | 420 t/s | Marlin + FP8 KV |
| RTX 5090 32GB | 32 GB | 1792 GB/s | 148 t/s | 730 t/s | Headroom for batch 32 |
| H100 80GB | 80 GB | 3350 GB/s | 175 t/s | 1100 t/s | Trivially serves 8x22B too |

The 4090 is the cheapest card on which Mixtral 8x7B is comfortably production-grade. The 3090 works but loses ~27% on decode and lacks FP8 KV. The 5090 doubles bandwidth and unlocks batch 32 with 8k context. For 8x22B you need an H100 or two 4090s with tensor parallelism — see Llama 70B INT4 on 4090 for the analogous fit story.

Mixtral vs Llama 70B vs Qwen 14B

The interesting question is rarely “which Mixtral?” but “should I pick Mixtral at all?” Compared to its peers on a single 4090:

| Metric | Mixtral 8x7B AWQ | Llama 3.1 70B AWQ | Qwen 2.5 14B AWQ |
|---|---|---|---|
| Resident VRAM | 16.5 GB | 21.0 GB | 9.5 GB |
| Decode b=1 | 85 t/s | 23 t/s | 135 t/s |
| Aggregate b=8 | 420 t/s | 72 t/s | 620 t/s |
| MMLU 5-shot | 70.6 | 86.0 | 79.7 |
| HumanEval | 40.2 | 80.5 | 79.4 |
| Multilingual MGSM | 71.3 | 89.2 | 76.8 |
| Active experts / params | 2 of 8 / 12.9B | dense 70B | dense 14B |

Qwen 2.5 14B beats Mixtral 8x7B on essentially every quality dimension at a smaller VRAM footprint and higher throughput. Llama 70B INT4 wins on MMLU and code at the cost of 4x slower decode. The honest verdict is that Mixtral 8x7B’s role today is narrow: legacy fine-tunes, router-conditioned downstream tasks, or the specific case where you need the MoE inductive bias for retrieval-augmented multilingual work. Most fresh deployments should default to Qwen 14B.

Production gotchas

  • AWQ Marlin requires Ada or newer. If you mix 4090s and 3090s in a fleet, the 3090 falls back to the slower legacy AWQ kernel and your routing becomes unbalanced. Pin Mixtral traffic to Ada-class cards.
  • FP8 KV cache breaks for some sampling configs. Setting top_p below 0.05 with FP8 KV occasionally produces NaN logits on Mixtral specifically; fall back to FP16 KV (and lose ~3 t/s) for sub-1% samplers.
  • Expert imbalance under adversarial prompts. Repetitive prompts can route 70%+ of tokens through a single expert pair, dropping decode to 60 t/s. Real production traffic doesn’t trigger this, but synthetic benchmarks sometimes do.
  • 32k context is a trap. Mixtral was trained at 32k but quality above 16k drifts noticeably. Cap max_model_len to 16384 unless you have measured your domain-specific recall above that.
  • Tokeniser mismatch with HF Transformers. The Mistral-format tokeniser sometimes ships with subtly different special-token handling than the HF version; lock the tokeniser file alongside the weights.
  • Cold-start is 12 seconds. AWQ dequantisation of all eight experts on first load takes longer than dense models. Pre-warm before opening the load balancer.
  • Watch the L2 cache stat. The 4090’s 72 MB L2 helps Mixtral more than it helps dense models because attention reuses the same KV blocks; nvidia-smi nvlink is irrelevant on consumer cards but L2 hit rate on dcgmi dmon tells you whether your batching is degenerate.

Verdict: when to pick Mixtral on a 4090

Pick Mixtral 8x7B on a 4090 when you have an existing investment in Mixtral fine-tunes, when your traffic is genuinely multilingual and you have measured Qwen lagging, or when you need fast batch-1 decode for a low-concurrency interactive workload (85 t/s is comfortable for streaming). Skip it for new builds: pick Qwen 14B for general SaaS, Llama 3 8B FP8 for raw throughput, or Llama 70B INT4 when answer fidelity wins over latency. The full deployment recipe lives in our vLLM setup and AWQ quantisation guide.


See also: Mixtral 8x7B model guide, Qwen 14B benchmark, Llama 70B INT4, AWQ guide, vLLM setup, Mixtral vs Llama 70B, 4090 spec breakdown.
