Mixtral 8x7B is the canonical sparse Mixture-of-Experts model: 46.7B total parameters, but only two of eight experts activate per token, so the per-token compute and bandwidth profile is closer to a 12.9B dense model. That mismatch between storage cost and compute cost is exactly what a 24GB consumer card has trouble with — you need the VRAM to hold all eight experts, but the compute budget is set by just two of them. On an RTX 4090 24GB from Gigagpu, the AWQ INT4 build occupies 14.0 GB of weights and decodes at 85 tokens per second at batch one. Mixtral 8x22B (141B total) does not fit on a single 24GB card under any current quantisation; AWQ alone is roughly 70 GB.
Contents
- Does it fit? Quantisation footprint
- Methodology and test rig
- Decode throughput per batch
- Prefill, TTFT and p50/p99 latency
- Expert routing physics
- Cross-card comparison
- Mixtral vs Llama 70B vs Qwen 14B
- Production gotchas
- Verdict: when to pick Mixtral
Does it fit? Quantisation footprint
Mixtral 8x7B holds 46.7 billion parameters across eight feed-forward expert blocks plus shared attention. At FP16, weights alone are 87 GB — well outside any 24GB envelope. The story changes dramatically with INT4 weight-only quantisation, which compresses each expert independently without needing to retrain the router. AWQ Marlin kernels keep dequantisation fused into the matmul, so the latency penalty is small relative to FP8.
| Model | Format | Weights on disk | Resident on 4090 | Fits 24GB? |
|---|---|---|---|---|
| Mixtral 8x7B | FP16 | 87.0 GB | — | No |
| Mixtral 8x7B | FP8 | 43.5 GB | — | No |
| Mixtral 8x7B | AWQ INT4 | 14.0 GB | 16.5 GB w/ KV | Yes |
| Mixtral 8x7B | GGUF Q4_K_M | 15.6 GB | 18.0 GB w/ KV | Yes (tight) |
| Mixtral 8x7B | GGUF Q5_K_M | 19.4 GB | 21.5 GB w/ KV | Yes, no batching headroom |
| Mixtral 8x22B | FP16 | 282 GB | — | No |
| Mixtral 8x22B | AWQ INT4 | ~70 GB | — | No |
| Mixtral 8x22B | GGUF Q2_K | ~52 GB | — | No |
The practical message: AWQ Marlin is the only sensible production path on a single 4090. GGUF works for hobby use through llama.cpp but does not benefit from the same fused kernels and tops out around 72 t/s decode. For 8x22B you will need at least two 4090s with tensor parallelism, or step up to an H100 80GB; see the 4090 vs 5090 comparison for the case where 32GB starts mattering.
Methodology and test rig
All numbers below are measured on a stock RTX 4090 24GB Founders Edition at 450W TDP, no overclock, with PCIe 4.0 x16 link. Host platform is a Ryzen 9 7950X with 64 GB DDR5-5600 and a Samsung 990 Pro 2TB NVMe; OS is Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6. Inference stack is vLLM 0.6.4 with PyTorch 2.5 and FlashAttention 2.6 against the AWQ Marlin kernel path.
Each measurement runs for 60 seconds of warm traffic after a 30-second warm-up, with prompts drawn from a logged production conversation distribution (median 380 input tokens, p99 2,100, mean output 220 tokens). Latency percentiles are reported per-stream; throughput is aggregated across all in-flight sequences. Power draw was sampled via `nvidia-smi dmon` at 1 Hz; thermals stayed under 78 °C throughout.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq_marlin --kv-cache-dtype fp8 \
    --max-model-len 32768 --max-num-seqs 8 \
    --enable-chunked-prefill --enable-prefix-caching \
    --gpu-memory-utilization 0.93
```
FP8 KV cache halves the per-token KV footprint relative to FP16 with a measured perplexity delta under 0.02. The chunked-prefill flag breaks long prompts into 512-token chunks so they interleave with decode steps, dramatically improving p99 TTFT under mixed traffic. Without it, a 4,000-token prompt monopolises the forward pass for its entire prefill (over a second at 3,200 input tokens per second), stalling every in-flight decode stream for that long.
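For reference, here is a minimal per-stream probe in the spirit of the harness described above, not the exact rig used for the tables: it assumes the `openai` Python client pointed at the server started by the command above, and treats each streamed chunk as roughly one token.

```python
# Sketch: measure TTFT and decode rate for one streamed request against the
# local vLLM OpenAI-compatible endpoint. Assumes the `openai` package and that
# one streamed chunk is approximately one generated token.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def probe(prompt: str, max_tokens: int = 200) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_tps = (chunks - 1) / (end - first_token_at) if chunks > 1 else 0.0
    return ttft, decode_tps

if __name__ == "__main__":
    ttft, tps = probe("Summarise the trade-offs of INT4 weight-only quantisation.")
    print(f"TTFT {ttft * 1000:.0f} ms, decode {tps:.1f} t/s")
```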
Decode throughput per batch
Mixtral’s decode is bandwidth-bound. Streaming the full 14 GB of INT4 weights each step would cost roughly 14 GB / 1008 GB/s ≈ 14 ms per token (about 72 t/s); because each layer reads only its two routed experts plus the shared attention weights, the effective per-token traffic is lower, and the observed 85 t/s at batch 1 lands between that full-weight floor and the ideal two-expert case. As batch grows, the same expert weights are reused across multiple tokens, so aggregate throughput climbs steeply until KV cache pressure caps it.
| Batch | Per-stream t/s | Aggregate t/s | Speed-up vs b=1 | VRAM |
|---|---|---|---|---|
| 1 | 85 | 85 | 1.0x | 16.5 GB |
| 2 | 72 | 144 | 1.7x | 17.5 GB |
| 4 | 70 | 280 | 3.3x | 19.0 GB |
| 8 | 52 | 420 | 4.9x | 21.8 GB |
| 16 | 30 | 480 | 5.6x (memory cap) | 23.4 GB |
| 32 | — | OOM | — | KV pressure |
Aggregate throughput peaks at 480 t/s around batch 16 — beyond that, FP8 KV cache for thirty-two 4k-context conversations exceeds the remaining 6 GB of headroom. If you need batch 32 you must drop max_model_len to 8k or migrate to the 5090’s 32GB. For comparison, the same Mixtral build on llama.cpp with GGUF Q4_K_M tops out around 230 t/s aggregate at batch 8.
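A back-of-envelope check on that ceiling, using Mixtral 8x7B's published attention shape (32 layers, 8 KV heads, head dimension 128) and the roughly 6 GB of post-weights headroom implied by the table; treat it as a sizing sketch, not the allocator's exact accounting, since it ignores block granularity and activation scratch space.

```python
# Rough KV-cache sizing for FP8 KV at 4k context, per batch size.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # Mixtral 8x7B attention config
BYTES_PER_ELEM = 1                        # FP8 KV cache
HEADROOM_GB = 6.0                         # left over after 16.5 GB of weights

def kv_gb(batch: int, context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
    return batch * context_tokens * per_token / 1e9

for batch in (8, 16, 32):
    need = kv_gb(batch, 4096)
    verdict = "fits" if need <= HEADROOM_GB else "OOM"
    print(f"batch {batch:>2} @ 4k ctx: {need:4.1f} GB KV -> {verdict}")
# batch  8: ~2.1 GB, batch 16: ~4.3 GB, batch 32: ~8.6 GB -- matching the table.
```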
Prefill, TTFT and p50/p99 latency
Prefill on a 4090 is compute-bound, not memory-bound: the whole prompt is processed in one pass, so every weight read is amortised across hundreds of tokens while each token still runs through the attention QKV projections and its routed expert FFNs. Mixtral processes prefill at roughly 3,200 input tokens per second on this hardware, and that number is what sets your TTFT for chat-style traffic.
| Prompt length | p50 TTFT | p99 TTFT | p50 e2e (200 tok out) | p99 e2e |
|---|---|---|---|---|
| 256 tokens | 110 ms | 185 ms | 2.5 s | 3.1 s |
| 1,024 tokens | 340 ms | 520 ms | 2.7 s | 3.4 s |
| 2,048 tokens | 670 ms | 980 ms | 3.1 s | 4.0 s |
| 4,096 tokens (chunked) | 1.31 s | 1.84 s | 3.7 s | 4.9 s |
| 8,192 tokens (chunked) | 2.59 s | 3.41 s | 5.0 s | 6.7 s |
Prefix caching is the single biggest TTFT win for chatbot workloads where the system prompt is stable: a 1,500-token preamble that previously cost ~470 ms of prefill drops to ~15 ms cache lookup, freeing the prefill compute for the user’s actual message.
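A small sketch of the request-shaping side of that win: the preamble has to stay byte-identical across requests for the cached prefix to be reusable, so it lives in one constant and nothing per-request (timestamps, user IDs, session data) is templated into it. The file path and helper name here are placeholders.

```python
# Keep the stable ~1,500-token preamble as one constant so every request
# starts with the identical token sequence and hits the prefix cache.
from pathlib import Path

SYSTEM_PROMPT = Path("system_prompt.txt").read_text()  # never template per-request values into this

def build_messages(user_text: str) -> list[dict]:
    # Variable content goes after the shared prefix, so only the user turn pays prefill.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
```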
Expert routing physics
The Mixtral router selects the top-2 experts per token via a softmax over 8 logits. Empirically, across a diverse prompt mix, the load is roughly uniform — each expert is hit on 23-27% of tokens within any 64-token window. This means two things on a 24GB card.
First, expert offloading to CPU RAM, which works for some Llama 70B FFN-offload tricks where only a small slice of weights is touched per second, does not help here. Within four decode steps almost every expert has been activated; you would be PCIe-pumping the full 14 GB of expert weights roughly every 50 ms, about 280 GB/s of traffic, nearly ten times the ~30 GB/s a PCIe 4.0 x16 link sustains. Throughput collapses to 6-8 t/s.
Second, the per-token compute path is genuinely smaller than a dense 13B — only the attention matrices and the two selected expert FFNs participate. That is why Mixtral 8x7B decode runs faster than Llama 2 13B on the same hardware despite holding 3.5 times more weights. The trade-off is purely memory: you pay for storing the experts you do not use this token, but you might use them next token.
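A schematic of that top-2 gating arithmetic, matching the description above rather than any particular implementation: softmax over the eight router logits, keep the two largest, renormalise their weights, and mix only those two experts' outputs.

```python
# Schematic top-2 routing for a single token (illustration only).
import numpy as np

def top2_route(router_logits: np.ndarray, expert_outputs: np.ndarray) -> np.ndarray:
    """router_logits: (8,), expert_outputs: (8, hidden). Returns (hidden,)."""
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()                       # softmax over the 8 experts
    top2 = np.argsort(probs)[-2:]              # indices of the two largest gates
    gates = probs[top2] / probs[top2].sum()    # renormalise so the two gates sum to 1
    # Only these two experts' FFNs would actually be evaluated for this token.
    return gates[0] * expert_outputs[top2[0]] + gates[1] * expert_outputs[top2[1]]

# Example: 8 experts, hidden size 4096, random logits for one token.
rng = np.random.default_rng(0)
out = top2_route(rng.normal(size=8), rng.normal(size=(8, 4096)))
```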
Cross-card comparison
Mixtral 8x7B AWQ batch-1 decode across the realistic candidate stack, all using the same vLLM 0.6.4 + AWQ Marlin recipe (3090 falls back to the older AWQ kernel because Marlin needs Ada or newer):
| GPU | VRAM | Bandwidth | Mixtral 8x7B b=1 | Mixtral 8x7B b=8 agg | Notes |
|---|---|---|---|---|---|
| RTX 5060 Ti 16GB | 16 | 448 GB/s | does not fit | — | Need at least 16.5 GB for AWQ + KV |
| RTX 5080 16GB | 16 | 960 GB/s | does not fit | — | Same constraint |
| RTX 3090 24GB | 24 | 936 GB/s | 62 t/s | 290 t/s | No FP8 path, AWQ kernel only |
| RTX 4090 24GB | 24 | 1008 GB/s | 85 t/s | 420 t/s | Marlin + FP8 KV |
| RTX 5090 32GB | 32 | 1792 GB/s | 148 t/s | 730 t/s | Headroom for batch 32 |
| H100 80GB | 80 | 3350 GB/s | 175 t/s | 1100 t/s | Trivially serves 8x22B too |
The 4090 is the cheapest card on which Mixtral 8x7B is comfortably production-grade. The 3090 works but loses ~27% on decode and lacks FP8 KV. The 5090 doubles bandwidth and unlocks batch 32 with 8k context. For 8x22B you need an H100 or two 4090s with tensor parallelism — see Llama 70B INT4 on 4090 for the analogous fit story.
Mixtral vs Llama 70B vs Qwen 14B
The interesting question is rarely “which Mixtral?” but “should I pick Mixtral at all?” Compared to its peers on a single 4090:
| Metric | Mixtral 8x7B AWQ | Llama 3.1 70B AWQ | Qwen 2.5 14B AWQ |
|---|---|---|---|
| Resident VRAM | 16.5 GB | 21.0 GB | 9.5 GB |
| Decode b=1 | 85 t/s | 23 t/s | 135 t/s |
| Aggregate b=8 | 420 t/s | 72 t/s | 620 t/s |
| MMLU 5-shot | 70.6 | 86.0 | 79.7 |
| HumanEval | 40.2 | 80.5 | 79.4 |
| Multilingual MGSM | 71.3 | 89.2 | 76.8 |
| Active experts / params | 2 of 8 / 12.9B | dense 70B | dense 14B |
Qwen 2.5 14B beats Mixtral 8x7B on essentially every quality dimension at a smaller VRAM footprint and higher throughput. Llama 70B INT4 wins on MMLU and code at the cost of 4x slower decode. The honest verdict is that Mixtral 8x7B’s role today is narrow: legacy fine-tunes, router-conditioned downstream tasks, or the specific case where you need the MoE inductive bias for retrieval-augmented multilingual work. Most fresh deployments should default to Qwen 14B.
Production gotchas
- AWQ Marlin requires Ada or newer. If you mix 4090s and 3090s in a fleet, the 3090 falls back to the slower legacy AWQ kernel and your routing becomes unbalanced. Pin Mixtral traffic to Ada-class cards.
- FP8 KV cache breaks for some sampling configs. Setting `top_p` below 0.05 with FP8 KV occasionally produces NaN logits on Mixtral specifically; fall back to FP16 KV (and lose ~3 t/s) for sub-1% samplers.
- Expert imbalance under adversarial prompts. Repetitive prompts can route 70%+ of tokens through a single expert pair, dropping decode to 60 t/s. Real production traffic doesn’t trigger this, but synthetic benchmarks sometimes do.
- 32k context is a trap. Mixtral was trained at 32k but quality above 16k drifts noticeably. Cap `max_model_len` to 16384 unless you have measured your domain-specific recall above that.
- Tokeniser mismatch with HF Transformers. The Mistral-format tokeniser sometimes ships with subtly different special-token handling than the HF version; lock the tokeniser file alongside the weights.
- Cold-start is 12 seconds. AWQ dequantisation of all eight experts on first load takes longer than for dense models. Pre-warm before opening the load balancer (see the sketch after this list).
- Watch the L2 cache stat. The 4090’s 72 MB L2 helps Mixtral more than it helps dense models because attention reuses the same KV blocks; `nvidia-smi nvlink` is irrelevant on consumer cards, but the L2 hit rate on `dcgmi dmon` tells you whether your batching is degenerate.
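A minimal pre-warm sketch for the cold-start point above, assuming the `requests` library and the same localhost server as earlier: wait for vLLM's `/health` endpoint, then issue one tiny completion before the node is registered with the load balancer.

```python
# Sketch: block until the server is healthy, then send one short completion
# so the first real user never pays the remaining first-request cost.
import time
import requests

BASE = "http://localhost:8000"

def prewarm(timeout_s: int = 120) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
                break  # server is up and model weights are loaded
        except requests.RequestException:
            time.sleep(1)
    requests.post(
        f"{BASE}/v1/completions",
        json={"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
              "prompt": "warm-up", "max_tokens": 8},
        timeout=60,
    )

if __name__ == "__main__":
    prewarm()
    # Only now register this node with the load balancer.
```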
Verdict: when to pick Mixtral on a 4090
Pick Mixtral 8x7B on a 4090 when you have an existing investment in Mixtral fine-tunes, when your traffic is genuinely multilingual and you have measured Qwen lagging, or when a low-concurrency interactive workload benefits from its quick batch-1 decode (85 t/s is comfortable for streaming). Skip it for new builds: pick Qwen 14B for general SaaS, Llama 3 8B FP8 for raw throughput, or Llama 70B INT4 when answer fidelity wins over latency. The full deployment recipe lives in our vLLM setup and AWQ quantisation guide.
Mixtral 8x7B on a single 4090
85 t/s at batch one, 480 t/s aggregate, 14 GB AWQ weights. UK dedicated hosting, predictable monthly billing, no token-metered API surprises.
Order the RTX 4090 24GB
See also: Mixtral 8x7B model guide, Qwen 14B benchmark, Llama 70B INT4, AWQ guide, vLLM setup, Mixtral vs Llama 70B, 4090 spec breakdown.