
RTX 4090 24GB for Mistral Small 3 (24B): The Best Mid-Size Dense Model on One Card

Mistral Small 3 24B fits on a single RTX 4090 24GB as AWQ INT4 with 32k context and 85 t/s decode, and beats Mixtral 8x7B on every quality benchmark at half the VRAM.

Mistral Small 3 (23.6B parameters, January 2025, Apache 2.0) is the most quality-dense open-weight model that still fits a single RTX 4090 24GB. As AWQ INT4 it lands at roughly 13.4 GB of weights, leaves the 32k native context comfortably in budget, and posts MMLU 81, HumanEval 84.8 and GSM8K 91 – eclipsing the older Mixtral 8x7B on every dimension whilst using half the VRAM. This guide, part of our UK dedicated GPU hosting series, covers the architecture choices Mistral made (56 transformer layers, 8-way GQA, 128 head_dim), the precise VRAM math that makes the fit work, throughput across batches, the seven gotchas that bite production teams in their first week, and where Small 3 sits in the decision tree against Qwen 14B, Codestral 22B, and Llama 3.1 70B INT4.


Mistral Small 3 in brief

Mistral Small 3 (full name Mistral-Small-24B-Instruct-2501) is Mistral’s January 2025 mid-tier flagship and the model the company itself markets as “the largest model that runs comfortably on a 32 GB consumer GPU at FP8” – which on a 24 GB card translates to “AWQ INT4 only, but fits cleanly with KV headroom.” It is dense, not MoE, with 23.6B parameters spread over 56 transformer layers, the Tekken tokenizer (131K vocabulary, designed for code-heavy multilingual training), and a native 32,768-token context window. The licence is Apache 2.0 – the cleanest possible terms for commercial deployment, with no acceptable-use clauses that interfere with normal SaaS or B2B work.

Where it sits in the family matters. Mistral 7B v0.3 is the entry point; Mistral Nemo 12B is the natural step up; Small 3 24B replaces the older Mistral Small 22B (Sept 2024) as the new flagship of the consumer-deployable tier; Codestral 22B is the dedicated coding sibling; and Mixtral 8x7B is now effectively legacy because Small 3 outperforms it on every quality benchmark at half the VRAM. For teams building serious production systems on the RTX 4090 24GB, Small 3 is the new default.

Architecture, GQA and KV math

Small 3 is a 56-layer transformer with 32 query heads, 8 key/value heads (a 4:1 GQA ratio), 128-dimensional heads, a SwiGLU MLP at 12,288 intermediate dim, and RMSNorm. The 4:1 GQA ratio matches Llama 3.1 8B's but is less aggressive than Qwen 2.5 7B's 7:1; in practice this means the KV cache is moderately heavier per token than Qwen but much lighter than the no-GQA Phi-3-medium baseline. The architecture choices reflect Mistral's empirical bias towards depth over width: 56 layers (vs Qwen 14B's 48 or Qwen 32B's 64) at a modest hidden dim of 5,120 – cheaper to serve per parameter than wider alternatives.

KV cache per token at FP16: 8 KV heads x 128 dim x 56 layers x 2 (K+V) x 2 bytes = 229 KB. At FP8 KV: 114.7 KB. So 32k tokens = 3.67 GB FP8 for one sequence, on top of 13.4 GB AWQ weights = 17.1 GB used, with roughly 6 GB free for activations, CUDA workspace and paged-attention block overhead. That headroom is enough for batch 4-8 short-context concurrency or one user at the full 32k.
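The arithmetic above is easy to misapply when adapting it to other models, so here it is as a small sketch (pure Python, no dependencies; note the article counts contexts in round thousands, so 32k = 32,000 tokens, and uses decimal GB):

```python
def kv_bytes_per_token(kv_heads: int, head_dim: int, layers: int,
                       bytes_per_elem: int) -> int:
    """KV cache bytes per token: one K and one V vector per layer."""
    return kv_heads * head_dim * layers * 2 * bytes_per_elem  # x2 for K+V

# Mistral Small 3: 8 KV heads, 128 head_dim, 56 layers
FP16, FP8 = 2, 1
per_tok_fp16 = kv_bytes_per_token(8, 128, 56, FP16)  # 229,376 B ~= 229 KB
per_tok_fp8  = kv_bytes_per_token(8, 128, 56, FP8)   # 114,688 B ~= 115 KB

# One sequence at the full 32k context with FP8 KV, in decimal GB
kv_32k_gb = per_tok_fp8 * 32_000 / 1e9               # ~3.67 GB

# Stack against the AWQ weights to see what a 24 GB card has left
weights_gb = 13.4
print(f"KV/token FP16: {per_tok_fp16} B, FP8: {per_tok_fp8} B")
print(f"32k FP8 KV: {kv_32k_gb:.2f} GB, weights+KV: {weights_gb + kv_32k_gb:.1f} GB")
```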

The Tekken tokenizer matters operationally. It is roughly 30% more efficient than the Llama 3 tokenizer on European languages and 15% more efficient on code, which means the same 32k context window holds materially more text. For translation pipelines and code assistants that translates directly to higher real throughput per second of wall-clock time.

VRAM accounting across precisions

| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights | 47.2 GB | 23.6 GB | 13.4 GB |
| KV @ 8k FP8, 1 seq | 0.92 GB | 0.92 GB | 0.92 GB |
| KV @ 32k FP8, 1 seq | 3.67 GB | 3.67 GB | 3.67 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM | OOM (30.3 GB) | 20.1 GB |
| Headroom under 24 GB | n/a | n/a | 3.9 GB |

The headline: FP8 weights alone (23.6 GB) leave essentially zero room for KV, so FP8 needs a 32 GB card (RTX 5090 32GB or RTX 6000 Ada 48GB). On the 4090 it is AWQ INT4 or nothing. AWQ at 13.4 GB plus 32k FP8 KV at 3.67 GB plus vLLM overhead leaves about 4 GB for paged-attention block growth and small batch concurrency.

| Concurrency budget (AWQ + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 3.67 GB | Comfortable, ~3.9 GB free |
| 4 seqs | 4k | 1.84 GB | Comfortable |
| 8 seqs | 4k | 3.68 GB | Comfortable |
| 8 seqs | 8k | 7.36 GB | Tight – cap at max-num-seqs 8 |
| 16 seqs | 2k | 3.68 GB | Works in eager mode |
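The rows above can be reproduced with a small budget checker (a sketch under the same assumptions: FP8 KV at 114,688 bytes/token, contexts in round thousands, and illustrative thresholds of roughly 4 GB for "comfortable" and 7.5 GB for "tight" inferred from the table, not from vLLM itself):

```python
KV_FP8_BYTES_PER_TOKEN = 114_688  # 8 heads x 128 dim x 56 layers x 2 (K+V)

def kv_total_gb(n_seqs: int, avg_context: int) -> float:
    """Aggregate FP8 KV cache across concurrent sequences, decimal GB."""
    return n_seqs * avg_context * KV_FP8_BYTES_PER_TOKEN / 1e9

def verdict(n_seqs: int, avg_context: int) -> str:
    """Classify a concurrency plan against the 4090's KV headroom."""
    kv = kv_total_gb(n_seqs, avg_context)
    if kv <= 4.0:
        return "comfortable"
    if kv <= 7.5:
        return "tight"
    return "over budget"

for n, ctx in [(1, 32_000), (8, 4_000), (8, 8_000)]:
    print(f"{n} seqs @ {ctx}: {kv_total_gb(n, ctx):.2f} GB -> {verdict(n, ctx)}")
```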

Throughput, latency and batch behaviour

Single 4090 24GB, vLLM 0.6.3, AWQ Marlin kernels, FP8 KV cache, 555.x driver, CUDA 12.4:

| Workload | Aggregate t/s | p95 TTFT | p95 inter-token |
|---|---|---|---|
| 1 user, 1k prompt + 256 out | 85 | 320 ms | 12 ms |
| 1 user, 4k prompt + 512 out | 82 | 610 ms | 13 ms |
| 1 user, 16k prompt + 256 out | 76 | 1.6 s | 14 ms |
| Batch 4 active, 4k prompts | 240 | 880 ms | 17 ms |
| Batch 8 active, 4k prompts | 410 | 1.4 s | 20 ms |
| Prefill, 8k prompt, batch 1 | 2,050 t/s prefill | n/a | n/a |

For an apples-to-apples anchor: Llama 3.1 8B FP8 lands at 195 t/s batch 1 on the same hardware – roughly 2.3x Small 3's 85 t/s decode, which is the expected scaling for a model 3x larger. At batch 8, Small 3's aggregate 410 t/s is a workable production envelope for a small SaaS; for a 12-engineer coding team, per-stream rates (roughly 50 t/s at batch 8, 80+ at low concurrency) stay well above reading speed, and time-to-first-token remains under a second on 4k prompts up to batch 4.

Quality benchmarks vs alternatives

| Benchmark | Mistral Small 3 24B | Mixtral 8x7B | Qwen 2.5 14B | Llama 3.1 70B (ref) |
|---|---|---|---|---|
| MMLU | 81.0 | 70.6 | 79.7 | 83.6 |
| HumanEval | 84.8 | 40.2 | 86.6 | 80.5 |
| GSM8K | 91.0 | 61.1 | 90.2 | 95.1 |
| MATH | 54.0 | 28.4 | 55.6 | 68.0 |
| MT-Bench | 8.6 | 8.3 | 8.4 | 8.95 |
| VRAM AWQ INT4 | 13.4 GB | 26 GB | 9 GB | 40 GB |
| Fits single 4090? | Yes | No | Yes | No (needs 2x) |
| Single-card t/s (INT4) | 85 | n/a | 135 | n/a |

Three things stand out. First, Small 3 is meaningfully better than Mixtral 8x7B on every quality benchmark and runs on half the VRAM – Mixtral 8x7B is now effectively obsolete for new deployments. Second, Qwen 2.5 14B narrowly beats Small 3 on HumanEval (86.6 vs 84.8) and is broadly comparable on MMLU and GSM8K, while running 60% faster – so for pure quality-per-dollar Qwen 14B is often the better pick, and Small 3 wins when you specifically want Mistral’s writing style or European-language strength. Third, Llama 3.1 70B INT4 is the next quality step up but needs two 4090s in tensor-parallel.

vLLM deployment and code

The recommended AWQ INT4 launch for single-tenant 32k:

pip install "vllm>=0.6.3" autoawq

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-24B-Instruct-2501-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching \
  --port 8000

Flag rationale: --quantization awq_marlin selects the Marlin INT4 GEMM kernels (SM 8.0+ required, the 4090 is SM 8.9). --kv-cache-dtype fp8 halves KV memory at negligible quality cost – vital because at 32k the KV is 3.67 GB. --gpu-memory-utilization 0.94 pushes vLLM to use the full envelope; default 0.90 leaves 1 GB unused which constrains batch concurrency. --enable-prefix-caching reuses computed prefixes across requests with identical system prompts – critical for RAG.

For high-concurrency short-context workloads (chat with 2k average prompts), raise --max-num-seqs to 16 and you can sustain ~600 t/s aggregate with 16 active streams. The KV math is the only constraint.
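That chat profile only changes the scheduler caps relative to the launch above. A sketch (the 8192 max-model-len cap is our suggestion to protect the KV budget at 16 streams, not a hard requirement):

```shell
# High-concurrency chat profile: cap per-request context at 8k to protect
# the KV budget, and raise the scheduler to 16 active streams.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-24B-Instruct-2501-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching \
  --port 8000
```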

A simple async benchmark loop:

import asyncio, time
from openai import AsyncOpenAI

# Point at the local vLLM server; the api_key is required by the client but unused.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")

async def hit(prompt):
    # Time one completion and return its end-to-end rate in tokens/second.
    t0 = time.time()
    r = await client.chat.completions.create(
        model="mistralai/Mistral-Small-24B-Instruct-2501-AWQ",  # must match --model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256, temperature=0.0)
    dt = time.time() - t0
    return r.usage.completion_tokens / dt

async def main():
    # Fire 8 identical requests concurrently to exercise batch-8 scheduling.
    prompts = ["Summarise GQA in 200 words"] * 8
    rates = await asyncio.gather(*[hit(p) for p in prompts])
    print(f"per-stream t/s: {sum(rates)/len(rates):.1f}")
    print(f"aggregate t/s: {sum(rates):.1f}")

asyncio.run(main())

Alternatives to vLLM at this size: TGI 2.x with AWQ is within 8% of vLLM throughput; SGLang is faster on tool-heavy agentic loops thanks to RadixAttention prefix sharing. Our vLLM setup guide covers the driver and CUDA prerequisites, and the AWQ quantisation guide explains the Marlin kernel specifics.

Fine-tuning notes and LoRA economics

Mistral Small 3 fine-tunes cleanly with QLoRA at rank 16-32 on a single 4090. A 50,000-sample SFT pass at sequence length 2048, batch 1, gradient accumulation 16, takes about 18 hours – longer than the 14B class because of the 24B parameter footprint. For sequence length 4k you must drop the rank to 16. Full fine-tuning is firmly out of reach on a single 4090: gradients plus AdamW optimiser states would require 95+ GB. Use FSDP across two 4090s, or rent an H100 80GB for a serious full SFT run. See our QLoRA guide and the LoRA fine-tuning analysis.
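To see why rank drives the fine-tuning budget, here is a back-of-envelope trainable-parameter count for attention-only LoRA on this architecture (a sketch: it assumes adapters on the q/k/v/o projections only, a common QLoRA default, not necessarily the exact recipe behind the timings above):

```python
# Mistral Small 3 attention shapes: hidden 5120, 32 Q heads / 8 KV heads x 128 dim
HIDDEN, Q_DIM, KV_DIM, LAYERS = 5120, 32 * 128, 8 * 128, 56

def lora_params(rank: int) -> int:
    """Trainable params for rank-r adapters on q/k/v/o projections.
    Each adapted d_in x d_out matrix adds rank * (d_in + d_out) parameters."""
    per_layer = (
        rank * (HIDDEN + Q_DIM)        # q_proj: 5120 -> 4096
        + 2 * rank * (HIDDEN + KV_DIM) # k_proj and v_proj: 5120 -> 1024
        + rank * (Q_DIM + HIDDEN)      # o_proj: 4096 -> 5120
    )
    return per_layer * LAYERS

for r in (16, 32):
    print(f"rank {r}: {lora_params(r) / 1e6:.1f}M trainable params")
```

Rank 16 lands around 27.5M trainable parameters and rank 32 doubles that, which is why halving the rank buys back the activation memory needed for 4k sequences.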

Production gotchas (seven items)

  • FP8 will not fit on a 24 GB card: weights alone are 23.6 GB. Do not chase FP8 deployment on the 4090; it requires a 32 GB 5090 or 48 GB A6000. AWQ INT4 is the only single-card path on Ada Lovelace 24 GB silicon.
  • Tekken tokenizer encoding pitfalls: the Mistral Small 3 tokenizer differs from Llama 3 in how it handles trailing whitespace and special tokens. Tokenizer mismatches between training and serving cause silent quality regressions – always use AutoTokenizer.from_pretrained("mistralai/Mistral-Small-24B-Instruct-2501") end-to-end.
  • 32k context is real but tight under concurrency: at 8 concurrent 8k contexts you hit 7.4 GB of KV alone. Either drop max-num-seqs to 4 or shorten the per-tenant context budget. Watch for NoFreeBlocks errors in vLLM logs.
  • Prefix caching contention: with only 4 GB of free VRAM after weights+KV, prefix caching can evict hot system prompts under bursty traffic. Pin them by warming a sidecar request every 60 seconds, or disable prefix caching if you see eviction log spam.
  • Cold start is 35-40 seconds: AWQ weight load from NVMe is 18 seconds, eager-mode CUDA warmup another 15. Pre-warm before exposing to traffic.
  • Power and thermal headroom: at sustained batch 8 the 4090 will draw 440-450 W. In a poorly ventilated chassis it will throttle. Cap to 410 W with nvidia-smi -pl 410 for stable numbers – throughput loss is under 5%. See the power draw analysis.
  • Marlin kernel SM requirement: AWQ Marlin needs SM 8.0+. The 4090 (SM 8.9) qualifies, and so do Ampere 3060/3070 boxes (SM 8.6) – but those cards lack the VRAM for 13.4 GB of weights, so mixed fleets still need branch logic in the launch script.
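The prefix-cache warming workaround from the list above can be sketched as a tiny sidecar (assumptions: the vLLM server from earlier on localhost:8000, and a hypothetical SYSTEM_PROMPT placeholder standing in for your real hot system prompt):

```python
import json
import time
import urllib.request

SYSTEM_PROMPT = "You are a helpful assistant."  # hypothetical placeholder

def warm_payload(system_prompt: str) -> dict:
    """A minimal 1-token completion whose prefix is the hot system prompt,
    nudging vLLM to keep those prefix-cache blocks resident."""
    return {
        "model": "mistralai/Mistral-Small-24B-Instruct-2501-AWQ",
        "messages": [{"role": "system", "content": system_prompt},
                     {"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }

def warm_loop(url="http://localhost:8000/v1/chat/completions", period=60):
    """Re-touch the prefix every `period` seconds, forever."""
    while True:
        req = urllib.request.Request(
            url,
            data=json.dumps(warm_payload(SYSTEM_PROMPT)).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()  # discard the response body
        time.sleep(period)
```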

When to pick this over alternatives

Pick Mistral Small 3 over Mixtral 8x7B on every workload – it wins on every benchmark at half the VRAM and Mixtral is now legacy. Pick it over Qwen 2.5 14B when you specifically want Mistral’s writing style, European-language strength via the Tekken tokenizer, or strict Apache 2.0 licensing – otherwise Qwen 14B is faster and broadly comparable on quality. Pick it over Codestral 22B for general-purpose work; pick Codestral when 80% of your traffic is code.

Step up to Qwen 2.5 32B AWQ if MMLU 83+ is a hard requirement and you accept the 65 t/s ceiling with eager-mode constraints. Step up to Llama 3.1 70B INT4 across two 4090s when you need MATH 68 or full multilingual coverage. Step down to Mistral Nemo 12B when 50 t/s in the 12B class is enough and you need higher concurrency.

Verdict

Mistral Small 3 24B AWQ on the RTX 4090 24GB is the most capable single-card dense Mistral deployment in the 2026 landscape. 85 t/s decode, 32k native context, 8-stream concurrency on short prompts, MMLU 81/HumanEval 84.8/GSM8K 91, an Apache 2.0 licence, and a Tekken tokenizer that gives real per-token efficiency on European languages and code. The only constraints are the AWQ-only path on a 24 GB card and a tight ~4 GB headroom that disciplines you to keep concurrency moderate. For a 12-engineer team needing strong general-purpose AI without API costs, a small SaaS with European users, or a research lab that needs a dense model that fine-tunes cleanly, this is the default 4090 choice. For higher throughput pick Qwen 2.5 14B; for higher quality pick Qwen 32B AWQ, step up to a 5090 32GB, or go to two cards.

Run Mistral Small 3 on a UK RTX 4090

AWQ INT4, 32k context, 85 tokens per second. UK dedicated hosting from Manchester.

Order the RTX 4090 24GB

See also: Mistral Nemo 12B, Mistral 7B, Codestral 22B, Qwen 2.5 14B, vs OpenAI API cost, monthly hosting cost, tokens per watt.
