Mistral Small 3 (23.6B parameters, January 2025, Apache 2.0) is the most quality-dense open-weight model that still fits a single RTX 4090 24GB. As AWQ INT4 it lands at roughly 13.4 GB of weights, leaves the 32k native context comfortably in budget, and posts MMLU 81, HumanEval 84.8 and GSM8K 91 – eclipsing the older Mixtral 8x7B on every dimension whilst using half the VRAM. This guide on our UK dedicated GPU hosting covers the architecture choices Mistral made (56 transformer layers, 8-way GQA, 128 head_dim), the precise VRAM math that makes the fit work, throughput across batch sizes, the seven gotchas that bite production teams in their first week, and where Small 3 sits in the decision tree against Qwen 14B, Codestral 22B, and Llama 3.1 70B INT4.
Contents
- Mistral Small 3 in brief
- Architecture, GQA and KV math
- VRAM accounting across precisions
- Throughput, latency and batch behaviour
- Quality benchmarks vs alternatives
- vLLM deployment and code
- Fine-tuning notes and LoRA economics
- Production gotchas (seven items)
- When to pick this over alternatives
- Verdict
Mistral Small 3 in brief
Mistral Small 3 (full name Mistral-Small-24B-Instruct-2501) is Mistral’s January 2025 mid-tier flagship and the model the company itself markets as “the largest model that runs comfortably on a 32 GB consumer GPU at FP8” – which on a 24 GB card translates to “AWQ INT4 only, but fits cleanly with KV headroom.” It is dense, not MoE, with 23.6B parameters spread over 56 transformer layers, the Tekken tokenizer (131K vocabulary, designed for code-heavy multilingual training), and a native 32,768-token context window. The licence is Apache 2.0 – the cleanest possible terms for commercial deployment, with no acceptable-use clauses that interfere with normal SaaS or B2B work.
Where it sits in the family matters. Mistral 7B v0.3 is the entry point; Mistral Nemo 12B is the natural step up; Small 3 24B replaces the older Mistral Small 22B (Sept 2024) as the new flagship of the consumer-deployable tier; Codestral 22B is the dedicated coding sibling; and Mixtral 8x7B is now effectively legacy because Small 3 outperforms it on every quality benchmark at half the VRAM. For teams building serious production systems on the RTX 4090 24GB, Small 3 is the new default.
Architecture, GQA and KV math
Small 3 is a 56-layer transformer with 32 query heads, 8 key/value heads (a 4:1 GQA ratio), 128-dimensional heads, a SwiGLU MLP at 12,288 intermediate dim, and RMSNorm. The 4:1 GQA ratio matches Llama 3.1 8B’s but is less aggressive than Qwen 2.5 7B’s 7:1; in practice this means the KV cache is moderately heavier per token than Qwen’s but much lighter than the no-GQA Phi-3-medium baseline. The architecture choices reflect Mistral’s empirical bias towards depth over width: 56 layers (vs Qwen 14B’s 48 or Qwen 32B’s 64) at a modest hidden dim of 5,120 – cheaper to serve per parameter than wider alternatives.
KV cache per token at FP16: 8 KV heads x 128 dim x 56 layers x 2 (K+V) x 2 bytes = 229 KB. At FP8 KV: 114.7 KB. So 32k tokens = 3.67 GB FP8 for one sequence, on top of 13.4 GB AWQ weights = 17.1 GB used, with just under 7 GB free for activations, CUDA workspace and paged-attention block overhead. That headroom is enough for batch 4-8 short-context concurrency or one user at the full 32k.
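A few lines of Python reproduce that arithmetic, which is handy when re-budgeting for a different context length or batch size (constants are the figures quoted above; 32k is treated as 32,000 tokens, matching the numbers in this section):

KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 56

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # K and V each store KV_HEADS * HEAD_DIM values in every layer
    return KV_HEADS * HEAD_DIM * LAYERS * 2 * dtype_bytes

def kv_gb(context_tokens: int, n_seqs: int = 1, dtype_bytes: int = 1) -> float:
    # dtype_bytes: 2 for FP16 KV, 1 for FP8 KV
    return kv_bytes_per_token(dtype_bytes) * context_tokens * n_seqs / 1e9

print(f"{kv_bytes_per_token(2) / 1e3:.0f} KB/token at FP16")  # 229 KB
print(f"{kv_gb(32_000):.2f} GB for one 32k sequence at FP8")  # 3.67 GB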
The Tekken tokenizer matters operationally. It is roughly 30% more efficient than the Llama 3 tokenizer on European languages and 15% more efficient on code, which means the same 32k context window holds materially more text. For translation pipelines and code assistants, that translates directly into more useful text processed per second of wall-clock time.
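Those percentages are worth verifying on your own corpus rather than taking on trust; a quick comparison sketch (sample_corpus.txt is a placeholder for your own representative text, and the Llama repo is gated on Hugging Face):

from transformers import AutoTokenizer

sample = open("sample_corpus.txt").read()  # your own representative text

tekken = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-24B-Instruct-2501")
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # gated repo, needs HF auth

n_tekken, n_llama = len(tekken.encode(sample)), len(llama.encode(sample))
print(f"Tekken: {n_tekken} tokens, Llama 3: {n_llama} tokens, ratio {n_llama / n_tekken:.2f}x")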
VRAM accounting across precisions
| Component | FP16 | FP8 (W8A8) | AWQ INT4 |
|---|---|---|---|
| Weights | 47.2 GB | 23.6 GB | 13.4 GB |
| KV @ 8k FP8, 1 seq | 0.92 GB | 0.92 GB | 0.92 GB |
| KV @ 32k FP8, 1 seq | 3.67 GB | 3.67 GB | 3.67 GB |
| CUDA graphs + workspace | 1.6 GB | 1.6 GB | 1.6 GB |
| vLLM scheduler + activations | 1.4 GB | 1.4 GB | 1.4 GB |
| Total at batch 1, 32k | OOM (53.9 GB) | OOM (30.3 GB) | 20.1 GB |
| Headroom under 24 GB | n/a | n/a | 3.9 GB |
The headline: FP8 weights alone (23.6 GB) leave essentially zero room for KV, so FP8 needs a 32 GB card (RTX 5090 32GB or RTX 6000 Ada 48GB). On the 4090 it is AWQ INT4 or nothing. AWQ at 13.4 GB plus 32k FP8 KV at 3.67 GB plus vLLM overhead leaves about 4 GB for paged-attention block growth and small batch concurrency.
| Concurrency budget (AWQ + FP8 KV) | Avg context | KV total | Verdict |
|---|---|---|---|
| 1 seq | 32k | 3.67 GB | Comfortable, ~3.9 GB free |
| 4 seqs | 4k | 1.84 GB | Comfortable |
| 8 seqs | 4k | 3.68 GB | Comfortable |
| 8 seqs | 8k | 7.36 GB | Tight – cap at max-num-seqs 8 |
| 16 seqs | 2k | 3.68 GB | Works in eager mode |
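The verdicts above fall out of a simple budget check – a sketch assuming the fixed costs from the VRAM table (13.4 GB AWQ weights plus roughly 3 GB of CUDA graph, scheduler and activation overhead):

WEIGHTS_GB, OVERHEAD_GB, CARD_GB = 13.4, 3.0, 24.0
KV_BYTES_PER_TOKEN_FP8 = 114_688  # from the KV math above

def headroom_gb(n_seqs: int, avg_context_tokens: int) -> float:
    # VRAM left after weights, overhead and FP8 KV for n_seqs concurrent streams
    kv = KV_BYTES_PER_TOKEN_FP8 * avg_context_tokens * n_seqs / 1e9
    return CARD_GB - (WEIGHTS_GB + OVERHEAD_GB + kv)

print(f"{headroom_gb(1, 32_000):.1f} GB")  # ~3.9 - one user at full context
print(f"{headroom_gb(8, 4_000):.1f} GB")   # ~3.9 - comfortable
print(f"{headroom_gb(8, 8_000):.1f} GB")   # ~0.3 - the 'tight' row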
Throughput, latency and batch behaviour
Single 4090 24GB, vLLM 0.6.3, AWQ Marlin kernels, FP8 KV cache, 555.x driver, CUDA 12.4:
| Workload | Aggregate t/s | p95 TTFT | p95 inter-token |
|---|---|---|---|
| 1 user, 1k prompt + 256 out | 85 | 320 ms | 12 ms |
| 1 user, 4k prompt + 512 out | 82 | 610 ms | 13 ms |
| 1 user, 16k prompt + 256 out | 76 | 1.6 s | 14 ms |
| Batch 4 active, 4k prompts | 240 | 880 ms | 17 ms |
| Batch 8 active, 4k prompts | 410 | 1.4 s | 20 ms |
| Prefill, 8k prompt batch 1 | 2,050 t/s prefill | n/a | n/a |
For an apples-to-apples anchor: Llama 3.1 8B FP8 lands at 195 t/s batch 1 on the same hardware – roughly 2.3x Small 3’s 85 t/s decode, which is the expected scaling for a model 3x larger. At batch 8, Small 3’s 410 t/s aggregate is a workable production envelope for a small SaaS; for a 12-engineer coding team, the 80+ t/s per stream is well above reading speed and supports interactive chat with sub-second time-to-first-token on 4k prompts at moderate batch sizes.
Quality benchmarks vs alternatives
| Benchmark | Mistral Small 3 24B | Mixtral 8x7B | Qwen 2.5 14B | Llama 3.1 70B (ref) |
|---|---|---|---|---|
| MMLU | 81.0 | 70.6 | 79.7 | 83.6 |
| HumanEval | 84.8 | 40.2 | 86.6 | 80.5 |
| GSM8K | 91.0 | 61.1 | 90.2 | 95.1 |
| MATH | 54.0 | 28.4 | 55.6 | 68.0 |
| MT-Bench | 8.6 | 8.3 | 8.4 | 8.95 |
| VRAM AWQ INT4 | 13.4 GB | 26 GB | 9 GB | 40 GB |
| Fits single 4090? | Yes | No | Yes | No (needs 2x) |
| Single-card t/s (INT4) | 85 | n/a | 135 | n/a |
Three things stand out. First, Small 3 is meaningfully better than Mixtral 8x7B on every quality benchmark and runs on half the VRAM – Mixtral 8x7B is now effectively obsolete for new deployments. Second, Qwen 2.5 14B narrowly beats Small 3 on HumanEval (86.6 vs 84.8) and is broadly comparable on MMLU and GSM8K, while running 60% faster – so for pure quality-per-dollar Qwen 14B is often the better pick, and Small 3 wins when you specifically want Mistral’s writing style or European-language strength. Third, Llama 3.1 70B INT4 is the next quality step up but needs two 4090s in tensor-parallel.
vLLM deployment and code
The recommended AWQ INT4 launch for single-tenant 32k:
pip install "vllm>=0.6.3" autoawq
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-24B-Instruct-2501-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching \
  --port 8000
Flag rationale: --quantization awq_marlin selects the Marlin INT4 GEMM kernels (SM 8.0+ required, the 4090 is SM 8.9). --kv-cache-dtype fp8 halves KV memory at negligible quality cost – vital because at 32k the KV is 3.67 GB. --gpu-memory-utilization 0.94 pushes vLLM to use the full envelope; default 0.90 leaves 1 GB unused which constrains batch concurrency. --enable-prefix-caching reuses computed prefixes across requests with identical system prompts – critical for RAG.
For high-concurrency short-context workloads (chat with 2k average prompts), raise --max-num-seqs to 16 and you can sustain ~600 t/s aggregate with 16 active streams. The KV math is the only constraint.
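The same launch adjusted for that profile might look as follows – the lower --max-model-len here is an assumption that caps per-request KV at 8k; set it to whatever your longest chat actually needs:

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-24B-Instruct-2501-AWQ \
  --quantization awq_marlin \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.94 \
  --enable-prefix-caching \
  --port 8000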
A simple async benchmark loop:
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")

async def hit(prompt):
    t0 = time.time()
    r = await client.chat.completions.create(
        # must match the --model value passed to vLLM at launch
        model="mistralai/Mistral-Small-24B-Instruct-2501-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256, temperature=0.0)
    dt = time.time() - t0
    return r.usage.completion_tokens / dt  # decode rate for this stream

async def main():
    prompts = ["Summarise GQA in 200 words"] * 8
    rates = await asyncio.gather(*[hit(p) for p in prompts])
    print(f"per-stream t/s: {sum(rates)/len(rates):.1f}")
    print(f"aggregate t/s: {sum(rates):.1f}")

asyncio.run(main())
Alternatives to vLLM at this size: TGI 2.x with AWQ is within 8% of vLLM throughput; SGLang is faster on tool-heavy agentic loops thanks to RadixAttention prefix sharing. Our vLLM setup guide covers the driver and CUDA prerequisites, and the AWQ quantisation guide explains the Marlin kernel specifics.
Fine-tuning notes and LoRA economics
Mistral Small 3 fine-tunes cleanly with QLoRA at rank 16-32 on a single 4090. A 50,000-sample SFT pass at sequence length 2048, batch 1, gradient accumulation 16, takes about 18 hours – longer than the 14B class because of the 24B parameter footprint. For sequence length 4k you must drop the rank to 16. Full fine-tuning is firmly out of reach on a single 4090: gradients plus AdamW optimiser states would require 95+ GB. Use FSDP across two 4090s, or rent an H100 80GB for a serious full SFT run. See our QLoRA guide and the LoRA fine-tuning analysis.
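A minimal QLoRA setup consistent with those numbers, as a sketch – the target modules and hyperparameters are illustrative defaults rather than a tested recipe:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base so the 23.6B weights fit alongside LoRA optimiser state
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Instruct-2501",
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16,  # rank 16 at 4k sequence length, per the note above; 32 fits at 2k
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()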
Production gotchas (seven items)
- FP8 will not fit on a 24 GB card: weights alone are 23.6 GB. Do not chase FP8 deployment on the 4090; it requires a 32 GB 5090 or 48 GB A6000. AWQ INT4 is the only single-card path on Ada Lovelace 24 GB silicon.
- Tekken tokenizer encoding pitfalls: the Mistral Small 3 tokenizer differs from Llama 3 in how it handles trailing whitespace and special tokens. Tokenizer mismatches between training and serving cause silent quality regressions – always use AutoTokenizer.from_pretrained("mistralai/Mistral-Small-24B-Instruct-2501") end-to-end.
- 32k context is real but tight under concurrency: at 8 concurrent 8k contexts you hit 7.4 GB of KV alone. Either drop max-num-seqs to 4 or shorten the per-tenant context budget. Watch for NoFreeBlocks errors in vLLM logs.
- Prefix caching contention: with only 4 GB of free VRAM after weights+KV, prefix caching can evict hot system prompts under bursty traffic. Pin them by warming a sidecar request every 60 seconds (a sketch follows this list), or disable prefix caching if you see eviction log spam.
- Cold start is 35-40 seconds: AWQ weight load from NVMe is 18 seconds, eager-mode CUDA warmup another 15. Pre-warm before exposing to traffic.
- Power and thermal headroom: at sustained batch 8 the 4090 will draw 440-450 W. In a poorly ventilated chassis it will throttle. Cap to 410 W with nvidia-smi -pl 410 for stable numbers – throughput loss is under 5%. See the power draw analysis.
- Marlin kernel SM requirement: AWQ Marlin needs SM 8.0+. The 4090 is SM 8.9 so it works, but mixed fleets with 3060/3070 boxes need branch logic in the launch script.
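A hypothetical warming sidecar for the prefix-caching item above – the loop structure and the system_prompt.txt path are placeholders, not a prescribed mechanism:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
SYSTEM_PROMPT = open("system_prompt.txt").read()  # the prompt shared across tenants

while True:
    # minimal decode; the prefill re-touches the cached prefix blocks
    client.chat.completions.create(
        model="mistralai/Mistral-Small-24B-Instruct-2501-AWQ",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": "ping"}],
        max_tokens=1)
    time.sleep(60)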
When to pick this over alternatives
Pick Mistral Small 3 over Mixtral 8x7B on every workload – it wins on every benchmark at half the VRAM and Mixtral is now legacy. Pick it over Qwen 2.5 14B when you specifically want Mistral’s writing style, European-language strength via the Tekken tokenizer, or strict Apache 2.0 licensing – otherwise Qwen 14B is faster and broadly comparable on quality. Pick it over Codestral 22B for general-purpose work; pick Codestral when 80% of your traffic is code.
Step up to Qwen 2.5 32B AWQ if MMLU 83+ is a hard requirement and you accept the 65 t/s ceiling with eager-mode constraints. Step up to Llama 3.1 70B INT4 across two 4090s when you need MATH 68 or full multilingual coverage. Step down to Mistral Nemo 12B when 12B-class quality is enough and you want the extra speed and concurrency headroom.
Verdict
Mistral Small 3 24B AWQ on the RTX 4090 24GB is the most capable single-card dense Mistral deployment in the 2026 landscape. 85 t/s decode, 32k native context, 8-stream concurrency on short prompts, MMLU 81/HumanEval 84.8/GSM8K 91, an Apache 2.0 licence, and a Tekken tokenizer that delivers real per-token efficiency on European languages and code. The only constraints are that a 24 GB card forces AWQ INT4, and the tight ~4 GB headroom disciplines you to keep concurrency moderate. For a 12-engineer team needing strong general-purpose AI without API costs, a small SaaS with European users, or a research lab that needs a dense model that fine-tunes cleanly, this is the default 4090 choice. For higher throughput pick Qwen 2.5 14B; for higher quality pick Qwen 32B AWQ, or step up to two 4090s or a single 5090 32GB.
Run Mistral Small 3 on a UK RTX 4090
AWQ INT4, 32k context, 85 tokens per second. UK dedicated hosting from Manchester.
Order the RTX 4090 24GB
See also: Mistral Nemo 12B, Mistral 7B, Codestral 22B, Qwen 2.5 14B, vs OpenAI API cost, monthly hosting cost, tokens per watt.