Mistral Small 3 (24B parameters) sits in a productive size bracket: stronger than 7B models on reasoning, cheaper to host than 70B-class models, and small enough to fit a single 24-32 GB GPU once quantized. On our dedicated GPU hosting it is a frequent choice for teams who need quality without multi-GPU complexity.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~48 GB | 96 GB card or multi-GPU |
| FP8 | ~24 GB | 32 GB single card |
| AWQ INT4 | ~14 GB | 16 GB+ card |
| GPTQ INT4 | ~14 GB | 16 GB+ card |
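The weight figures in the table follow directly from the parameter count times bytes per parameter. A minimal sketch of that arithmetic (the INT4 rows in the table come out a little above the raw 12 GB because AWQ/GPTQ also store per-group scales and zero-points):

```python
# Rough weight-memory estimate for a 24B-parameter model.
# KV cache, activations, and quantization metadata add overhead on top.
PARAMS = 24e9

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for precision, b in bytes_per_param.items():
    gb = PARAMS * b / 1e9  # decimal GB, matching the table
    print(f"{precision}: ~{gb:.0f} GB weights")
```

This is why FP8 lands almost exactly at the 24 GB mark: on a 24 GB card there is no headroom left for KV cache, which is what pushes FP8 onto 32 GB hardware.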
GPU Options
- RTX 4060 Ti 16GB: AWQ INT4 fits, but only with a short context window
- RTX 3090 24GB: AWQ INT4 comfortable
- RTX 5090 32GB: FP8 native, best single-GPU option
- Intel Arc Pro B70 32GB: AWQ or FP8 via OpenVINO
Deployment
FP8 on an RTX 5090:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-3-24B-Instruct-FP8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
Mistral Small 3 supports a 32k context window natively, one of its selling points, so set --max-model-len accordingly rather than leaving it capped lower.
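The command above starts an OpenAI-compatible HTTP server, so any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the server is on localhost:8000 and serving the model name used above (adjust both to your deployment):

```python
# Minimal client for the vLLM OpenAI-compatible endpoint started above.
# Base URL and model name are assumptions -- match them to your server.
import json
from urllib.request import Request, urlopen

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> Request:
    """Build a /v1/chat/completions request for the served model."""
    payload = {
        "model": "mistralai/Mistral-Small-3-24B-Instruct-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# with urlopen(build_chat_request("Summarise: ...")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```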
Use Cases
Mistral Small 3 fits workloads where:
- 7B models underperform on reasoning or coding
- 70B models are overkill for cost
- 32k context matters (long documents, multi-turn chats)
- European data residency matters (Mistral is French)
Indicative throughput on an RTX 5090 with FP8: ~75 tokens/s at batch 1, rising to ~620 tokens/s aggregate at batch 16.
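Those two numbers describe the usual batching trade-off: aggregate throughput climbs roughly 8x while each individual request slows down only about in half. A quick back-of-envelope check using the figures above:

```python
# Batching trade-off from the throughput numbers above.
single = 75.0      # tokens/s, one request (batch 1)
aggregate = 620.0  # tokens/s total at batch 16
batch = 16

per_stream = aggregate / batch  # what each of the 16 requests sees
speedup = aggregate / single    # total-throughput gain from batching
print(f"per-stream: {per_stream:.1f} t/s, aggregate speedup: {speedup:.1f}x")
```

So at batch 16 each caller still gets ~39 t/s, fast enough for interactive use, while the server does ~8x the work per second.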
Mistral Small 3 on UK Dedicated
FP8 or INT4 preconfigured on the GPU class that matches your budget.
Browse GPU Servers. See Mistral Nemo 12B for the smaller variant and Codestral 22B for Mistral’s coding model.