Mistral Small 3 (24B parameters) is a strong mid-size model, but it pushes the 16 GB boundary. On our hosting, the RTX 5060 Ti 16GB runs it only with aggressive quantisation. Here is whether it's viable for your use case.
Fit
| Precision | Weights | Fits in 16 GB? |
|---|---|---|
| FP16 | ~48 GB | No |
| FP8 | ~24 GB | No |
| AWQ INT4 | ~14 GB | Tight, works with FP8 KV cache |
| GPTQ INT4 | ~14 GB | Tight but works |
| Q3_K_M GGUF | ~11 GB | Comfortable with modest concurrency |
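The weight figures in the table follow from a simple bits-per-parameter calculation. A minimal sketch, using the model's 24B parameter count and approximate effective bit widths (the INT4 and Q3_K_M values include rough per-group scale overhead and are assumptions, not measured checkpoint sizes):

```python
# Back-of-envelope weight-memory estimate for a 24B-parameter model.
# Real checkpoints add embedding, norm, and quantisation-scale overhead,
# so treat these as approximations (GB here means 10^9 bytes).

PARAMS = 24e9  # Mistral Small 3 parameter count

def weight_gb(bits_per_param: float) -> float:
    """Weight memory in GB at a given effective bit width."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8),
                   ("INT4 (AWQ/GPTQ)", 4.7),   # ~0.7 bits assumed scale overhead
                   ("Q3_K_M GGUF", 3.7)]:      # assumed effective bits
    print(f"{name:>16}: ~{weight_gb(bits):.0f} GB")
```

FP16 lands at ~48 GB and the INT4 variants at ~14 GB, matching the table; whatever is left of the 16 GB after weights goes to the KV cache and activations.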
Deployment
```shell
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Small-3-24B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.93 \
  --kv-cache-dtype fp8
```
`--kv-cache-dtype fp8` halves KV cache memory, which is essential at this tight fit. `--max-model-len 8192` keeps the per-sequence KV cache manageable; pushing to 32k drops concurrency to 1.
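To see why those two flags matter, here is a rough KV-cache sizing sketch. The layer and head counts are illustrative assumptions for a Mistral-Small-class model, not values confirmed from the checkpoint:

```python
# Rough KV-cache sizing: K and V each store layers * kv_heads * head_dim
# values per token. Architecture numbers below are assumptions for a
# Mistral-Small-class model (GB here means 10^9 bytes).

LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128  # assumed architecture

def kv_gb(tokens: int, bytes_per_elem: int) -> float:
    """KV cache size in GB for one sequence of `tokens` tokens."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem * tokens / 1e9

print(f"8k ctx,  FP16 KV: {kv_gb(8192, 2):.2f} GB per sequence")
print(f"8k ctx,  FP8 KV:  {kv_gb(8192, 1):.2f} GB per sequence")
print(f"32k ctx, FP8 KV:  {kv_gb(32768, 1):.2f} GB per sequence")
```

Under these assumptions an 8k sequence costs ~1.3 GB in FP16 but ~0.7 GB in FP8, while a 32k sequence eats ~2.7 GB even in FP8; with only ~2 GB free after the ~14 GB of weights, that is why 32k context collapses concurrency to a single sequence.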
Performance
- AWQ INT4 batch 1 decode: ~38 t/s
- AWQ INT4 batch 4 aggregate: ~135 t/s
- AWQ INT4 batch 8 aggregate: ~220 t/s
- TTFT 1k prompt: ~320 ms
Concurrency
At a 30 t/s-per-user SLA:
- Comfortable: 2-3 concurrent users
- Push: 4-6 with latency degradation
- Breaks: 8+ (KV cache evictions)
Verdict
It runs, but without comfortable concurrency. If you need Mistral Small 3 in production, step up to the RTX 5090 32GB or RTX 3090 24GB. The 5060 Ti works for single-user dev/test or low-traffic internal tools.
For mid-tier workloads on the 5060 Ti, prefer Qwen 2.5 14B or Mistral Nemo 12B; both fit more comfortably and deliver similar quality on most tasks.
Sized-Right Mistral Hosting
Pick the variant that matches your card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: Mistral Small 3 full deployment guide, max model size guide.