Codestral 22B is Mistral’s purpose-built coding model. On our hosted RTX 5060 Ti 16GB it fits only with aggressive INT4 quantization. The fit is tight but viable for specific use cases.
Fit
| Precision | Weights | Fits in 16 GB? |
|---|---|---|
| FP16 | ~44 GB | No |
| FP8 | ~22 GB | No |
| AWQ INT4 | ~13 GB | Tight; 2-3 GB left for KV cache |
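The table's figures follow from simple arithmetic. A sketch, assuming ~22.2B parameters and ~4.7 effective bits per weight for AWQ INT4 (the quantization scales and zero-points add overhead beyond the nominal 4 bits; both numbers are approximations, not published specs):

```python
def weight_footprint_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters * bits per weight / 8."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# ~22.2B parameters assumed for Codestral 22B
for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ INT4", 4.7)]:
    print(f"{name}: ~{weight_footprint_gb(22.2, bits):.0f} GB")
# FP16: ~44 GB, FP8: ~22 GB, AWQ INT4: ~13 GB
```

Whatever the exact overhead, only the INT4 variant leaves any room on a 16 GB card.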
Deployment
```shell
python -m vllm.entrypoints.openai.api_server \
  --model bartowski/Codestral-22B-v0.1-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.93
```
The FP8 KV cache (`--kv-cache-dtype fp8`) halves the per-sequence cache footprint, which is essential given how little VRAM remains after the weights.
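Once running, the server exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal request-payload sketch; the prompt and sampling parameters are illustrative, and `model` must match the `--model` flag above:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "bartowski/Codestral-22B-v0.1-AWQ",
    "messages": [
        {"role": "user",
         "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,  # low temperature suits code generation
}
body = json.dumps(payload)
# POST to http://localhost:8000/v1/chat/completions, e.g.:
#   curl -s http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$BODY"
```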
Performance
- AWQ batch 1 decode: ~32 t/s
- AWQ batch 4 aggregate: ~110 t/s
- Cannot sustain batch 8+ without OOM
Concurrency caps out at 2-4 users. That is fine for small-team internal use, but not for serving an API at volume.
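The concurrency ceiling falls out of the KV cache arithmetic. A sketch, assuming a 56-layer config with 8 KV heads (GQA) and head dimension 128 for Codestral 22B (our reading of the model config, not an official figure):

```python
def kv_gib_per_seq(ctx_len, layers=56, kv_heads=8, head_dim=128,
                   bytes_per_elem=1):
    """KV cache per sequence in GiB: 2 (K and V) * layers * kv_heads
    * head_dim * bytes per element * context length.
    bytes_per_elem=1 corresponds to FP8; 2 would be FP16."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_len / 2**30

per_seq = kv_gib_per_seq(8192)  # ~0.88 GiB per full 8k-token sequence at FP8
for budget in (2.0, 3.0):       # the 2-3 GB left after the AWQ weights
    print(f"{budget} GiB budget -> {int(budget // per_seq)} full-context sequences")
# 2.0 GiB budget -> 2 full-context sequences
# 3.0 GiB budget -> 3 full-context sequences
```

At FP16 KV the per-sequence cost doubles to ~1.75 GiB, which is why batch 8 at full context has no chance of fitting.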
Alternatives
If Codestral specifically is your target (Mistral ecosystem commitment, or a fine-tune built on it), the 5060 Ti works for small-scale deployment. For production:
- Qwen Coder 14B AWQ – fits same card with more concurrency headroom, comparable code quality
- RTX 3090 24GB for Codestral at FP8
See full Codestral guide.
Right-Size Your Coding Model
Codestral on Blackwell works, but alternatives often fit better. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: monthly cost.