Running Codestral 22B at INT4 on the RTX 5060 Ti 16GB at our hosting is a tight fit. Here is whether it is economically viable at this tier.
Tight Fit
Codestral 22B AWQ INT4 uses ~13 GB of the 16 GB VRAM, leaving limited room for the KV cache: ~2-3 GB, which fits 2-4 concurrent users with short contexts. FP8 KV-cache quantization is essential to get even this.
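As a sanity check on the concurrency figure, the KV-cache budget can be sketched. The layer count, KV-head count, and head dimension below are assumptions based on Codestral 22B's typical published configuration, not values from this article; adjust them to the actual model config.

```python
# Rough KV-cache budget for Codestral 22B on a 16 GB card.
# Model-shape values are ASSUMED (typical published config) -- illustrative only.
LAYERS = 56          # assumed transformer layer count
KV_HEADS = 8         # assumed GQA key/value heads
HEAD_DIM = 128       # assumed per-head dimension
FP8_BYTES = 1        # FP8 KV cache: 1 byte per element

# Bytes per cached token = keys + values across all layers
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP8_BYTES
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

cache_budget_gb = 2.5  # ~2-3 GB left after ~13 GB of INT4 weights
total_tokens = cache_budget_gb * 1024**3 / kv_bytes_per_token
print(f"Total cacheable tokens: {total_tokens:,.0f}")

for users in (2, 4):
    print(f"{users} users -> ~{total_tokens / users:,.0f} tokens of context each")
```

Under these assumptions, four concurrent users get only a few thousand tokens of context each, which matches the "short context" caveat above.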
Throughput
- Batch 1: ~32 t/s
- Batch 4: ~110 t/s aggregate (cannot reliably go higher without OOM)
Monthly Capacity
At 50% utilization on batch 4:
- Output tokens: ~140M/month
- Input tokens (3:1): ~420M/month
- Blended: ~560M tokens/month
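The capacity figures follow from the batch-4 throughput and the 50% utilization assumption. A quick back-of-envelope check, using only numbers from this section (the article rounds slightly differently):

```python
# Monthly token capacity from sustained aggregate throughput.
AGG_TPS = 110          # batch-4 aggregate output tokens/sec (benchmark above)
UTILIZATION = 0.5      # assumed 50% average utilization
SECONDS_PER_MONTH = 30 * 24 * 3600
INPUT_RATIO = 3        # assumed 3:1 input:output token mix

output_tokens = AGG_TPS * UTILIZATION * SECONDS_PER_MONTH
input_tokens = output_tokens * INPUT_RATIO
blended = output_tokens + input_tokens

print(f"Output:  ~{output_tokens / 1e6:.0f}M tokens/month")   # ~143M
print(f"Input:   ~{input_tokens / 1e6:.0f}M tokens/month")    # ~428M
print(f"Blended: ~{blended / 1e6:.0f}M tokens/month")         # ~570M
```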
This is much lower volume than 7-14B alternatives deliver on the same card.
API
Mistral's Codestral API runs ~$1/M tokens blended (coding models are priced above chat models).
The equivalent API cost for 560M tokens is ~$560/month, close to the dedicated cost of ~£300 (~$380).
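The comparison above can be reduced to a break-even volume. All figures come from this section; the £-to-$ conversion is the article's:

```python
# API spend vs dedicated hosting at the monthly volume computed above.
BLENDED_TOKENS_M = 560     # ~560M blended tokens/month
API_PRICE_PER_M = 1.00     # ~$1/M blended, Codestral API
DEDICATED_USD = 380        # ~£300/month dedicated (~$380)

api_cost = BLENDED_TOKENS_M * API_PRICE_PER_M
saving = api_cost - DEDICATED_USD
print(f"API cost:       ${api_cost:,.0f}/month")
print(f"Dedicated cost: ${DEDICATED_USD:,.0f}/month")
print(f"Saving:         ${saving:,.0f}/month ({saving / api_cost:.0%})")

# Below this monthly token volume, the API is the cheaper option.
break_even_m = DEDICATED_USD / API_PRICE_PER_M
print(f"Break-even:     ~{break_even_m:.0f}M blended tokens/month")
```

So at full capacity the dedicated card saves roughly a third versus the API, but only if you actually sustain well over ~380M blended tokens/month; below that, the margin disappears.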
Recommendation
If Codestral is specifically your target (ecosystem fit, fine-tune base, specific licence), 5060 Ti works at low volume. For production at volume:
- Prefer Qwen Coder 14B AWQ on 5060 Ti – comparable code quality, 2-3x the concurrency, similar cost
- Or step up to RTX 3090 24GB for Codestral FP8 with real concurrency
For most teams at the 5060 Ti tier, Qwen Coder 14B is the better economic choice on the same card.
Right-Tier Coding Models
Match model size to GPU tier for the best economics. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: Codestral fit analysis, Codestral full guide.