Phi-3-mini is Microsoft’s 3.8B-parameter instruction-tuned model, with exceptional quality for its size. On the RTX 5060 Ti 16GB cards we host, it’s the fastest mainstream LLM you can serve.
Setup
- Model: microsoft/Phi-3-mini-4k-instruct
- 3.8B params, 32 layers, 32 KV heads (no GQA), 96 head dim
- Native context 4k; 128k variant also available
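For concreteness, the FP8 + FP8 KV configuration benchmarked below can be stood up with a sketch along these lines. This assumes vLLM as the serving stack (the article doesn’t name one), and exact dtype strings vary by version:

```python
from vllm import LLM, SamplingParams

# Minimal sketch, assuming vLLM as the serving stack. FP8 weights plus
# FP8 KV cache -- the configuration behind the 285 t/s decode row below.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",          # FP8 KV cache
    max_model_len=4096,            # native 4k context
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain GQA in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```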
Decode Throughput
| Precision | Weight size | Decode t/s |
|---|---|---|
| FP16 | 7.6 GB | 225 |
| FP8 | 3.8 GB | 270 |
| FP8 + FP8 KV | 3.8 GB | 285 |
| AWQ INT4 | 2.6 GB | 310 |
| GGUF Q4_K_M | 2.4 GB | 260 |
Fastest decode of any mainstream instruction model on this card – 285 t/s at FP8 is ~2.5x what Llama 3 8B manages on the same GPU.
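The decode figures above are tokens generated per second of wall-clock time. A rough harness in the same spirit, again assuming vLLM (your numbers will vary with driver and library version):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct",
          quantization="fp8", kv_cache_dtype="fp8")

# Force a fixed-length 512-token decode so the measurement is stable.
params = SamplingParams(max_tokens=512, ignore_eos=True)

start = time.perf_counter()
out = llm.generate(["Write a long story about a GPU."], params)
elapsed = time.perf_counter() - start

n_tokens = len(out[0].outputs[0].token_ids)
# Prefill time is included, but it is negligible for a short prompt.
print(f"{n_tokens / elapsed:.0f} decode t/s")
```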
Prefill
- FP8: 14,000 t/s
- AWQ INT4: 9,500 t/s
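Prefill rate maps directly onto time-to-first-token; a quick back-of-envelope check using the figures above:

```python
# TTFT estimate from prefill throughput (values from the list above).
prompt_tokens = 4096               # full native context
for name, prefill_tps in [("FP8", 14_000), ("AWQ INT4", 9_500)]:
    ttft = prompt_tokens / prefill_tps
    print(f"{name}: ~{ttft:.2f} s to first token at full 4k context")
# FP8: ~0.29 s, AWQ INT4: ~0.43 s
```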
Concurrency Scaling
| Users | Total t/s (FP8 + FP8 KV) | Per-user t/s |
|---|---|---|
| 1 | 285 | 285 |
| 4 | 820 | 205 |
| 8 | 1,250 | 156 |
| 16 | 1,650 | 103 |
| 32 | 1,900 | 59 |
| 64 | 2,000 | 31 |
Aggregate throughput reaches 2,000 t/s at batch 64 – this card sustains an enormous amount of Phi-3 traffic.
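One way to read the batch-64 row is request capacity. A hypothetical sizing calculation – the 250-token average response length is an assumption, not a measured value:

```python
# Rough capacity estimate from the batch-64 aggregate figure above.
aggregate_tps = 2_000          # measured at 64 concurrent users
avg_response_tokens = 250      # assumed average completion length
responses_per_sec = aggregate_tps / avg_response_tokens
print(f"~{responses_per_sec:.0f} responses/s, "
      f"~{responses_per_sec * 3600:,.0f} responses/hour")
# ~8 responses/s, ~28,800 responses/hour
```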
When to Use Phi-3
- Classification, extraction, routing (a small model is enough; see the sketch after this list)
- High-concurrency chatbots with short turns
- Latency-critical paths where ~300 t/s buys a snappier UX
- Edge / on-device coupling (the same weights run locally)
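For the classification/routing case, a sketch against an OpenAI-compatible endpoint – the base URL, label set, and `route` helper are illustrative placeholders:

```python
from openai import OpenAI

# Sketch assuming Phi-3-mini served behind an OpenAI-compatible API
# (e.g. vLLM's server); the base URL and labels are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def route(ticket: str) -> str:
    """Classify a support ticket into one of three illustrative labels."""
    resp = client.chat.completions.create(
        model="microsoft/Phi-3-mini-4k-instruct",
        messages=[
            {"role": "system",
             "content": "Reply with exactly one label: billing, technical, or sales."},
            {"role": "user", "content": ticket},
        ],
        max_tokens=5,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

print(route("My invoice was charged twice this month."))
```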
Skip Phi-3 for complex reasoning or long-form writing – Llama 3 8B or Qwen 2.5 14B handle those better.
Phi-3 Mini on Blackwell 16GB
285 t/s solo, 2,000 t/s aggregate. UK dedicated hosting.
Order the RTX 5060 Ti 16GB.