Phi-3-mini (3.8B parameters) is Microsoft's compact reasoning model. On the RTX 5060 Ti 16GB in our hosting, it delivers very high concurrency – this is where the card shines for high-volume workloads.
Fit
| Precision | Weights | KV Cache Room |
|---|---|---|
| FP16 / BF16 | ~8 GB | ~8 GB – huge for a small model |
| FP8 | ~4 GB | ~12 GB |
| AWQ INT4 | ~2.5 GB | ~13 GB |
Phi-3-mini leaves abundant VRAM headroom on 16 GB – the card can host 30-60+ concurrent short-context users or a single 128k-context session.
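The headroom figures above translate directly into session counts. A minimal sketch of the arithmetic, assuming the per-token KV cost implied by the document's own numbers (~8 GB of FP16 KV cache at a 128k window); exact values vary with vLLM version and model config:

```python
# Back-of-envelope session math from the Fit table above.
# Assumption: per-token FP16 KV cost derived from ~8 GB per 128k-token sequence.
KV_ROOM_GB = {"fp16": 8, "fp8": 12, "awq_int4": 13}  # from the table

kv_kb_per_token = 8.0 * 1024**2 / 128_000  # ~65.5 KB/token at FP16

def max_sessions(kv_room_gb, context_tokens, kb_per_token=kv_kb_per_token):
    """Theoretical ceiling on concurrent sequences of a given context length."""
    per_seq_gb = context_tokens * kb_per_token / 1024**2
    return int(kv_room_gb // per_seq_gb)

# Short-context chat (2k tokens per user) in FP16's ~8 GB of KV room:
print(max_sessions(KV_ROOM_GB["fp16"], 2_000))  # 64 theoretical
```

The theoretical ceiling of 64 lands above the quoted 30-60+ because scheduler and activation overhead eat into the raw KV room.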
Deployment
```shell
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3.5-mini-instruct \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
Phi-3.5-mini extends context to 128k natively. Use BF16 (not FP16) – the model was trained with BF16.
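Once the server is up, it speaks the OpenAI-compatible chat API on vLLM's default bind (`http://localhost:8000/v1`). A minimal sketch of a request payload; `build_chat_request` is a hypothetical helper for illustration, not part of vLLM:

```python
import json

# Hypothetical helper building an OpenAI-compatible chat-completions payload
# for the server started above. POST it to
# http://localhost:8000/v1/chat/completions (vLLM's default endpoint).
def build_chat_request(prompt, max_tokens=256, temperature=0.2):
    return {
        "model": "microsoft/Phi-3.5-mini-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request("Tag this ticket: 'refund not received'")
print(json.dumps(payload, indent=2))
```

Low temperature suits the classification and extraction workloads this card targets; raise it for open-ended chat.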
Performance
| Batch | Aggregate t/s |
|---|---|
| 1 | ~135 |
| 8 | ~720 |
| 16 | ~1,100 |
| 32 | ~1,400 |
| 64 | ~1,550 |
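Aggregate throughput hides the per-user experience. Dividing the table's figures by batch size gives per-stream decode speed, a quick sanity check assuming the benchmark numbers above:

```python
# Per-stream tokens/s derived from the aggregate throughput table above.
aggregate_tps = {1: 135, 8: 720, 16: 1100, 32: 1400, 64: 1550}

for batch, total in aggregate_tps.items():
    per_stream = total / batch
    print(f"batch {batch:>2}: {per_stream:6.1f} t/s per user")
```

Even at batch 64, each user still sees roughly 24 t/s – comfortably faster than reading speed – so the card can be pushed toward its concurrency ceiling without hurting interactivity.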
128k Context
Per-sequence KV cache at 128k context for Phi-3-mini: ~8 GB FP16, ~4 GB FP8. The card can host 1-2 concurrent 128k sessions alongside the base model. For heavy multi-user long-context work, step up to a card with more VRAM.
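The 1-2 session figure follows from the same numbers. A rough sketch: the raw quotient gives a theoretical ceiling, and weight, activation, and scheduler overhead pull real capacity below it:

```python
# Theoretical 128k session ceilings from the per-sequence KV numbers above.
kv_room_gb = {"fp16": 8, "fp8": 12}          # KV room, from the Fit table
kv_per_128k_seq_gb = {"fp16": 8, "fp8": 4}   # per-sequence cost, from the text

for prec in ("fp16", "fp8"):
    ceiling = kv_room_gb[prec] // kv_per_128k_seq_gb[prec]
    print(f"{prec}: up to {ceiling} theoretical 128k session(s)")
```

FP8 quotes a ceiling of 3, but in practice runtime overhead brings usable capacity down to the 1-2 range quoted above.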
Ideal Use Cases
- High-QPS classification or tagging (20k+ decisions/hour)
- Lightweight chat with many concurrent users
- Structured output extraction from documents
- Routing layer before hitting a larger model
- Intent detection and query understanding
- Content moderation decisions
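The routing-layer pattern above is simple to sketch: let Phi-3-mini answer cheap queries and escalate only when confidence is low. A hypothetical design sketch – `classify` stands in for a real call to the vLLM endpoint, and the fallback model name is an assumption:

```python
# Hypothetical router: Phi-3-mini handles the query unless confidence is low,
# in which case the request escalates to a larger model on another endpoint.
SMALL_MODEL = "microsoft/Phi-3.5-mini-instruct"       # this card
LARGE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed fallback

def classify(query):
    """Stand-in for a real call to the small model's endpoint.
    Returns (label, confidence); here a trivial keyword heuristic."""
    if "refund" in query.lower():
        return "billing", 0.95
    return "other", 0.40

def route(query, threshold=0.8):
    label, confidence = classify(query)
    if confidence >= threshold:
        return SMALL_MODEL, label          # small model answers directly
    return LARGE_MODEL, label              # escalate low-confidence queries

print(route("Where is my refund?"))   # stays on the small model
print(route("Explain my contract"))   # escalates
```

In production the confidence signal would come from the model itself (e.g. logprobs or a self-rated score) rather than a keyword heuristic.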
For workloads needing quality above Phi-3-mini, step up to Mistral 7B or Llama 3 8B on the same card.
High-Throughput Compact LLM
Phi-3-mini at massive concurrency. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Phi-3-mini benchmark, monthly cost, classification use case.