Phi-3-medium (14B) is Microsoft's mid-size reasoning model, with strong structured-output and math capability. On the RTX 5060 Ti 16GB at our dedicated hosting, it fits via AWQ INT4 or a tight FP8.
## Fit
| Precision | Weights | Fits |
|---|---|---|
| FP16 | ~28 GB | No |
| FP8 | ~14 GB | Tight |
| AWQ INT4 | ~8 GB | Comfortable |
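The figures in the table follow from simple bytes-per-parameter arithmetic. A minimal sketch (pure weight memory only; real checkpoints add embeddings, quantization scales, and runtime overhead, which is why AWQ lands closer to 8 GB than the raw 7 GB here):

```python
# Rough weight-memory estimate for a 14B-parameter model at several
# precisions. Approximate: ignores activations, KV cache, and overhead.
PARAMS_B = 14  # Phi-3-medium parameter count, in billions

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "AWQ INT4": 0.5,  # 4-bit weights; group scales add a little extra
}

def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bytes_per_param / 1e9

for name, bpp in BYTES_PER_PARAM.items():
    gb = weight_gb(PARAMS_B, bpp)
    fits = "yes" if gb <= 16 else "no"
    print(f"{name:9s} ~{gb:.0f} GB  fits in 16 GB: {fits}")
```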
## Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3-medium-4k-instruct \
  --quantization awq \
  --trust-remote-code \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92
```

Note that `--quantization awq` expects AWQ-quantized weights, so `--model` should point at an AWQ checkpoint of Phi-3-medium rather than the stock FP16 repo.
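The server exposes an OpenAI-compatible API. A minimal client sketch using only the standard library, assuming the server above is running on `localhost:8000` (vLLM's default port):

```python
# Build and send a chat completion request to the vLLM server above.
# The endpoint path and payload shape follow the OpenAI chat API,
# which vLLM's api_server implements.
import json
import urllib.request

payload = {
    "model": "microsoft/Phi-3-medium-4k-instruct",
    "messages": [{"role": "user", "content": "Solve: 17 * 23 = ?"}],
    "max_tokens": 128,
    "temperature": 0.2,
}

def chat(payload: dict, base: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# print(chat(payload))  # requires the server to be running
```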
Phi-3-medium-4k has a short native context (4k tokens). For longer prompts, use the 128k variant, which has different VRAM demands: the KV cache grows linearly with context length.
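A back-of-envelope KV-cache comparison makes the 4k-vs-128k difference concrete. The architecture numbers below are assumptions drawn from published Phi-3-medium configs (40 layers, 10 KV heads via grouped-query attention, head dim 128); check the checkpoint's `config.json` before relying on them:

```python
# KV-cache size per context length, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM = 40, 10, 128  # assumed Phi-3-medium config
BYTES = 2  # FP16

def kv_gb(tokens: int) -> float:
    # 2x for the K and V tensors in every layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens / 1e9

print(f"4k context:   ~{kv_gb(4096):.1f} GB")    # fits alongside AWQ weights
print(f"128k context: ~{kv_gb(131072):.1f} GB")  # far beyond a 16 GB card
```

Under these assumptions a full 128k context needs tens of GB of cache on its own, so the 128k variant on this card is only practical with a much smaller `--max-model-len`.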
## Performance
- AWQ batch 1 decode: ~45 t/s
- AWQ batch 8 aggregate: ~250 t/s
- TTFT (1k-token prompt): ~280 ms
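These numbers combine into a simple end-to-end latency estimate: time to first token plus decode time at the single-stream rate. A sketch using the figures above:

```python
# End-to-end reply latency from the measured numbers above.
TTFT_S = 0.28      # ~280 ms to first token (1k-token prompt)
DECODE_TPS = 45.0  # AWQ batch-1 decode rate

def reply_latency_s(output_tokens: int) -> float:
    return TTFT_S + output_tokens / DECODE_TPS

print(f"256-token reply: ~{reply_latency_s(256):.1f} s")
```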
## Strengths
Phi-3-medium is particularly strong on:
- Reasoning and math (MATH, GSM8K)
- Following complex multi-step instructions
- Structured output (JSON, schema-constrained generation)
- Code in Python
Weaker on:
- Open-ended creative writing
- Multilingual tasks
- Broad world knowledge relative to Llama 3 70B
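The structured-output strength above pairs well with vLLM's schema-constrained decoding. A request sketch, assuming a recent vLLM version (the `guided_json` field is a vLLM extension to the OpenAI request format, not part of the standard API):

```python
# Constrain Phi-3-medium's output to a JSON schema via vLLM's
# guided_json extension. POST the body to /v1/chat/completions.
import json

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "number"},
        "reasoning": {"type": "string"},
    },
    "required": ["answer"],
}

payload = {
    "model": "microsoft/Phi-3-medium-4k-instruct",
    "messages": [
        {"role": "user", "content": "What is 12 * 9? Reply as JSON."}
    ],
    "guided_json": schema,  # decoding is constrained to match the schema
    "max_tokens": 128,
}

body = json.dumps(payload)
```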
## vs Qwen 2.5 14B
For general-purpose 14B workloads, Qwen 2.5 14B is broader. For strict instruction-following and math, Phi-3-medium edges ahead. Both fit the same card via AWQ.
## Phi-3 Reasoning Hosting
Compact 14B reasoning on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: Phi-3-mini.