Microsoft’s Phi-3 family is built for exactly the hardware profile the RTX 5060 Ti 16GB offers: small-to-mid-sized dense models with strong benchmarks and compact footprints. A single Blackwell GB206 card runs Phi-3-mini at 285 tokens/s in FP8, fits Phi-3-small in BF16 (with FP8 leaving more headroom), and handles Phi-3-medium 14B in AWQ int4 with room to spare. This guide covers each variant, its VRAM envelope on a UK Gigagpu node, and where Phi-3 outperforms heavier alternatives.
Contents
- The Phi-3 family
- VRAM and precision
- Throughput numbers
- Where Phi-3 wins
- Deployment
- Choosing a variant
The Phi-3 family
Phi-3 is Microsoft’s “small language model” line: 3.8B, 7B and 14B parameter dense transformers trained on heavily curated data. Quality per parameter is unusually high – Phi-3-mini routinely beats larger 7B models on reasoning benchmarks – which makes the family ideal for latency-sensitive or high-concurrency workloads.
| Variant | Params | Context | MMLU (5-shot) | HumanEval |
|---|---|---|---|---|
| Phi-3-mini-4k | 3.8B | 4,096 | 68.8 | 59.1 |
| Phi-3-mini-128k | 3.8B | 131,072 | 68.1 | 58.5 |
| Phi-3-small-8k | 7B | 8,192 | 75.3 | 61.0 |
| Phi-3-medium-4k | 14B | 4,096 | 78.2 | 62.2 |
| Phi-3-medium-128k | 14B | 131,072 | 77.6 | 58.5 |
VRAM and precision
Blackwell’s 5th-gen tensor cores run FP8 natively, so Phi-3-mini and Phi-3-small prefer FP8 W8A8. Phi-3-medium is better served by AWQ int4 to leave room for a long context cache.
| Variant | Precision | Weights | KV (4k ctx) | Total VRAM |
|---|---|---|---|---|
| Phi-3-mini 3.8B | FP8 | 3.9 GB | 0.7 GB | 4.9 GB |
| Phi-3-mini 3.8B | BF16 | 7.7 GB | 1.4 GB | 9.4 GB |
| Phi-3-small 7B | FP8 | 7.1 GB | 1.1 GB | 8.6 GB |
| Phi-3-small 7B | BF16 | 14.2 GB | 1.1 GB | 15.5 GB |
| Phi-3-medium 14B | AWQ int4 | 8.3 GB | 3.3 GB | 12.1 GB |
| Phi-3-medium 14B | FP8 | 14.8 GB | – | OOM |
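If you want to sanity-check the KV column for other context lengths, the per-token cost is 2 (K and V) × layers × KV heads × head dim × bytes per value. A quick sketch for Phi-3-mini, using the shapes from its public config (32 layers, 32 KV heads, head dim 96 – treat these as approximate rather than gospel):

```bash
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# Phi-3-mini shapes assumed: 32 layers, 32 KV heads, head_dim 96
echo "scale=2; 2 * 32 * 32 * 96 * 1 * 4096 / 1024^3" | bc   # FP8 KV,  4k ctx -> ~0.75 GB
echo "scale=2; 2 * 32 * 32 * 96 * 2 * 4096 / 1024^3" | bc   # FP16 KV, 4k ctx -> ~1.50 GB
```

Weights add roughly one byte per parameter at FP8 and two at BF16 (hence the 3.9 GB and 7.7 GB figures above), plus a little engine overhead on top.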
Throughput numbers
Measured with vLLM 0.6 on a warm engine: 256-token prompt, 256-token output, batch size (BS) as shown. Aggregate tokens/s (agg t/s) counts output tokens across all concurrent streams; TTFT is time to first token.
| Variant | BS=1 t/s | BS=8 agg t/s | BS=32 agg t/s | TTFT |
|---|---|---|---|---|
| Phi-3-mini FP8 | 285 | 1,410 | 3,400 | 22 ms |
| Phi-3-mini BF16 | 190 | 920 | 2,100 | 28 ms |
| Phi-3-small FP8 | 148 | 880 | 1,850 | 36 ms |
| Phi-3-medium AWQ | 78 | 410 | 780 | 58 ms |
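To spot-check the single-stream figure on your own node, time one 256-token completion against the OpenAI-compatible endpoint and divide the reported completion tokens by the wall-clock time. This assumes the Phi-3-mini FP8 deployment from the Deployment section below is already running on port 8000 and that jq is installed – a rough check, not a substitute for a proper benchmark run:

```bash
# One BS=1 completion: divide the printed completion_tokens by the `real` time from `time`
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-3-mini-128k-instruct",
        "prompt": "Explain KV caching in transformer inference.",
        "max_tokens": 256,
        "temperature": 0
      }' | jq '.usage.completion_tokens'
```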
Where Phi-3 wins
Phi-3 shines in bounded, well-scoped tasks where throughput matters more than open-ended generation quality:
- Classification and routing – intent detection, content moderation, ticket triage. Phi-3-mini hits 1,400+ classifications/s with structured output (see the sketch after this list).
- Short chat turns – FAQ bots, status queries, confirmation dialogues. 0.6-0.9s end-to-end responses.
- Function calling – Phi-3-small handles OpenAI-style tool invocation reliably.
- Agent subtasks – Phi-3-mini is excellent as a cheap inner loop under a bigger orchestrator.
- Long-context summarisation – the 128k variants ingest full reports in a single pass.
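For the classification case, the pattern that gets you to four-digit classifications per second is short prompts, single-label outputs and constrained decoding. A minimal sketch using vLLM's guided_choice request extension (label set and prompt are illustrative; assumes the Phi-3-mini FP8 deployment from the next section):

```bash
# Ticket triage: force the model to answer with exactly one of three labels
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-3-mini-128k-instruct",
        "messages": [{"role": "user", "content": "Classify this ticket: \"My invoice is wrong\""}],
        "max_tokens": 5,
        "temperature": 0,
        "guided_choice": ["billing", "technical", "account"]
      }' | jq -r '.choices[0].message.content'
```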
Deployment
```bash
# Phi-3-mini 128k, FP8, high concurrency (context capped at 32k so the KV cache fits in 16 GB)
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:v0.6.3 \
  --model microsoft/Phi-3-mini-128k-instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching
```
```bash
# Phi-3-medium 14B, AWQ, quality-first
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:v0.6.3 \
  --model casperhansen/phi-3-medium-4k-instruct-awq \
  --quantization awq_marlin \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.88
```
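Either container takes a few minutes on first run to pull weights and warm up. Once it reports ready, a quick smoke test confirms the model is loaded and answering (requires jq; the model name must match the --model flag above):

```bash
# List the served model, then run one short chat completion
curl -s http://localhost:8000/v1/models | jq '.data[].id'

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "casperhansen/phi-3-medium-4k-instruct-awq",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }' | jq -r '.choices[0].message.content'
```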
Choosing a variant
- Need raw throughput for short prompts -> Phi-3-mini FP8.
- Need 100k+ context on cheap hardware -> Phi-3-mini-128k.
- Want quality close to Llama 3 8B with lower latency -> Phi-3-small FP8.
- Want Llama 3 70B-tier answers on one card -> Phi-3-medium AWQ.
Run any Phi-3 variant on a dedicated Blackwell GPU
From 285 t/s on Phi-3-mini to 14B reasoning quality on one card. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Phi-3 mini benchmark, Llama 3 8B benchmark, Qwen 14B benchmark, vLLM setup, prefix caching.