GPU Selection for Phi-3
Microsoft’s Phi-3 family includes Mini (3.8B), Small (7B), and Medium (14B) variants that punch well above their weight on reasoning benchmarks. Their compact size makes them an excellent fit for budget dedicated GPU servers. Here is the GPU mapping for Phi-3 hosting:
| Phi-3 Variant | FP16 VRAM | INT4 VRAM | Recommended GPU |
|---|---|---|---|
| Phi-3 Mini (3.8B) | ~7.6 GB | ~2.8 GB | RTX 4060 (8 GB) |
| Phi-3 Mini (3.8B) | ~7.6 GB | ~2.8 GB | RTX 3050 (8 GB, INT4 only) |
| Phi-3 Small (7B) | ~14 GB | ~5.2 GB | RTX 4060 Ti (16 GB) |
| Phi-3 Medium (14B) | ~28 GB | ~9 GB | RTX 3090 (24 GB; INT4, since FP16 needs ~28 GB) |
Phi-3 Mini at FP16 fits entirely within an RTX 4060’s 8 GB, making it one of the cheapest models to self-host with strong benchmark performance. For a direct comparison with LLaMA 3, read our Phi-3 vs LLaMA 3 8B analysis.
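The FP16 column above follows directly from parameter count: FP16 stores two bytes per parameter, so weights alone cost roughly 2 GB per billion parameters, before KV cache and CUDA context overhead. A quick sketch to reproduce the table's figures:

```bash
# FP16 weights take 2 bytes per parameter, so weight-only VRAM
# is roughly (params in billions) x 2 GB; KV cache and CUDA
# context need extra headroom on top of this.
fp16_weights_gb() { awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 2 }'; }

fp16_weights_gb 3.8   # Phi-3 Mini  → 7.6
fp16_weights_gb 14    # Phi-3 Medium → 28.0
```

The INT4 column comes out higher than a naive 0.5 bytes per parameter because quantised checkpoints also store scales and typically keep some layers at higher precision.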
Install and Serve with vLLM
```bash
# Install vLLM
pip install vllm

# Serve Phi-3 Mini
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-mini-4k-instruct \
    --dtype float16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512
    }'
```
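The endpoint returns standard OpenAI-style JSON, so the reply sits under `choices[0].message.content`. A small helper to pull it out of a response body (a sketch; `jq` would work equally well):

```bash
# Extract the assistant's reply from an OpenAI-compatible
# chat/completions response body on stdin.
extract_reply() { python3 -c '
import json, sys
print(json.load(sys.stdin)["choices"][0]["message"]["content"])
'; }

# Example with a captured response body:
echo '{"choices":[{"message":{"role":"assistant","content":"def merge(a, b): ..."}}]}' | extract_reply
```

In practice you would pipe the `curl` output above straight into `extract_reply`.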
For a comparison of serving frameworks, see our vLLM vs Ollama guide.
Quick Start with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Phi-3 Mini
ollama run phi3:mini

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
    -d '{"model": "phi3:mini", "prompt": "Explain gradient descent in simple terms."}'
```
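Note that `/api/generate` streams its answer as newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag; pass `"stream": false` for a single JSON object, or stitch the chunks back together. A sketch of the latter:

```bash
# Reassemble an Ollama NDJSON stream by concatenating the
# "response" fragment from each chunk.
collect_stream() { python3 -c '
import json, sys
print("".join(json.loads(l)["response"] for l in sys.stdin if l.strip()), end="")
'; }

# Example with two captured chunks:
printf '%s\n%s\n' \
    '{"response":"Gradient descent ","done":false}' \
    '{"response":"minimises a loss.","done":true}' | collect_stream
```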
Performance Benchmarks
Benchmarked with vLLM, 512-token input, 256-token output. TTFT is time to first token. See the tokens-per-second benchmark tool for current data.
| Model | GPU | Precision | Gen tok/s | TTFT |
|---|---|---|---|---|
| Phi-3 Mini | RTX 4060 | FP16 | 118 | 92 ms |
| Phi-3 Mini | RTX 4060 | AWQ 4-bit | 162 | 68 ms |
| Phi-3 Mini | RTX 3090 | FP16 | 135 | 78 ms |
| Phi-3 Small | RTX 4060 Ti | FP16 | 82 | 155 ms |
| Phi-3 Medium | RTX 3090 | AWQ 4-bit | 68 | 195 ms |
Phi-3 Mini at FP16 delivers 118 tok/s on the RTX 4060, faster than most 7B models. With only 3.8B parameters, fewer bytes of weights must be read per generated token, which translates directly into higher per-token speed.
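TTFT and generation speed combine into a back-of-envelope request latency: total ≈ TTFT + output tokens ÷ tok/s. Using the RTX 4060 FP16 row above:

```bash
# End-to-end latency (seconds) = TTFT + output_tokens / generation speed
request_latency_s() { awk -v tok="$1" -v tps="$2" -v ttft_ms="$3" \
    'BEGIN { printf "%.2f\n", ttft_ms / 1000 + tok / tps }'; }

# 256 output tokens at 118 tok/s with 92 ms TTFT
request_latency_s 256 118 92   # → 2.26
```

Roughly 2.3 seconds for a full 256-token answer, comfortably interactive for chat workloads.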
Optimisation Tips
- Run at FP16 on 8 GB cards. Phi-3 Mini is small enough that quantisation is unnecessary on the RTX 4060.
- Use the 128K context variant (Phi-3-mini-128k-instruct) for document analysis tasks, but keep actual context under 16K on 8 GB cards.
- Pair with Whisper for voice-to-text-to-response pipelines. Both models fit on a single RTX 4060.
- Use speculative decoding with Phi-3 Mini as a draft model for Phi-3 Medium to boost the larger model’s throughput.
- Enable continuous batching to serve 4+ concurrent users at production quality on budget GPUs.
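The sub-16K guidance above is easy to sanity-check: at FP16 the KV cache costs 2 (K and V) × layers × hidden size × 2 bytes per token. Assuming Phi-3 Mini's 32 layers and hidden size of 3072 (check the model card for the exact shape), that is about 0.4 MB per token:

```bash
# FP16 KV cache in GB: 2 (K+V) x layers x hidden_dim x 2 bytes per token.
# Assumed Phi-3 Mini shape: 32 layers, hidden size 3072.
kv_cache_gb() { awk -v tokens="$1" \
    'BEGIN { printf "%.1f\n", 2 * 32 * 3072 * 2 * tokens / 1e9 }'; }

kv_cache_gb 4096    # default 4K window → 1.6
kv_cache_gb 16384   # 16K window → 6.4
```

At 16K tokens the cache alone nears 6.4 GB, which is why the 128K variant needs quantised weights or a bigger card before its full window becomes usable.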
Estimate running costs with the cost calculator. For the cheapest GPU options, see our budget GPU for AI inference guide.
Next Steps
Phi-3 is one of the most cost-effective models for self-hosting. If you need stronger multilingual support, consider Qwen 2.5. For larger-scale deployments, see our LLaMA hosting options. Browse all models in the model guides section.
Deploy Phi-3 Now
Run Phi-3 on a dedicated GPU server starting from just an RTX 4060. Full root access and UK data centre hosting.
Browse GPU Servers