Phi VRAM Requirements Overview
Microsoft’s Phi family is designed to maximize performance per parameter, making these models extremely efficient for their size. From the 2.7B Phi-2 to the 42B Phi-3.5 MoE, they deliver impressive benchmarks while staying VRAM-friendly. This guide covers every Phi variant to help you choose the right dedicated GPU server for Phi hosting.
The Phi-3 and Phi-3.5 models introduced support for 128K context through a technique called LongRoPE, though the VRAM cost of long context is substantial. The Mini variants (3.8B) are particularly attractive for edge and resource-constrained deployments.
Complete VRAM Table (All Models)
| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Phi-2 | 2.7B | ~11 GB | ~5.4 GB | ~2.8 GB | ~1.8 GB |
| Phi-3 Mini (4K ctx) | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3 Mini (128K ctx) | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3 Small | 7B | ~28 GB | ~14 GB | ~7.5 GB | ~4.5 GB |
| Phi-3 Medium (4K ctx) | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| Phi-3 Medium (128K ctx) | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| Phi-3.5 Mini | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3.5 MoE | 42B (MoE) | ~168 GB | ~84 GB | ~42 GB | ~23 GB |
| Phi-3.5 Vision | 4.2B | ~17 GB | ~8.5 GB | ~4.5 GB | ~3 GB |
Note: The 4K and 128K context variants have identical weight sizes. The difference is in KV cache VRAM at runtime (see context length section below). Phi-3.5 MoE uses 16 experts with 2 active, similar to Mixtral’s design. For other small model options, see our Gemma VRAM requirements page.
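The weight figures in the table follow the standard bytes-per-parameter rule (4 bytes at FP32, 2 at FP16, 1 at INT8, 0.5 at INT4). A minimal sketch of that arithmetic (the function name and the absence of any runtime overhead are assumptions for illustration; real deployments need extra headroom for KV cache and activations):

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Estimate VRAM for model weights alone, in GB (no KV cache or activations)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

print(weight_vram_gb(3.8, "fp16"))   # Phi-3 Mini  -> 7.6, matching the table
print(weight_vram_gb(14, "fp32"))    # Phi-3 Medium -> 56.0, matching the table
```

This is why the 4K and 128K rows are identical: context length changes nothing about the weights, only the runtime KV cache.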
Which GPU Do You Need?
| GPU | VRAM | Best Phi Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | Phi-3.5 Mini / Phi-3 Small | FP16 / 4-bit | Edge / dev |
| RTX 4060 | 8 GB | Phi-3 Small | INT8 / 4-bit | Dev / personal |
| RTX 4060 Ti | 16 GB | Phi-3 Small / Medium | FP16 / 4-bit | Small production |
| RTX 3090 | 24 GB | Phi-3 Medium / Phi-3.5 MoE | INT8 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | Phi-3.5 MoE | INT8 | Best MoE perf |
Phi-3.5 Mini at FP16 (~7.6 GB of weights) just fits on any GPU with 8+ GB, making it one of the most accessible high-quality models available; on 8 GB cards there is little headroom for KV cache, so 4-bit is the safer choice there.
Context Length Impact on VRAM
The 128K context variants can use enormous amounts of KV cache memory:
| Context | Mini (3.8B) KV | Small (7B) KV | Medium (14B) KV |
|---|---|---|---|
| 4,096 | ~0.3 GB | ~0.5 GB | ~1 GB |
| 8,192 | ~0.6 GB | ~1 GB | ~2 GB |
| 32,768 | ~2.5 GB | ~4 GB | ~8 GB |
| 65,536 | ~5 GB | ~8 GB | ~16 GB |
| 131,072 | ~10 GB | ~16 GB | ~32 GB |
Phi-3 Mini 128K at full context length needs ~10 GB just for KV cache plus ~7.6 GB for weights = ~18 GB total at FP16. This means you need an RTX 3090 or better to actually use the full 128K window.
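KV cache grows linearly with context length: two tensors (K and V) per layer, per token, per sequence. A generic estimator of that relationship (the layer/head numbers in the example are illustrative assumptions, not Phi's actual config; real per-token cost depends on each model's layer count, KV-head count, and any grouped-query attention):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                tokens: int, batch: int = 1, dtype_bytes: int = 2) -> float:
    """KV cache size in GB: K and V tensors per layer, per token, per sequence."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * tokens * batch / 1e9

# Illustrative only: a hypothetical 32-layer model, 8 KV heads of dim 96,
# FP16 cache, at the full 128K (131,072-token) window
print(round(kv_cache_gb(32, 8, 96, 131_072), 1))
```

The linear scaling is the key takeaway: halving the context window you actually allow (e.g. via `--max-model-len`) halves the cache bill, which is why capping context is the cheapest VRAM optimization available.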
Batch Size Impact on VRAM
| Model (FP16, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Phi-3.5 Mini | ~8 GB | ~9.2 GB | ~10.5 GB | ~13 GB |
| Phi-3 Small | ~14.5 GB | ~16.5 GB | ~18.5 GB | ~22.5 GB |
| Phi-3 Medium | ~29 GB | ~33 GB | ~37 GB | ~45 GB |
Phi-3.5 Mini is exceptionally efficient for batched inference. At FP16, you can serve 16 concurrent users within 13 GB, making it perfect for production APIs on a single RTX 4060 Ti.
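Per the table, total serving memory grows roughly linearly with batch size: weights are paid once, then each concurrent request adds its own KV cache and activations. A back-of-the-envelope planner (the ~0.33 GB per-request increment is an assumption fitted to the Phi-3.5 Mini row above, not a measured constant):

```python
def serving_vram_gb(weights_gb: float, per_request_gb: float, batch: int) -> float:
    """Weights are a fixed cost; KV cache and activations scale with concurrency."""
    return weights_gb + per_request_gb * batch

# Phi-3.5 Mini FP16 at 4K context: ~7.6 GB weights, ~0.33 GB per request (assumed)
print(round(serving_vram_gb(7.6, 0.33, 16), 1))  # ~12.9 GB, in line with the ~13 GB row
```

The same fit reproduces the batch-1 row (~7.9 GB vs the table's ~8 GB), which is a quick sanity check that the linear model is a reasonable planning approximation at modest context lengths.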
Practical Deployment Recommendations
- Edge/embedded: Phi-3.5 Mini (4-bit) on any GPU with 4+ GB. 2.5 GB weight footprint leaves room on shared GPU systems.
- Personal assistant: Phi-3 Small on RTX 4060 (4-bit). 7B quality in a smaller footprint.
- Production API: Phi-3.5 Mini on RTX 4060 Ti (FP16). High throughput, 16+ concurrent users.
- High quality: Phi-3 Medium on RTX 3090 (4-bit or INT8). 14B quality competitive with larger models.
- Best Phi performance: Phi-3.5 MoE on 2x RTX 3090 (INT8). MoE architecture gives top-tier results.
For cost comparisons, see our cheapest GPU for AI inference guide and the cost per million tokens calculator.
Quick Setup Commands
Ollama
```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama run phi3:mini       # Phi-3 Mini 3.8B
ollama run phi3:medium     # Phi-3 Medium 14B
ollama run phi3.5:latest   # Phi-3.5 Mini
```
vLLM
```shell
# Phi-3.5 Mini FP16 on RTX 4060 Ti
vllm serve microsoft/Phi-3.5-mini-instruct \
  --dtype float16 --max-model-len 4096 --trust-remote-code

# Phi-3 Medium with AWQ on RTX 3090. Note: --quantization awq expects an
# AWQ-quantized checkpoint; the official microsoft repo is FP16, so point
# this at a community AWQ build of the model instead.
vllm serve microsoft/Phi-3-medium-4k-instruct \
  --quantization awq --max-model-len 4096 --trust-remote-code
```
For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare Phi models against others on our best GPU for LLM inference page and use the benchmark tool for speed comparisons. Also see our self-host LLM guide for complete setup walkthroughs.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers