
Phi VRAM Requirements (Phi-2, Phi-3, Phi-3.5)

Complete Microsoft Phi VRAM requirements for Phi-2, Phi-3, and Phi-3.5 (Mini, Small, Medium, MoE). FP32, FP16, INT8, INT4 tables plus GPU picks.

Phi VRAM Requirements Overview

Microsoft’s Phi family is designed to maximize performance per parameter, making these models extremely efficient for their size. From the 2.7B Phi-2 to the 42B Phi-3.5 MoE, they deliver impressive benchmarks while staying VRAM-friendly. This guide covers every Phi variant so you can choose the right dedicated GPU server for Phi hosting.

The Phi-3 and Phi-3.5 models introduced support for 128K context through a technique called LongRoPE, though the VRAM cost of long context is substantial. The Mini variants (3.8B) are particularly attractive for edge and resource-constrained deployments.

Complete VRAM Table (All Models)

| Model | Parameters | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| Phi-2 | 2.7B | ~11 GB | ~5.4 GB | ~2.8 GB | ~1.8 GB |
| Phi-3 Mini (4K ctx) | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3 Mini (128K ctx) | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3 Small | 7B | ~28 GB | ~14 GB | ~7.5 GB | ~4.5 GB |
| Phi-3 Medium (4K ctx) | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| Phi-3 Medium (128K ctx) | 14B | ~56 GB | ~28 GB | ~14 GB | ~9 GB |
| Phi-3.5 Mini | 3.8B | ~15 GB | ~7.6 GB | ~4 GB | ~2.5 GB |
| Phi-3.5 MoE | 42B (MoE) | ~168 GB | ~84 GB | ~42 GB | ~23 GB |
| Phi-3.5 Vision | 4.2B | ~17 GB | ~8.5 GB | ~4.5 GB | ~3 GB |

Note: The 4K and 128K context variants have identical weight sizes. The difference is in KV cache VRAM at runtime (see context length section below). Phi-3.5 MoE uses 16 experts with 2 active, similar to Mixtral’s design. For other small model options, see our Gemma VRAM requirements page.
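The weight figures above follow a simple rule of thumb: parameter count times bytes per parameter, plus a little headroom for embeddings and runtime buffers. A minimal sketch (the ~10% overhead factor is our own working assumption, not an official figure):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Estimate weight VRAM in GB: parameters x bytes per parameter, plus ~10% overhead."""
    return params_billion * bytes_per_param * overhead

# Raw weight sizes (overhead=1.0) reproduce the FP16 column above:
# Phi-3 Mini 3.8B x 2 bytes = 7.6 GB; Phi-3 Medium 14B x 2 bytes = 28 GB
for name, params in [("Phi-2", 2.7), ("Phi-3 Mini", 3.8), ("Phi-3 Medium", 14.0)]:
    for prec, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
        print(f"{name} {prec}: ~{weight_vram_gb(params, nbytes):.1f} GB")
```

Quantized footprints in practice run slightly above the raw parameter math (scales, zero points, dequant buffers), which is why the INT4 column sits a little higher than a pure 0.5 bytes/param would suggest.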

Which GPU Do You Need?

| GPU | VRAM | Best Phi Model | Precision | Use Case |
|---|---|---|---|---|
| RTX 3050 | 8 GB | Phi-3.5 Mini / Phi-3 Small | FP16 / 4-bit | Edge / dev |
| RTX 4060 | 8 GB | Phi-3 Small | INT8 / 4-bit | Dev / personal |
| RTX 4060 Ti | 16 GB | Phi-3 Small / Medium | FP16 / 4-bit | Small production |
| RTX 3090 | 24 GB | Phi-3 Medium / Phi-3.5 MoE | INT8 / 4-bit | Production |
| 2x RTX 3090 | 48 GB | Phi-3.5 MoE | INT8 | Best MoE perf |

Phi-3.5 Mini's FP16 weights (~7.6 GB) fit on any GPU with 8+ GB, making it one of the most accessible high-quality models available.

Context Length Impact on VRAM

The 128K context variants can use enormous amounts of KV cache memory:

| Context | Mini (3.8B) KV | Small (7B) KV | Medium (14B) KV |
|---|---|---|---|
| 4,096 | ~0.3 GB | ~0.5 GB | ~1 GB |
| 8,192 | ~0.6 GB | ~1 GB | ~2 GB |
| 32,768 | ~2.5 GB | ~4 GB | ~8 GB |
| 65,536 | ~5 GB | ~8 GB | ~16 GB |
| 131,072 | ~10 GB | ~16 GB | ~32 GB |

Phi-3 Mini 128K at full context length needs ~10 GB just for KV cache plus ~7.6 GB for weights = ~18 GB total at FP16. This means you need an RTX 3090 or better to actually use the full 128K window.
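These figures come from the standard transformer KV-cache formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. A sketch with illustrative Mini-class config values (32 layers, 8 KV heads under grouped-query attention, head dim 96 are our assumptions here; check the model's config.json for the real numbers):

```python
def kv_cache_gib(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 96, bytes_per_el: int = 2) -> float:
    """KV cache size in GiB: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 2**30

print(f"{kv_cache_gib(4096):.2f} GiB")    # 4K context: well under 1 GiB
print(f"{kv_cache_gib(131072):.1f} GiB")  # full 128K context: order of 10 GiB
```

With these assumed values the formula lands in the same ballpark as the table (~0.4 GiB at 4K, ~12 GiB at 128K); the exact figure scales linearly with whichever layer/head counts the model actually uses.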

Batch Size Impact on VRAM

| Model (FP16, 4K ctx) | Batch 1 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|
| Phi-3.5 Mini | ~8 GB | ~9.2 GB | ~10.5 GB | ~13 GB |
| Phi-3 Small | ~14.5 GB | ~16.5 GB | ~18.5 GB | ~22.5 GB |
| Phi-3 Medium | ~29 GB | ~33 GB | ~37 GB | ~45 GB |

Phi-3.5 Mini is exceptionally efficient for batched inference. At FP16, you can serve 16 concurrent users within 13 GB, making it perfect for production APIs on a single RTX 4060 Ti.
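Growth in the table is close to linear in batch size, so you can interpolate to size a deployment: take the batch-1 footprint as the base and add a roughly fixed per-sequence increment. A rough sketch (the ~0.33 GB/sequence figure for Phi-3.5 Mini is our linear fit of the row above, not a measured constant):

```python
def max_concurrent(vram_gb: float, base_gb: float, per_seq_gb: float) -> int:
    """How many concurrent sequences fit in a VRAM budget, assuming linear KV growth."""
    if vram_gb < base_gb:
        return 0
    return 1 + int((vram_gb - base_gb) / per_seq_gb)

# Phi-3.5 Mini FP16 on a 16 GB RTX 4060 Ti: comfortably more than 16 sequences
print(max_concurrent(16, 8.0, 0.33))
```

This is an upper bound from the table's 4K-context rows; longer prompts raise the per-sequence increment and lower the real ceiling.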

Practical Deployment Recommendations

  • Edge/embedded: Phi-3.5 Mini (4-bit) on any GPU with 4+ GB. 2.5 GB weight footprint leaves room on shared GPU systems.
  • Personal assistant: Phi-3 Small on RTX 4060 (4-bit). 7B quality in a smaller footprint.
  • Production API: Phi-3.5 Mini on RTX 4060 Ti (FP16). High throughput, 16+ concurrent users.
  • High quality: Phi-3 Medium on RTX 3090 (4-bit or INT8). 14B quality competitive with larger models.
  • Best Phi performance: Phi-3.5 MoE on 2x RTX 3090 (INT8). MoE architecture gives top-tier results.

For cost comparisons, see our cheapest GPU for AI inference guide and the cost per million tokens calculator.

Quick Setup Commands

Ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama run phi3:mini        # Phi-3 Mini 3.8B
ollama run phi3:medium      # Phi-3 Medium 14B
ollama run phi3.5:latest    # Phi-3.5 Mini
```
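Once Ollama is running it also exposes a local HTTP API on port 11434, so you can script against the same models. A minimal sketch that builds a request for the `/api/generate` endpoint (send it with any HTTP client, e.g. the `requests` package):

```python
import json

def build_generate_request(model: str, prompt: str):
    """Build URL and payload for Ollama's /api/generate endpoint (non-streaming)."""
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, payload

url, payload = build_generate_request("phi3:mini", "Explain the KV cache in one sentence.")
print(url)
print(json.dumps(payload))
# Send with e.g.: requests.post(url, json=payload).json()["response"]
```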

vLLM

```bash
# Phi-3.5 Mini FP16 on RTX 4060 Ti
vllm serve microsoft/Phi-3.5-mini-instruct \
  --dtype float16 --max-model-len 4096 --trust-remote-code

# Phi-3 Medium AWQ on RTX 3090
vllm serve microsoft/Phi-3-medium-4k-instruct \
  --quantization awq --max-model-len 4096 --trust-remote-code
```
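vLLM serves an OpenAI-compatible API (port 8000 by default), so any OpenAI client library works against it. A minimal sketch that builds a chat-completions request; the model name must match the one you passed to `vllm serve`:

```python
import json

def build_chat_request(model: str, user_msg: str, max_tokens: int = 128):
    """Build URL and payload for vLLM's OpenAI-compatible chat completions endpoint."""
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }
    return url, payload

url, payload = build_chat_request("microsoft/Phi-3.5-mini-instruct", "What is LongRoPE?")
print(json.dumps(payload, indent=2))
# Send with e.g.: requests.post(url, json=payload).json()["choices"][0]["message"]["content"]
```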

For full deployment guides, see our Ollama hosting and vLLM hosting pages. Compare Phi models against others on our best GPU for LLM inference page and use the benchmark tool for speed comparisons. Also see our self-host LLM guide for complete setup walkthroughs.

Deploy This Model Now

Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
