Before you deploy Phi-3 on a dedicated GPU server, you need to know exactly how much VRAM each variant consumes at different precisions. This guide gives you the real numbers — measured on GigaGPU dedicated servers — so you can match your model to the right hardware without guessing.
## VRAM by Variant and Precision
Each row shows the minimum VRAM needed to load the model weights. Add 10-20% headroom for KV cache, activations, and batch processing.
| Variant | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| Phi-3 Mini (3.8B) | 7.6 GB | 3.8 GB | 2.3 GB |
| Phi-3 Small (7B) | 14 GB | 7 GB | 4.5 GB |
| Phi-3 Medium (14B) | 28 GB | 14 GB | 9 GB |
| Phi-3.5 Mini (3.8B) | 7.6 GB | 3.8 GB | 2.3 GB |
| Phi-3.5 MoE (42B) | 84 GB | 42 GB | 26 GB |
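These figures follow almost directly from parameter count × bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4). A quick sketch of the arithmetic — the ~1.2× overhead factor at INT4 is an illustrative assumption to account for quantisation scales and zero-points, not an exact constant:

```python
def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Rough VRAM needed for model weights alone, in GB.

    Bytes per parameter: FP16 = 2, INT8 = 1, INT4 = 0.5.
    INT4 carries extra overhead for quantisation scales/zero-points;
    the ~1.2x factor here is an illustrative assumption.
    """
    bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}[precision]
    overhead = 1.2 if precision == "int4" else 1.0
    return params_billions * bytes_per_param * overhead

print(weight_vram_gb(3.8, "fp16"))  # 7.6 — matches the Phi-3 Mini FP16 row
```

This only covers the weights; the 10-20% headroom for KV cache, activations, and batching still comes on top.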
## Which GigaGPU Server Fits Phi-3?
Based on the VRAM table above, here’s how Phi-3 maps to our GPU lineup:
| GPU | VRAM | Verdict |
|---|---|---|
| RTX 3050 | 6 GB | Mini variants only, at INT4/INT8 |
| RTX 4060 | 8 GB | Small variants, INT4/INT8 |
| RTX 4060 Ti 16GB | 16 GB | Mid variants FP16, larger at INT4 |
| RTX 3090 | 24 GB | Most variants FP16 with headroom |
| RTX 5090 | 32 GB | All dense variants at FP16 |
| RTX 6000 Pro | 96 GB | Even the largest variants with room for batching |
## Context Length Impact
VRAM requirements scale with context length. A 32K context adds roughly 2-4 GB of KV cache on top of base weights. For 128K contexts on large variants, you may need to move up a GPU tier or use quantised KV cache. See our context length VRAM guide for details.
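The KV cache grows linearly with context: 2 tensors (K and V) × layers × KV heads × head dimension × tokens × bytes per element. A sketch with illustrative GQA-style dimensions — the default layer/head values here are assumptions chosen for the arithmetic, not Phi-3's published config:

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 96, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for a single sequence.

    bytes_per_elem=2 is FP16; a quantised KV cache (FP8/INT8) halves this.
    Defaults are illustrative GQA-style dimensions, not a published config.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(round(kv_cache_gb(32_768), 1))  # ~3.2 GB at 32K — inside the 2-4 GB band
```

At 128K the same formula gives roughly 12.9 GB per sequence, which is why long contexts on the larger variants push you up a GPU tier.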
## Deployment Recommendations
For production deployments:
- Development & prototyping: Use INT4 on the smallest GPU that fits — minimise cost while you iterate.
- Production inference: Use FP16 on a GPU with at least 20% headroom. This avoids OOM under batch load.
- High-throughput serving: Step up to a larger GPU to batch more requests simultaneously.
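Batch size is what eats the headroom: the weights are paid once, but every concurrent sequence carries its own KV cache. A rough budget check — the 0.5 GB-per-sequence default is an illustrative figure for a ~4K context at FP16, not a measurement:

```python
def max_batch(gpu_vram_gb: float, weights_gb: float,
              kv_per_seq_gb: float = 0.5) -> int:
    """Rough count of concurrent sequences that fit in free VRAM.

    Assumes a fixed KV-cache cost per sequence (0.5 GB/seq at ~4K
    context is an illustrative assumption), ignoring activations.
    """
    free = gpu_vram_gb - weights_gb
    return max(int(free // kv_per_seq_gb), 0)

print(max_batch(24, 14))  # Phi-3 Small FP16 on an RTX 3090 -> 20 sequences
```

The same model on a 32 GB card nearly doubles the batch budget, which is the whole case for stepping up a tier on high-throughput workloads.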
Our best GPU for LLM inference guide walks through the full decision matrix across every workload type.
## Deploy Phi-3 on a Dedicated GPU Server
Fixed monthly pricing, full root access, UK datacenter. Pick the GPU that matches your Phi-3 variant.
Browse GPU Servers

For cost analysis, use our LLM cost calculator or check cost per million tokens by GPU.