Microsoft positioned Phi-3 as proof that small models could punch above their weight. Phi-3.5 takes that thesis further by adding a Mixture-of-Experts variant, expanding multilingual support to over 20 languages, and improving long-context handling — all while keeping the compact footprint that made Phi attractive for dedicated GPU deployments in the first place.
## New Model Variants
Phi-3 shipped in three dense sizes: Mini (3.8B), Small (7B), and Medium (14B). Phi-3.5 retains the Mini size and introduces a MoE variant that slots between Small and Medium in effective capability.
| Model | Parameters | Active Params | Context | Architecture |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 3.8B | 4K / 128K | Dense |
| Phi-3.5 Mini | 3.8B | 3.8B | 128K | Dense |
| Phi-3.5 MoE | 41.9B | 6.6B | 128K | 16 experts, 2 active |
| Phi-3.5 Vision | 4.2B | 4.2B | 128K | Dense + Vision Encoder |
The MoE variant is the headline addition. With 41.9B total parameters but only 6.6B active per token, it runs at roughly the speed of a 7B dense model while delivering quality closer to a 14B one. For teams weighing Phi-3 model sizes, it adds a compelling middle option.
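"16 experts, 2 active" means the router scores every expert for each token but runs only the top two, mixing their outputs by softmaxed router scores. A toy sketch of that top-2 gating step (illustrative only — the real router sits inside every transformer layer and is trained with load-balancing objectives):

```python
import math

def top2_route(logits):
    """Pick the 2 highest-scoring experts and softmax-normalise their
    scores into mixing weights (top-2 gating over 16 experts, as in
    Phi-3.5 MoE)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top2 = ranked[:2]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

# 16 router scores for one token; only experts 3 and 9 fire.
scores = [0.0] * 16
scores[3], scores[9] = 2.0, 1.0
print(top2_route(scores))  # expert 3 gets ~0.73, expert 9 ~0.27
```

Because the other 14 experts never execute for this token, per-token FLOPs track the 6.6B active parameters rather than the 41.9B total.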
## Benchmark Gains
| Benchmark | Phi-3 Mini (3.8B) | Phi-3.5 Mini (3.8B) | Phi-3.5 MoE (6.6B active) |
|---|---|---|---|
| MMLU | 68.8 | 69.0 | 78.9 |
| HumanEval | 58.5 | 62.8 | 70.4 |
| GSM8K | 75.7 | 77.9 | 88.7 |
| Multilingual MMLU | 55.4 | 62.9 | 69.9 |
| RULER (128K ctx) | N/A | 84.0 | N/A |
The Multilingual MMLU improvement — from 55.4 to 62.9 at the Mini size — reflects the expanded language training. If your application serves non-English users, Phi-3.5 is a meaningful upgrade without changing hardware.
## VRAM Impact
Phi-3.5 Mini is a direct swap for Phi-3 Mini with no additional VRAM cost. The MoE variant requires loading all 41.9B parameters but activates only 6.6B, so VRAM is determined by total weight size while compute cost tracks active parameters.
| Model | FP16 VRAM | INT4 VRAM | Minimum GPU |
|---|---|---|---|
| Phi-3.5 Mini | 7.6 GB | 2.8 GB | RTX 3090 |
| Phi-3.5 MoE | 83.8 GB | 24 GB | RTX 3090 (INT4) / RTX 6000 Pro 96 GB (FP16) |
| Phi-3.5 Vision | 8.5 GB | 3.2 GB | RTX 3090 |
The MoE variant at INT4 fits on a single 24 GB RTX 3090, albeit with little headroom left for KV cache, so keep context lengths modest. That is remarkably efficient for a model that benchmarks near 14B-class quality. Compare VRAM profiles using our tokens-per-second benchmark tool.
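The FP16 column follows directly from parameter count at 2 bytes per weight. A quick back-of-the-envelope check (decimal GB, matching the table; note that real INT4 footprints, such as Mini's 2.8 GB, run higher than a naive 0.5 bytes per weight because of quantisation scales and layers kept in higher precision):

```python
def fp16_vram_gb(total_params_billion):
    # FP16 stores 2 bytes per parameter; decimal GB to match the table.
    return total_params_billion * 2

print(fp16_vram_gb(3.8))   # 7.6  -> Phi-3.5 Mini
print(fp16_vram_gb(41.9))  # 83.8 -> Phi-3.5 MoE
```

For the MoE model the estimate uses the 41.9B *total* parameters, not the 6.6B active ones — every expert's weights must be resident even though only two run per token.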
## Migration Notes
Upgrading from Phi-3 Mini to Phi-3.5 Mini requires only a model weight swap. The tokeniser and chat template are compatible. Key considerations:
- The 128K context window is now the default (Phi-3 Mini shipped separate 4K and 128K variants). Set `--max-model-len` to your actual requirement to save VRAM.
- Vision capabilities in Phi-3.5 Vision need a multimodal serving framework; vLLM supports this natively with `--image-input-type pixel_values`.
- For the MoE variant, ensure your serving framework supports sparse models. vLLM and TGI both handle Phi-3.5 MoE correctly.
- Test multilingual outputs if you are expanding language coverage — the improvement is real but not uniform across all languages.
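To see why capping the context length matters, here is a rough KV-cache estimator, assuming Phi-3.5 Mini's published attention shape (32 layers, 32 KV heads, head dim 96 — verify against the model's `config.json`); actual serving-framework overheads will differ:

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=96, dtype_bytes=2):
    """Per-token KV cache: 2 tensors (K and V) per layer, each holding
    kv_heads * head_dim values. Defaults assume Phi-3.5 Mini's config
    at FP16 -- treat them as assumptions, not gospel."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return tokens * per_token

full = kv_cache_bytes(131_072)   # default 128K window
capped = kv_cache_bytes(8_192)   # e.g. a reduced max sequence length
print(f"{(full - capped) / 1e9:.1f} GB of KV-cache VRAM freed")
```

Under these assumptions a full 128K window reserves tens of gigabytes of KV cache on its own, which is why serving at the default window on a 24 GB card is impractical without a tighter sequence-length cap.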
## Which Version to Deploy
Deploy Phi-3.5 Mini if you are already on Phi-3 Mini — it is a free quality upgrade with identical resource requirements. Deploy the MoE variant if you need better reasoning and coding ability without jumping to a larger dense model like Mistral Large or Qwen 2.5 72B.
For a broader comparison of small models in this category, see the Gemma 2 vs Gemma 1 guide. Explore detailed deployment steps in the self-hosted LLM guide and the best GPU for inference breakdown.
## Run Phi-3.5 on Dedicated Hardware
Deploy Phi-3.5 Mini, MoE, or Vision on bare-metal GPU servers. Compact models, full root access, no per-token fees.
Browse GPU Servers