Phi-3 Mini Benchmark Overview
Microsoft Phi-3 Mini is a compact 3.8-billion-parameter model that punches well above its weight class, offering strong reasoning and instruction-following in a small footprint. Its modest VRAM requirements make it a natural fit for cost-effective dedicated GPU servers. In this benchmark we measure tokens per second across six GPUs to help you choose the right hardware.
All tests used vLLM with a 512-token input prompt and 256-token output generation. Because Phi-3 Mini is only 3.8B parameters, it fits comfortably in FP16 on GPUs with as little as 8 GB of VRAM. For broader model comparisons, see our tokens per second benchmark hub.
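The methodology is simple: send a fixed-size prompt, generate a fixed number of output tokens, and divide tokens produced by wall-clock time. A minimal sketch of that timing harness is below; `fake_generate` is a stand-in (in a real run you would wrap a call to `vllm.LLM.generate` and return the output token count), and all function names here are illustrative, not part of the benchmark suite used for these tables.

```python
import time

def measure_throughput(generate, prompt_tokens=512, output_tokens=256, rounds=3):
    """Time `generate` over several rounds and return mean output tokens/sec."""
    speeds = []
    for _ in range(rounds):
        start = time.perf_counter()
        produced = generate(prompt_tokens, output_tokens)
        elapsed = time.perf_counter() - start
        speeds.append(produced / elapsed)
    return sum(speeds) / len(speeds)

# Stand-in generator: sleeps to simulate decode time and reports
# the number of output tokens "produced".
def fake_generate(prompt_tokens, output_tokens):
    time.sleep(0.01)
    return output_tokens

tps = measure_throughput(fake_generate)
```

Averaging over multiple rounds smooths out warm-up effects such as CUDA graph capture and cache allocation, which can distort a single-shot measurement.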
Tokens/sec Results by GPU
| GPU | VRAM | Phi-3 Mini FP16 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | 22 tok/s | Tight fit in 6 GB |
| RTX 4060 | 8 GB | 38 tok/s | Comfortable fit |
| RTX 4060 Ti | 16 GB | 52 tok/s | Plenty of headroom |
| RTX 3090 | 24 GB | 68 tok/s | Room for concurrent requests |
| RTX 5080 | 16 GB | 105 tok/s | Excellent throughput |
| RTX 5090 | 32 GB | 148 tok/s | Best-in-class speed |
Phi-3 Mini’s small parameter count means even the RTX 3050 can run it at FP16 with usable speed. The RTX 4060 at 38 tok/s is more than adequate for development workloads, while the RTX 5090 reaches an impressive 148 tok/s.
FP16 vs INT4 Speed Comparison
Since Phi-3 Mini already has a small memory footprint, quantisation is less about fitting the model and more about maximising throughput. Here we compare FP16 and INT4. For more on quantisation trade-offs, see our LLaMA 3 8B quantisation comparison.
| GPU | FP16 (tok/s) | INT4 (tok/s) | Speed Gain |
|---|---|---|---|
| RTX 3050 (6 GB) | 22 | 31 | +41% |
| RTX 4060 (8 GB) | 38 | 52 | +37% |
| RTX 4060 Ti (16 GB) | 52 | 72 | +38% |
| RTX 3090 (24 GB) | 68 | 94 | +38% |
| RTX 5080 (16 GB) | 105 | 143 | +36% |
| RTX 5090 (32 GB) | 148 | 198 | +34% |
INT4 delivers a consistent 34-41% speed improvement across all GPUs. Given that Phi-3 Mini shows minimal quality loss at 4-bit precision, INT4 is the recommended configuration for throughput-focused deployments.
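The "Speed Gain" column is just the relative change between the two precisions. Using the figures from the table above, the calculation can be sketched as follows (the dictionaries simply restate the table; the helper name is illustrative):

```python
# FP16 and INT4 throughput (tok/s) from the comparison table above
fp16 = {"RTX 3050": 22, "RTX 4060": 38, "RTX 4060 Ti": 52,
        "RTX 3090": 68, "RTX 5080": 105, "RTX 5090": 148}
int4 = {"RTX 3050": 31, "RTX 4060": 52, "RTX 4060 Ti": 72,
        "RTX 3090": 94, "RTX 5080": 143, "RTX 5090": 198}

def speed_gain_pct(gpu):
    """Percentage speed-up of INT4 over FP16, rounded to the nearest whole %."""
    return round((int4[gpu] / fp16[gpu] - 1) * 100)

gains = {gpu: speed_gain_pct(gpu) for gpu in fp16}
```

Note the gain shrinks slightly on faster cards (41% on the RTX 3050 down to 34% on the RTX 5090), which is expected: the quicker the GPU, the smaller the share of each step spent on the memory reads that quantisation reduces.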
Cost Efficiency Analysis
Phi-3 Mini is one of the most cost-effective models to serve. Below we compare tokens per second per pound of monthly dedicated GPU hosting cost.
| GPU | FP16 tok/s | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| RTX 3050 | 22 | ~£45 | 0.49 |
| RTX 4060 | 38 | ~£60 | 0.63 |
| RTX 4060 Ti | 52 | ~£75 | 0.69 |
| RTX 3090 | 68 | ~£110 | 0.62 |
| RTX 5080 | 105 | ~£160 | 0.66 |
| RTX 5090 | 148 | ~£250 | 0.59 |
The RTX 4060 Ti offers the best cost efficiency for Phi-3 Mini, closely followed by the RTX 5080. If you need the best GPU for Phi-3, the 4060 Ti is hard to beat on value.
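The "tok/s per Pound" column divides FP16 throughput by approximate monthly cost. Restating the table's figures, the ranking can be reproduced like this (names and structure are illustrative):

```python
# GPU: (FP16 tok/s, approx. monthly cost in GBP), from the table above
benchmarks = {
    "RTX 3050": (22, 45),  "RTX 4060": (38, 60),  "RTX 4060 Ti": (52, 75),
    "RTX 3090": (68, 110), "RTX 5080": (105, 160), "RTX 5090": (148, 250),
}

def toks_per_pound(gpu):
    """FP16 tokens/sec per pound of monthly hosting cost, to 2 d.p."""
    tps, cost = benchmarks[gpu]
    return round(tps / cost, 2)

best_value = max(benchmarks, key=toks_per_pound)
```

This kind of one-liner ranking is handy when prices change: update the cost column and re-run rather than recomputing the table by hand.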
GPU Recommendations
- Best budget: RTX 4060 — 38 tok/s at FP16 for just ~£60/month. Ideal for development and testing.
- Best value: RTX 4060 Ti — the highest tokens per pound at 0.69 tok/s/£.
- Best performance: RTX 5090 — 148 tok/s for high-concurrency production APIs.
- Sweet spot: RTX 5080 — near-top speed at a more moderate price point.
Compare these results with the Qwen 2.5 7B benchmark or the Gemma 2 9B results to see how Phi-3 Mini stacks up against similarly sized models. Browse all data in the Benchmarks category.
Conclusion
Phi-3 Mini is one of the most efficient models to deploy, running at full FP16 precision on even modest 6 GB GPUs. Its compact size makes it ideal for edge-like dedicated server deployments where cost is a primary concern, while still delivering strong reasoning and instruction-following capabilities.
Run Phi-3 Mini on Dedicated GPU Servers
Affordable bare-metal GPU hosting starting from budget GPUs. Full root access, fast NVMe, and 24/7 support.
Browse GPU Servers