Phi-3 Mini Tokens/sec by GPU

Benchmark results for Microsoft Phi-3 Mini (3.8B) inference speed across six GPUs with FP16 and INT4 comparisons, plus cost-efficiency data for dedicated GPU hosting.

Phi-3 Mini Benchmark Overview

Microsoft Phi-3 Mini is a compact 3.8-billion-parameter model that offers strong reasoning and instruction-following for its size. Its modest VRAM requirements make it a natural fit for cost-effective dedicated GPU servers. In this benchmark we measure tokens per second across six GPUs to help you choose the right hardware.

All tests used vLLM with a 512-token input prompt and 256-token output generation. Because Phi-3 Mini is only 3.8B parameters, it fits comfortably in FP16 on GPUs with as little as 8 GB of VRAM. For broader model comparisons, see our tokens per second benchmark hub.
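The tok/s figures in this post are generated tokens divided by wall-clock generation time. The sketch below shows one way to compute that number. It is an illustration under stated assumptions, not the harness used for these results: `measure_tokens_per_second` and `fake_generate` are hypothetical names, and the stand-in generator replaces a real model call so the snippet runs without a GPU. The post does not say whether prompt processing was timed separately, so the sketch simply times the whole call.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate: Callable[[str, int], Sequence[int]],
    prompt: str,
    max_new_tokens: int = 256,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second for a single call to `generate`."""
    start = clock()
    token_ids = generate(prompt, max_new_tokens)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(token_ids) / elapsed


def fake_generate(prompt: str, max_new_tokens: int) -> list[int]:
    """Hypothetical stand-in for a model call; returns dummy token ids."""
    return list(range(max_new_tokens))


if __name__ == "__main__":
    # Roughly 512 tokens; the real count depends on the tokenizer.
    prompt = "word " * 512
    rate = measure_tokens_per_second(fake_generate, prompt)
    print(f"{rate:.0f} tok/s (stand-in generator, not a real measurement)")
```

To time a real model, `generate` could wrap a vLLM call such as `LLM.generate` with `SamplingParams(max_tokens=256)` and return the token ids of the first output. Averaging several runs after a warm-up pass would give a steadier figure than a single call.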

Tokens/sec Results by GPU

| GPU | VRAM | Phi-3 Mini FP16 (tok/s) | Notes |
|---|---|---|---|
| RTX 3050 | 6 GB | 22 | Fits in 6 GB VRAM |
| RTX 4060 | 8 GB | 38 | Comfortable fit |
| RTX 4060 Ti | 16 GB | 52 | Plenty of headroom |
| RTX 3090 | 24 GB | 68 | Room for concurrent requests |
| RTX 5080 | 16 GB | 105 | Excellent throughput |
| RTX 5090 | 32 GB | 148 | Best-in-class speed |

Phi-3 Mini’s small parameter count means even the RTX 3050 can run it at FP16, at 22 tok/s. The RTX 4060 at 38 tok/s is more than adequate for development workloads, while the RTX 5090 reaches 148 tok/s, about 6.7 times the RTX 3050.

FP16 vs INT4 Speed Comparison

Since Phi-3 Mini already has a small memory footprint, quantisation is less about fitting the model and more about maximising throughput. Here we compare FP16 and INT4. For more on quantisation trade-offs, see our LLaMA 3 8B quantisation comparison.

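| GPU | FP16 (tok/s) | INT4 (tok/s) | Speed Gain |
|---|---|---|---|
| RTX 3050 (6 GB) | 22 | 31 | +41% |
| RTX 4060 (8 GB) | 38 | 52 | +37% |
| RTX 4060 Ti (16 GB) | 52 | 72 | +38% |
| RTX 3090 (24 GB) | 68 | 94 | +38% |
| RTX 5080 (16 GB) | 105 | 143 | +36% |
| RTX 5090 (32 GB) | 148 | 198 | +34% |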

INT4 delivers a consistent 34-41% speed improvement across all GPUs. Given that Phi-3 Mini shows minimal quality loss at 4-bit precision, INT4 is the recommended configuration for throughput-focused deployments.
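The speed-gain column is the relative increase from FP16 to INT4 throughput. As a quick check, the short sketch below recomputes it from the figures in the table; the helper name `speed_gain_percent` is ours, chosen for illustration.

```python
# FP16 and INT4 throughput (tok/s) copied from the comparison table.
RESULTS = {
    "RTX 3050": (22, 31),
    "RTX 4060": (38, 52),
    "RTX 4060 Ti": (52, 72),
    "RTX 3090": (68, 94),
    "RTX 5080": (105, 143),
    "RTX 5090": (148, 198),
}


def speed_gain_percent(fp16_tps: float, int4_tps: float) -> float:
    """Relative INT4 speed-up over FP16, as a percentage."""
    return (int4_tps - fp16_tps) / fp16_tps * 100


# Round to whole percentages, matching the precision used in the table.
gains = {
    gpu: round(speed_gain_percent(fp16, int4))
    for gpu, (fp16, int4) in RESULTS.items()
}

for gpu, gain in gains.items():
    print(f"{gpu}: +{gain}%")
```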

Cost Efficiency Analysis

Phi-3 Mini is one of the most cost-effective models to serve. Below we compare tokens per second per pound of monthly dedicated GPU hosting cost.

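| GPU | FP16 (tok/s) | Approx. Monthly Cost | tok/s per Pound |
|---|---|---|---|
| RTX 3050 | 22 | ~£45 | 0.49 |
| RTX 4060 | 38 | ~£60 | 0.63 |
| RTX 4060 Ti | 52 | ~£75 | 0.69 |
| RTX 3090 | 68 | ~£110 | 0.62 |
| RTX 5080 | 105 | ~£160 | 0.66 |
| RTX 5090 | 148 | ~£250 | 0.59 |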

The RTX 4060 Ti offers the best cost efficiency for Phi-3 Mini at 0.69 tok/s per pound, closely followed by the RTX 5080 at 0.66. If you need the best GPU for Phi-3, the 4060 Ti is hard to beat on value.
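The cost-efficiency metric is FP16 throughput divided by the approximate monthly price. The sketch below recomputes it and ranks the GPUs, using the approximate costs from the table; `tokens_per_pound` is an illustrative helper name.

```python
# FP16 throughput (tok/s) and approximate monthly cost (GBP) from the cost table.
GPUS = {
    "RTX 3050": (22, 45),
    "RTX 4060": (38, 60),
    "RTX 4060 Ti": (52, 75),
    "RTX 3090": (68, 110),
    "RTX 5080": (105, 160),
    "RTX 5090": (148, 250),
}


def tokens_per_pound(tok_per_sec: float, monthly_cost_gbp: float) -> float:
    """Throughput per pound of monthly hosting cost."""
    return tok_per_sec / monthly_cost_gbp


efficiency = {
    gpu: round(tokens_per_pound(tps, cost), 2)
    for gpu, (tps, cost) in GPUS.items()
}

# Most cost-efficient GPU first.
ranking = sorted(efficiency, key=efficiency.get, reverse=True)

for gpu in ranking:
    print(f"{gpu}: {efficiency[gpu]:.2f} tok/s per pound")
```

Because the monthly prices are approximate, small differences in this ranking (for example 0.62 versus 0.63) should not be read as significant.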

GPU Recommendations

  • Best budget: RTX 4060, at 38 tok/s in FP16 for ~£60/month. Ideal for development and testing.
  • Best value: RTX 4060 Ti, with the highest cost efficiency at 0.69 tok/s per pound.
  • Best performance: RTX 5090, at 148 tok/s for high-concurrency production APIs.
  • Sweet spot: RTX 5080, with the second-highest throughput (105 tok/s) at ~£160/month versus ~£250 for the RTX 5090.
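
If you have a throughput target in mind, the recommendations above reduce to picking the cheapest GPU that meets it. The sketch below does that with the FP16 figures and approximate prices from this post. `cheapest_gpu_for` is a hypothetical helper, and it considers only the measured throughput, not concurrency or context length.

```python
from typing import Optional

# FP16 throughput (tok/s) and approximate monthly cost (GBP) from this post.
GPUS = {
    "RTX 3050": (22, 45),
    "RTX 4060": (38, 60),
    "RTX 4060 Ti": (52, 75),
    "RTX 3090": (68, 110),
    "RTX 5080": (105, 160),
    "RTX 5090": (148, 250),
}


def cheapest_gpu_for(min_tok_per_sec: float) -> Optional[str]:
    """Return the lowest-cost GPU meeting the target, or None if none does."""
    candidates = [
        (cost, gpu)
        for gpu, (tps, cost) in GPUS.items()
        if tps >= min_tok_per_sec
    ]
    return min(candidates)[1] if candidates else None


for target in (30, 100, 200):
    print(f"{target} tok/s -> {cheapest_gpu_for(target)}")
```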

Compare these results with the Qwen 2.5 7B benchmark or the Gemma 2 9B results to see how Phi-3 Mini stacks up against larger 7B to 9B models. Browse all data in the Benchmarks category.

Conclusion

Phi-3 Mini is one of the most efficient models to deploy: in these tests it ran at FP16 on every GPU, including the 6 GB RTX 3050. Its compact size suits dedicated server deployments where cost is the primary concern, while still delivering strong reasoning and instruction-following capabilities.

Run Phi-3 Mini on Dedicated GPU Servers

Affordable bare-metal GPU hosting starting from budget GPUs. Full root access, fast NVMe, and 24/7 support.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
