Phi-3 Mini Tokens/sec by GPU

Benchmark results for Microsoft Phi-3 Mini (3.8B) inference speed across six GPUs with FP16 and INT4 comparisons, plus cost-efficiency data for dedicated GPU hosting.

Benchmarks | April 14, 2026 | 3 min read | By admin

Table of Contents

Phi-3 Mini Benchmark Overview
Tokens/sec Results by GPU
FP16 vs INT4 Speed Comparison
Cost Efficiency Analysis
GPU Recommendations
Conclusion

Phi-3 Mini Benchmark Overview

Microsoft Phi-3 Mini is a compact 3.8-billion-parameter model that offers strong reasoning and instruction-following for its size. Its modest VRAM requirements make it a natural fit for cost-effective dedicated GPU servers. In this benchmark we measure tokens per second across six GPUs to help you choose the right hardware.

All tests used vLLM with a 512-token input prompt and 256-token output generation. Because Phi-3 Mini is only 3.8B parameters, it fits comfortably in FP16 on GPUs with as little as 8 GB of VRAM. For broader model comparisons, see our tokens per second benchmark hub.
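The setup described above can be approximated with a short script. This is a minimal sketch, not the harness behind the published numbers: the model ID `microsoft/Phi-3-mini-4k-instruct`, the repeated-word prompt (only a rough stand-in for a 512-token input), the greedy sampling settings, and the helper names `tokens_per_second` and `run_benchmark` are all assumptions.

```python
import time


def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def run_benchmark(model_id: str = "microsoft/Phi-3-mini-4k-instruct") -> float:
    """Time one 256-token generation with vLLM (needs a CUDA GPU and vLLM)."""
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id, dtype="float16")
    # ignore_eos forces the full 256 output tokens regardless of content.
    params = SamplingParams(max_tokens=256, temperature=0.0, ignore_eos=True)
    prompt = "hello " * 512  # assumption: rough stand-in for a 512-token prompt

    start = time.perf_counter()
    outputs = llm.generate([prompt], params)
    elapsed = time.perf_counter() - start  # includes prompt processing time

    generated = len(outputs[0].outputs[0].token_ids)
    return tokens_per_second(generated, elapsed)
```

On a GPU machine with vLLM installed, `print(run_benchmark())` would report an end-to-end rate for a single request. Because the timer also covers prompt processing, the figure may differ from a decode-only measurement.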

Tokens/sec Results by GPU

GPU	VRAM	Phi-3 Mini FP16 (tok/s)	Notes
RTX 3050	6 GB	22 tok/s	Fits in 6 GB VRAM
RTX 4060	8 GB	38 tok/s	Comfortable fit
RTX 4060 Ti	16 GB	52 tok/s	Plenty of headroom
RTX 3090	24 GB	68 tok/s	Room for concurrent requests
RTX 5080	16 GB	105 tok/s	Excellent throughput
RTX 5090	32 GB	148 tok/s	Best-in-class speed

Phi-3 Mini’s small parameter count means even the RTX 3050 can run it at FP16 with usable speed. The RTX 4060 at 38 tok/s is more than adequate for development workloads, while the RTX 5090 reaches an impressive 148 tok/s.
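To translate these rates into response times, divide the output length by the throughput. The sketch below does this for the 256-token output used in the tests. It assumes the table figures describe a single request's steady generation rate and it ignores prompt processing time; the names are illustrative.

```python
# FP16 throughput (tok/s) from the results table above.
FP16_TOK_S = {
    "RTX 3050": 22,
    "RTX 4060": 38,
    "RTX 4060 Ti": 52,
    "RTX 3090": 68,
    "RTX 5080": 105,
    "RTX 5090": 148,
}

OUTPUT_TOKENS = 256  # output length used in the benchmark


def generation_seconds(tok_s: float, tokens: int = OUTPUT_TOKENS) -> float:
    """Seconds needed to emit `tokens` at a steady `tok_s` rate."""
    return tokens / tok_s


for gpu, rate in FP16_TOK_S.items():
    print(f"{gpu:<12} {generation_seconds(rate):5.1f} s for {OUTPUT_TOKENS} tokens")
```

Under these assumptions, a 256-token reply takes roughly 11.6 seconds on the RTX 3050 and under 2 seconds on the RTX 5090.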

FP16 vs INT4 Speed Comparison

Since Phi-3 Mini already has a small memory footprint, quantisation is less about fitting the model and more about maximising throughput. Here we compare FP16 and INT4. For more on quantisation trade-offs, see our LLaMA 3 8B quantisation comparison.

GPU	FP16 (tok/s)	INT4 (tok/s)	Speed Gain
RTX 3050 (6 GB)	22	31	+41%
RTX 4060 (8 GB)	38	52	+37%
RTX 4060 Ti (16 GB)	52	72	+38%
RTX 3090 (24 GB)	68	94	+38%
RTX 5080 (16 GB)	105	143	+36%
RTX 5090 (32 GB)	148	198	+34%

INT4 delivers a consistent 34-41% speed improvement across all GPUs. Given that Phi-3 Mini shows minimal quality loss at 4-bit precision, INT4 is the recommended configuration for throughput-focused deployments.
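The Speed Gain column is the relative increase of the INT4 rate over the FP16 rate. A small sketch that recomputes it from the table values (the function and variable names are illustrative):

```python
# (FP16 tok/s, INT4 tok/s) pairs from the comparison table above.
RESULTS = {
    "RTX 3050": (22, 31),
    "RTX 4060": (38, 52),
    "RTX 4060 Ti": (52, 72),
    "RTX 3090": (68, 94),
    "RTX 5080": (105, 143),
    "RTX 5090": (148, 198),
}


def speed_gain_percent(fp16: float, int4: float) -> float:
    """Relative INT4 speed-up over FP16, in percent."""
    return (int4 / fp16 - 1.0) * 100.0


for gpu, (fp16, int4) in RESULTS.items():
    print(f"{gpu:<12} +{speed_gain_percent(fp16, int4):.0f}%")
```

Rounded to whole percentages, the output matches the table, from +41% on the RTX 3050 down to +34% on the RTX 5090.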

Cost Efficiency Analysis

Phi-3 Mini is one of the most cost-effective models to serve. Below we compare tokens per second per pound of monthly dedicated GPU hosting cost.

GPU	FP16 tok/s	Approx. Monthly Cost	tok/s per Pound
RTX 3050	22	~£45	0.49
RTX 4060	38	~£60	0.63
RTX 4060 Ti	52	~£75	0.69
RTX 3090	68	~£110	0.62
RTX 5080	105	~£160	0.66
RTX 5090	148	~£250	0.59

The RTX 4060 Ti offers the best cost efficiency for Phi-3 Mini, closely followed by the RTX 5080. If you need the best GPU for Phi-3, the 4060 Ti is hard to beat on value.
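The ranking can be reproduced by dividing each FP16 rate by its monthly cost. The sketch below also derives a cost per million tokens, which is not in the table: it assumes a 30-day month and a single request stream running continuously, so treat it as an upper bound on cost rather than a measured figure. All names are illustrative.

```python
# (FP16 tok/s, approx. monthly cost in GBP) from the table above.
GPUS = {
    "RTX 3050": (22, 45),
    "RTX 4060": (38, 60),
    "RTX 4060 Ti": (52, 75),
    "RTX 3090": (68, 110),
    "RTX 5080": (105, 160),
    "RTX 5090": (148, 250),
}

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def tok_s_per_pound(tok_s: float, monthly_cost: float) -> float:
    """Throughput bought per pound of monthly hosting cost."""
    return tok_s / monthly_cost


def cost_per_million_tokens(
    tok_s: float, monthly_cost: float, utilisation: float = 1.0
) -> float:
    """GBP per 1M generated tokens; `utilisation` is the assumed busy fraction."""
    tokens_per_month = tok_s * SECONDS_PER_MONTH * utilisation
    return monthly_cost / tokens_per_month * 1_000_000


# Rank GPUs from best to worst value.
ranked = sorted(GPUS, key=lambda g: tok_s_per_pound(*GPUS[g]), reverse=True)
for gpu in ranked:
    tok_s, cost = GPUS[gpu]
    value = tok_s_per_pound(tok_s, cost)
    per_million = cost_per_million_tokens(tok_s, cost)
    print(f"{gpu:<12} {value:.2f} tok/s/£   £{per_million:.2f} per 1M tokens")
```

Lowering `utilisation` raises the per-token cost proportionally, which is often the more realistic case for a server that is not busy around the clock.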

GPU Recommendations

Best budget: RTX 4060, with 38 tok/s at FP16 for ~£60/month. Well suited to development and testing.
Best value: RTX 4060 Ti, with the highest tokens per pound at 0.69 tok/s/£.
Best performance: RTX 5090, with 148 tok/s for high-concurrency production APIs.
Sweet spot: RTX 5080, with 105 tok/s at ~£160/month and the second-best value at 0.66 tok/s/£.

Compare these results with the Qwen 2.5 7B benchmark or the Gemma 2 9B results to see how Phi-3 Mini stacks up against larger 7B and 9B models. Browse all data in the Benchmarks category.

Conclusion

Phi-3 Mini is one of the most efficient models to deploy, running at full FP16 precision on even modest 6 GB GPUs. Its compact size makes it a good fit for low-cost dedicated server deployments, while still delivering strong reasoning and instruction-following capabilities.

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
