NVIDIA’s Blackwell-generation RTX 5080 brings a major memory-bandwidth uplift over the 40-series. For a model as compact as Phi-3 Mini (3.8B), that translates directly into faster token generation. We measured 82 tok/s single-stream on GigaGPU dedicated hardware — here is the full picture.
Throughput & Latency
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 82 tok/s |
| Tokens/sec (batched, bs=8) | 131.2 tok/s |
| Per-token latency | 12.2 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 8K |
| Performance rating | Excellent |
Single-stream figures use a 512-token prompt and a 256-token completion on the llama.cpp backend. Phi-3 Mini is bandwidth-limited at this scale, so the 5080's faster GDDR7 bus is doing the heavy lifting.
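The bandwidth-limited claim is easy to sanity-check with a decode roofline: every generated token has to stream the full weight set from VRAM, so tok/s is capped by bandwidth divided by weight bytes. A minimal sketch, assuming the RTX 5080's published 960 GB/s memory bandwidth and the 8.0 GB FP16 weight figure from the table above:

```python
# Back-of-envelope decode roofline: each generated token streams the
# full weight set from VRAM, so tok/s <= bandwidth / weight bytes.
bandwidth_gb_s = 960.0   # RTX 5080 GDDR7 spec, assumed
weights_gb = 8.0         # Phi-3 Mini FP16, from the table above
ceiling = bandwidth_gb_s / weights_gb    # theoretical tok/s ceiling
measured = 82.0                          # measured single-stream tok/s
efficiency = measured / ceiling

print(f"roofline ceiling: {ceiling:.0f} tok/s")       # → 120 tok/s
print(f"measured 82 tok/s = {efficiency:.0%} of ceiling")  # → 68%
```

Hitting roughly two-thirds of the theoretical ceiling is in line with what a well-tuned llama.cpp build achieves once kernel launch and attention overheads are accounted for.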
How VRAM Splits
| Component | VRAM |
|---|---|
| Model weights (FP16) | 8.0 GB |
| KV cache + runtime | ~1.2 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~6.8 GB |
Roughly 6.8 GB of VRAM remains available after loading the model. That is enough to extend context, serve multiple concurrent users, or layer a second small model on the same card without running into OOM errors.
Running Costs
| Cost Metric | Value |
|---|---|
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £3.218 |
| Tokens per £1 | ~310,737 |
| Break-even vs API | ~1 req/day |
At £3.22 per million tokens (single-stream), the 5080 actually edges out the RTX 3090 on per-token cost while delivering 32% more throughput. Batched, you are looking at roughly £2.01/M. Use our cost calculator to model your own traffic patterns.
Where This Fits
Eighty-two tokens per second puts Phi-3 Mini responses well within the “feels instant” range for end users. This is a strong choice for production chatbots, real-time extraction pipelines, and any workload that demands both speed and the model’s reasoning capability. If you need even more headroom for multi-model deployments, the RTX 5090 with 32 GB takes things further.
Spin it up:

```shell
# Note: the benchmarks above were run at FP16; this example loads a Q4_K_M build
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/phi-3-mini.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
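Once the container is up, the llama.cpp server exposes a `/completion` endpoint that takes a JSON body with `prompt` and `n_predict`. A minimal stdlib-only client sketch, assuming the server is reachable on localhost:8080:

```python
# Minimal client for the llama.cpp server container started above.
# Assumes the server is listening on localhost:8080.
import json
import urllib.request

def complete(prompt, n_predict=64, host="http://localhost:8080"):
    """POST a prompt to the llama.cpp /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{host}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]
```

Call it as `complete("Explain GDDR7 in one sentence.")` to smoke-test the deployment before pointing real traffic at it.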
More detail in the Phi-3 hosting guide. Related reads: best GPU for LLM inference, full benchmark index, and tok/s comparison tool.
82 tok/s Phi-3 Mini — RTX 5080 Servers
Blackwell-generation speed at a flat monthly rate. UK datacentre, root access included.
Order an RTX 5080