Benchmarks

Phi-3 Mini on RTX 5080: Performance Benchmark & Cost


NVIDIA’s Blackwell-generation RTX 5080 brings a major memory-bandwidth uplift over the 40-series. For a model as compact as Phi-3 Mini (3.8B), that translates directly into faster token generation. We measured 82 tok/s single-stream on GigaGPU dedicated hardware — here is the full picture.

Throughput & Latency

| Metric | Value |
| --- | --- |
| Tokens/sec (single stream) | 82 tok/s |
| Tokens/sec (batched, bs=8) | 131.2 tok/s |
| Per-token latency | 12.2 ms |
| Precision | FP16 |
| Quantisation | None (FP16) |
| Max context length | 8K |
| Performance rating | Excellent |

Single-stream figures use a 512-token prompt and 256-token completion on the llama.cpp backend. Phi-3 Mini is bandwidth-limited at this scale, and the 5080’s faster GDDR7 bus is doing the heavy lifting.
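The bandwidth-limited claim is easy to sanity-check with a back-of-envelope estimate: during single-stream decode, every FP16 weight is read roughly once per generated token, so memory bandwidth caps throughput. The sketch below assumes a GDDR7 bandwidth of ~960 GB/s for the RTX 5080 (a spec-sheet figure, not something we measured) and 3.8B parameters at 2 bytes each:

```python
# Back-of-envelope decode ceiling for a bandwidth-bound model.
# Assumed: RTX 5080 memory bandwidth ~960 GB/s; Phi-3 Mini 3.8B params at FP16.
params = 3.8e9
bytes_per_token = params * 2            # each FP16 weight read once per token
bandwidth = 960e9                       # bytes/s, assumed from the spec sheet
ceiling = bandwidth / bytes_per_token   # theoretical tok/s upper bound

measured = 82
efficiency = measured / ceiling
print(f"ceiling ≈ {ceiling:.0f} tok/s, efficiency ≈ {efficiency:.0%}")
```

A ceiling of roughly 126 tok/s against a measured 82 tok/s (~65% of theoretical) is in the normal range for llama.cpp single-stream decode, which is consistent with the run being memory-bound rather than compute-bound.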

How VRAM Splits

| Component | VRAM |
| --- | --- |
| Model weights (FP16) | 8.0 GB |
| KV cache + runtime | ~1.2 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~6.8 GB |

Roughly 6.8 GB remains available after loading the model. That is enough to extend context, serve multiple concurrent users, or layer a second small model on the same card without running into OOM errors.
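KV-cache growth is what eats into that headroom as context lengthens. A rough sizing sketch, assuming Phi-3 Mini's published config (32 layers, hidden size 3072) and FP16 K/V tensors:

```python
# Rough KV-cache sizing for Phi-3 Mini.
# Assumed config: 32 layers, hidden size 3072, FP16 (2 bytes) per element.
layers, hidden, dtype_bytes = 32, 3072, 2
kv_per_token = 2 * layers * hidden * dtype_bytes  # K + V across all layers

def kv_gb(context_tokens: int) -> float:
    """KV-cache footprint in GB for a given number of cached tokens."""
    return context_tokens * kv_per_token / 1024**3

print(f"{kv_per_token / 1024:.0f} KiB per token")
print(f"full 8K context: {kv_gb(8192):.2f} GB")
```

At ~384 KiB per cached token, a fully populated 8K context works out to about 3 GB of KV cache, which still fits comfortably in the free headroom; the ~1.2 GB in the table reflects the shorter contexts used during the benchmark runs.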

Running Costs

| Cost Metric | Value |
| --- | --- |
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £3.218 |
| Tokens per £1 | 310,752 |
| Break-even vs API | ~1 req/day |

At £3.22 per million tokens (single-stream), the 5080 actually edges out the RTX 3090 on per-token cost while delivering 32% more throughput. Batched, you are looking at roughly £2.01/M. Use our cost calculator to model your own traffic patterns.
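Both per-million figures fall straight out of the hourly rate and throughput, so you can plug in your own numbers (the helper name here is ours, not part of any tool):

```python
# Derive cost per 1M tokens from an hourly server rate and throughput.
def cost_per_million(rate_per_hr: float, tok_per_s: float) -> float:
    tokens_per_hr = tok_per_s * 3600
    return rate_per_hr / tokens_per_hr * 1e6

print(f"single-stream: £{cost_per_million(0.95, 82):.3f}/M")    # ~£3.218
print(f"batched bs=8:  £{cost_per_million(0.95, 131.2):.3f}/M")  # ~£2.011
```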

Where This Fits

Eighty-two tokens per second puts Phi-3 Mini responses well within the “feels instant” range for end users. This is a strong choice for production chatbots, real-time extraction pipelines, and any workload that demands both speed and the model’s reasoning capability. If you need even more headroom for multi-model deployments, the RTX 5090 with 32 GB takes things further.

Spin it up (note this example loads a Q4_K_M GGUF; point -m at the FP16 GGUF to reproduce the numbers above):

docker run --gpus all -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server \
  -m /models/phi-3-mini.Q4_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 99
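Once the container is up, the llama.cpp server exposes an OpenAI-compatible chat endpoint you can hit directly; a quick smoke test (host, port, and prompt are just examples):

```shell
# Query the running llama.cpp server via its OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarise GDDR7 in one sentence."}],
        "max_tokens": 128
      }'
```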

More detail in the Phi-3 hosting guide. Related reads: best GPU for LLM inference, full benchmark index, and tok/s comparison tool.

82 tok/s Phi-3 Mini — RTX 5080 Servers

Blackwell-generation speed at a flat monthly rate. UK datacentre, root access included.

Order an RTX 5080



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
