NVIDIA’s Blackwell-generation RTX 5080 brings a major memory-bandwidth uplift over the 40-series. For a model as compact as Phi-3 Mini (3.8B), that translates directly into faster token generation. We measured 82 tok/s single-stream on GigaGPU dedicated hardware — here is the full picture.
Throughput & Latency
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | 82 tok/s |
| Tokens/sec (batched, bs=8) | 131.2 tok/s |
| Per-token latency | 12.2 ms |
| Precision | FP16 |
| Quantisation | None (native FP16) |
| Max context length | 8K |
| Performance rating | Excellent |
Single-stream figures use a 512-token prompt and a 256-token completion on the llama.cpp backend. Phi-3 Mini is bandwidth-limited at this scale, so the 5080's faster GDDR7 bus is doing the heavy lifting.
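The bandwidth-limited claim is easy to sanity-check with a decode roofline: every generated token has to stream the full weight set from VRAM, so tok/s is capped by bandwidth divided by weight bytes. A minimal sketch, assuming the RTX 5080's published 960 GB/s memory bandwidth and the 8.0 GB FP16 weight figure from the table above:

```python
# Back-of-envelope decode roofline: each generated token streams the
# full weight set from VRAM, so tok/s <= bandwidth / weight bytes.
bandwidth_gb_s = 960.0   # RTX 5080 GDDR7 spec, assumed
weights_gb = 8.0         # Phi-3 Mini FP16, from the table above
ceiling = bandwidth_gb_s / weights_gb    # theoretical tok/s ceiling
measured = 82.0                          # measured single-stream tok/s
efficiency = measured / ceiling

print(f"roofline ceiling: {ceiling:.0f} tok/s")       # → 120 tok/s
print(f"measured 82 tok/s = {efficiency:.0%} of ceiling")  # → 68%
```

Hitting roughly two-thirds of the theoretical ceiling is in line with what a well-tuned llama.cpp build achieves once kernel launch and attention overheads are accounted for.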
How VRAM Splits
| Component | VRAM |
|---|---|
| Model weights (FP16) | 8.0 GB |
| KV cache + runtime | ~1.2 GB |
| Total RTX 5080 VRAM | 16 GB |
| Free headroom | ~6.8 GB |
Roughly 6.8 GB of VRAM remains available after loading the model. That is enough to extend context, serve multiple concurrent users, or layer a second small model on the same card without running into OOM errors.
Running Costs
| Cost Metric | Value |
|---|---|
| Server cost | £0.95/hr (£189/mo) |
| Cost per 1M tokens | £3.218 |
| Tokens per £1 | ~310,737 |
| Break-even vs API | ~1 req/day |
At £3.22 per million tokens (single-stream), the 5080 actually edges out the RTX 3090 on per-token cost while delivering 32% more throughput. Batched, you are looking at roughly £2.01/M. Use our cost calculator to model your own traffic patterns.
Where This Fits
Eighty-two tokens per second puts Phi-3 Mini responses well within the “feels instant” range for end users. This is a strong choice for production chatbots, real-time extraction pipelines, and any workload that demands both speed and the model’s reasoning capability. If you need even more headroom for multi-model deployments, the RTX 5090 with 32 GB takes things further.
Spin it up:

```shell
# Note: the benchmarks above were run at FP16; this example loads a Q4_K_M build
docker run --gpus all -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/phi-3-mini.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
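Once the container is up, the llama.cpp server exposes a `/completion` endpoint that takes a JSON body with `prompt` and `n_predict`. A minimal stdlib-only client sketch, assuming the server is reachable on localhost:8080:

```python
# Minimal client for the llama.cpp server container started above.
# Assumes the server is listening on localhost:8080.
import json
import urllib.request

def complete(prompt, n_predict=64, host="http://localhost:8080"):
    """POST a prompt to the llama.cpp /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{host}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]
```

Call it as `complete("Explain GDDR7 in one sentence.")` to smoke-test the deployment before pointing real traffic at it.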
More detail in the Phi-3 hosting guide. Related reads: best GPU for LLM inference, full benchmark index, and tok/s comparison tool.
82 tok/s Phi-3 Mini — RTX 5080 Servers
Blackwell-generation speed at a flat monthly rate. UK datacentre, root access included.
Order an RTX 5080