Phi-3-mini on the RTX 5060 Ti 16GB delivers the lowest cost per million tokens of any serious self-hosted LLM we offer. A small model plus huge concurrency on 16 GB makes it a volume-economics machine.
Throughput
Phi-3-mini BF16 on 5060 Ti benefits enormously from batching:
- Batch 1: ~135 t/s
- Batch 16: ~1,100 t/s aggregate
- Batch 32: ~1,400 t/s aggregate
- Batch 64: ~1,550 t/s aggregate peak
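The trade-off behind these numbers is that aggregate throughput rises with batch size while each individual stream slows down. A small sketch using the figures above (the `per_stream_rate` helper is illustrative, not part of any serving stack):

```python
# Benchmark figures from above: aggregate tokens/sec at each batch size.
THROUGHPUT = {1: 135, 16: 1100, 32: 1400, 64: 1550}

def per_stream_rate(batch: int) -> float:
    """Aggregate throughput divided across concurrent streams."""
    return THROUGHPUT[batch] / batch

for b in sorted(THROUGHPUT):
    print(f"batch {b:2d}: {THROUGHPUT[b]:5d} t/s aggregate, "
          f"{per_stream_rate(b):6.2f} t/s per stream")
```

Going from batch 1 to batch 32 multiplies total output by ~10x while each stream still sees ~44 t/s, which is why the card is priced on volume rather than single-stream speed.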
Monthly Capacity
At 50% utilisation on batch 32:
- Output tokens: ~1.8 billion/month
- Input tokens (3:1 input:output): ~5.4B/month
- Blended: ~7.3B tokens/month
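The capacity figures follow from straightforward arithmetic, assuming a 30-day month and the batch-32 aggregate rate above:

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 30-day month: 2,592,000 s
AGGREGATE_TPS = 1400                     # batch-32 aggregate from above
UTILISATION = 0.5

output_tokens = AGGREGATE_TPS * UTILISATION * SECONDS_PER_MONTH
input_tokens = output_tokens * 3         # assumed 3:1 input:output ratio
blended_tokens = output_tokens + input_tokens

print(f"output:  {output_tokens / 1e9:.2f}B/month")   # ~1.81B
print(f"blended: {blended_tokens / 1e9:.2f}B/month")  # ~7.26B
```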
Cost Per Million Tokens
At ~£300/month dedicated hosting:
- Blended cost per million tokens: £300 / 7,300M ≈ £0.04 per M tokens
- At 80% utilisation (high-QPS backend): ~£0.025 per M tokens
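Both cost figures fall out of the capacity numbers (80% utilisation buys 1.6x the tokens of 50% for the same £300):

```python
MONTHLY_COST_GBP = 300.0
BLENDED_MILLIONS = 7_300          # ~7.3B blended tokens = 7,300M

cost_50 = MONTHLY_COST_GBP / BLENDED_MILLIONS
cost_80 = MONTHLY_COST_GBP / (BLENDED_MILLIONS * 0.8 / 0.5)  # 1.6x tokens

print(f"50% utilisation: £{cost_50:.4f}/M tokens")  # ~£0.041
print(f"80% utilisation: £{cost_80:.4f}/M tokens")  # ~£0.026
```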
Compare to APIs:
- OpenAI GPT-4o-mini blended: ~$0.30/M – ~6-10x more expensive
- Together Phi-3 (if offered): ~$0.10/M – 2-3x more expensive
- Anthropic Haiku: ~$2.50 blended – ~50-80x more expensive
Where It Pays Back
- High-volume classification and tagging (20k+ decisions/hour)
- Lightweight chat with many concurrent users
- Structured output extraction at scale
- Routing layer before hitting a larger model
- Social listening, sentiment analysis
- Content moderation at volume
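The routing-layer pattern from the list above can be sketched as follows; the confidence signal, the 0.8 threshold, and the model names are illustrative assumptions, not a description of our stack:

```python
def route(small_model_confidence: float, threshold: float = 0.8) -> str:
    """Handle the request on Phi-3-mini when a cheap first pass is
    confident enough; otherwise escalate to a larger model.
    The confidence source (e.g. a classifier head or logprob
    heuristic) and the threshold are illustrative assumptions."""
    return "phi-3-mini" if small_model_confidence >= threshold else "llama-3-8b"

print(route(0.93))  # confident: handled cheaply on-card
print(route(0.41))  # uncertain: escalated to the larger model
```

Because most routing decisions resolve on the cheap model, the expensive model only sees the minority of requests that need it.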
Pick Phi-3-mini When
- Your task is bounded (classification, extraction) rather than open-ended
- Volume > 100k requests/day
- Per-request latency budget < 500 ms
- You don't need quality above Phi-3-mini's ceiling
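A quick sanity check on the latency criterion, using the batch-32 figures above and ignoring prefill and queueing (a simplification):

```python
PER_STREAM_TPS = 1400 / 32   # ~43.75 t/s per stream at batch 32
OUTPUT_TOKENS = 20           # a typical short classification/extraction reply

latency_s = OUTPUT_TOKENS / PER_STREAM_TPS
print(f"~{latency_s * 1000:.0f} ms")   # ~457 ms: inside a 500 ms budget
```

Short bounded outputs are what make the 500 ms budget achievable at full batch depth; long open-ended generations would not fit it.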
For workloads needing broader reasoning, use Llama 3 8B on the same card.
Cheapest Tokens on Dedicated GPU
Phi-3-mini at massive concurrency on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: deployment guide, classification use case.