For self-hosted production AI, capacity planning is about predicting when you'll need to scale and pre-provisioning the next tier before users notice degradation. Standard load testing + traffic forecasting + headroom math gives you the answer.
Capacity model: per-GPU concurrent-user limit at p99 TTFT SLO. Plan for 2× forecast peak as headroom. Scaling triggers: sustained > 70% of capacity over 7 days, p99 latency degrading, queue depth rising. Ramp up by adding replicas (data parallel) before tier-jumping.
Capacity model
For each GPU + model combination, characterise:
- Concurrent users at p99 TTFT < SLO (typically 2 s)
- Sustained aggregate throughput (tokens/s)
- Maximum concurrency before errors begin: the absolute ceiling
Reference numbers (from earlier benchmark batches), for Mistral 7B at FP8: 5060 Ti ~30 concurrent users, 4090 ~80, 5090 ~150. Plan headroom from these; a sweep sketch for measuring your own numbers follows.
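One way to get these numbers on your own stack is a concurrency sweep: hold N streaming requests open at once, record time-to-first-token for each, and find the largest N whose p99 still clears the SLO. A minimal sketch, assuming an OpenAI-compatible /v1/chat/completions endpoint on localhost:8000 and the aiohttp library; the URL, model name, prompt, and concurrency levels are placeholders to swap for your own.

```python
import asyncio
import time
import aiohttp

URL = "http://localhost:8000/v1/chat/completions"   # assumed OpenAI-compatible endpoint
MODEL = "mistral-7b-fp8"                             # placeholder model name
SLO_TTFT_S = 2.0

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]

async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one streaming request, return time-to-first-token in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarise capacity planning in one line."}],
        "stream": True,
        "max_tokens": 128,
    }
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        async for line in resp.content:              # first SSE data chunk ~ first token
            if line.strip().startswith(b"data:"):
                return time.perf_counter() - start
    return float("inf")                               # stream ended without a token

async def sweep(concurrency: int, rounds: int = 3) -> float:
    """Run `rounds` waves of `concurrency` parallel requests, return p99 TTFT."""
    ttfts: list[float] = []
    async with aiohttp.ClientSession() as session:
        for _ in range(rounds):
            ttfts += await asyncio.gather(*[one_request(session) for _ in range(concurrency)])
    return p99(ttfts)

async def main():
    for n in (10, 20, 30, 40, 60, 80):
        value = await sweep(n)
        status = "OK" if value < SLO_TTFT_S else "OVER SLO"
        print(f"{n:3d} concurrent -> p99 TTFT {value:.2f}s  {status}")

asyncio.run(main())
```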
Planning
- Measure current peak: 95th percentile of concurrent users over the last 30 days
- Forecast 30-90 days out: extrapolate the growth trend
- Add 2× headroom: capacity = 2 × forecast peak (worked through in the sketch after this list)
- Identify next tier: which GPU / replica configuration delivers that capacity?
- Plan migration: standard blue-green pattern
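The middle three steps reduce to a few lines of arithmetic plus a lookup against your per-tier capacity numbers. A sketch, assuming a simple compound monthly growth rate and the illustrative tier capacities above; the 2× replica entries assume roughly linear data-parallel scaling, which you should verify against your own measurements.

```python
# Per-tier concurrent-user capacity at p99 TTFT < 2 s (illustrative numbers from the
# reference measurements above; replace with your own sweep results).
TIER_CAPACITY = {
    "1x 5060 Ti": 30,
    "1x 4090":    80,
    "2x 4090":    160,   # assumed ~linear scaling across data-parallel replicas
    "1x 5090":    150,
    "2x 5090":    300,
}

def required_capacity(current_peak: float, monthly_growth: float,
                      horizon_months: float, headroom: float = 2.0) -> float:
    """Forecast peak concurrency at the horizon, then apply the 2x headroom rule."""
    forecast_peak = current_peak * (1 + monthly_growth) ** horizon_months
    return headroom * forecast_peak

def next_tier(capacity_needed: float) -> str:
    """Pick the smallest listed tier that still covers the required capacity."""
    viable = {tier: cap for tier, cap in TIER_CAPACITY.items() if cap >= capacity_needed}
    if not viable:
        return "no single listed tier is enough -- add more replicas"
    return min(viable, key=viable.get)

# Example: p95 peak of 40 concurrent users, 15% monthly growth, planning 3 months out.
need = required_capacity(current_peak=40, monthly_growth=0.15, horizon_months=3)
print(f"required capacity: {need:.0f} concurrent users -> {next_tier(need)}")
```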
Scaling triggers
Three concrete triggers to act on (a weekly check is sketched after the list):
- Sustained > 70% of capacity over 7 days: order next tier
- p99 TTFT degrading week-over-week: capacity-bound; scale
- Queue depth p95 > 10: load shedding starting; scale
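If your serving stack already exports utilisation, TTFT, and queue-depth metrics, the triggers are cheap to evaluate as a weekly check. A sketch using hypothetical aggregates pulled from your metrics store; the 10% week-over-week degradation threshold is an assumption, not part of the triggers above.

```python
from dataclasses import dataclass

@dataclass
class WeeklyStats:
    """Aggregates you'd pull from your metrics store (Prometheus, etc.)."""
    avg_utilization: float        # fraction of measured capacity over 7 days, 0.0-1.0
    p99_ttft_this_week_s: float
    p99_ttft_last_week_s: float
    queue_depth_p95: float

def scale_triggers(s: WeeklyStats) -> list[str]:
    """Return the scaling triggers that fired this week."""
    fired = []
    if s.avg_utilization > 0.70:
        fired.append("sustained >70% of capacity over 7 days: order next tier")
    # Assumed threshold: treat >10% week-over-week p99 TTFT growth as degradation.
    if s.p99_ttft_this_week_s > s.p99_ttft_last_week_s * 1.10:
        fired.append("p99 TTFT degrading week-over-week: capacity-bound, scale")
    if s.queue_depth_p95 > 10:
        fired.append("queue depth p95 > 10: load shedding starting, scale")
    return fired

# Example week: 75% utilisation, p99 TTFT up from 1.4 s to 1.8 s, queues still shallow.
print(scale_triggers(WeeklyStats(0.75, 1.8, 1.4, 4)))
```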
Verdict
Capacity planning for self-hosted AI is straightforward once you have the per-tier capacity numbers. Measure, forecast, plan headroom, scale before degradation. The biggest mistake is reactive scaling — provisioning new hardware after users have already complained.
Bottom line
Plan capacity ahead; scale before degradation. See auto-scaling patterns.