A canary rollout routes a small fraction of traffic to the new model while the majority stays on the current one. Metrics and error rates reveal regressions before they affect everyone. On our dedicated GPU hosting this is the safest pattern for model upgrades when you cannot fully validate in staging.
Contents
Traffic Splitting
nginx supports weighted upstream servers:
upstream llm {
server stable01:8000 weight=95;
server canary01:8000 weight=5;
}
95% of requests go to stable, 5% to canary. Adjust weights as confidence grows: 5% -> 25% -> 50% -> 100%.
Metrics
Watch per-upstream:
- p50 and p99 latency – should match or improve
- 5xx error rate – should stay at baseline
- Token generation rate (tokens/sec) – quality proxy
- User-facing metrics where possible (thumbs up/down, retry rate)
In Grafana, plot the canary metric next to stable. Any divergence is a signal.
Promotion
Define a promotion gate:
- No 5xx rate increase over 15+ minutes at current traffic level
- p99 latency within 10% of stable
- No degraded user-facing metrics
If gate passes, double the canary weight. Repeat until canary is at 100%. Typical timeline: 30 minutes to 4 hours depending on traffic volume and risk tolerance.
Rollback
Automate it. A script watching metrics can drop canary weight to 0 on first sign of regression:
if error_rate(canary) > 1.5 * error_rate(stable):
set_weight(canary, 0)
alert_oncall("canary rolled back: elevated errors")
Canary-Ready LLM Hosting
Multi-replica UK dedicated GPU hosting with nginx traffic splitting preconfigured.
Browse GPU Servers