Production AI services occasionally degrade: GPU thermal throttling, queue overflow, hosted-API rate limits, model loading delays. Graceful degradation patterns turn these events from outages into UX speed bumps. Plan for them; don't hope you'll never need them.
Five degradation patterns: (1) fall back to a smaller, faster model, (2) serve a cached response, (3) switch to a simpler prompt template, (4) fall back to a hosted API, (5) queue the request with a degraded UX (loading spinner with a longer wait). Trigger them via a circuit breaker on latency and error rate. Always degrade visibly: tell users what's happening.
Strategies
- Smaller-model fallback: primary unavailable → route to a smaller, faster model (see the routing sketch after this list). Quality degrades; the service stays up.
- Cache fallback: serve a previously cached response for a similar query. Slightly stale; instant.
- Simpler prompt: drop optional context and use simpler instructions. Faster generation; lower quality.
- Hosted-API fallback: route to Claude / GPT-4o when self-hosted serving is unavailable. Pricier per token, but it keeps working.
- Queue + extended wait: hold the request and show a "working on it" state with a longer-than-usual wait.
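A minimal routing sketch tying the first four strategies together, assuming a `generate(model, prompt)` callable that raises on failure and a dict-like cache; the model names and tier order are illustrative, and a production cache would typically match on embedding similarity rather than the exact prompt string:

```python
# Hypothetical fallback chain; swap in your actual serving clients.
FALLBACK_CHAIN = [
    ("primary", "llama-3.1-70b"),   # full quality, self-hosted
    ("small", "llama-3.1-8b"),      # smaller, faster fallback
    ("hosted", "claude-sonnet"),    # hosted API: pricier, separate failure domain
]

def generate_with_fallback(prompt, generate, cache):
    """Try each tier in order; fall back to the cache as a last resort.

    Returns the text plus tier/degraded tags so the UX layer can
    surface what happened (see the UX section below).
    """
    for tier, model in FALLBACK_CHAIN:
        try:
            text = generate(model, prompt)   # raises on failure or timeout
            cache[prompt] = text             # refresh the cache on success
            return {"text": text, "tier": tier, "degraded": tier != "primary"}
        except Exception:
            continue                         # try the next tier in the chain
    if prompt in cache:                      # slightly stale, but instant
        return {"text": cache[prompt], "tier": "cache", "degraded": True}
    raise RuntimeError("all fallback tiers exhausted")
```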
Triggers
Implement these via the circuit breaker pattern (a sketch follows the list):
- Error rate > 5% over 1 minute: trip the breaker, route to the fallback
- p99 TTFT (time to first token) > 2× the SLO over 2 minutes: degrade
- vLLM queue depth > threshold: shed load to the fallback
- Health-check failure: immediate fallback
- Recovery: half-open after 30 seconds, then probe gradually before closing fully
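A minimal sketch of the error-rate trigger and half-open recovery, with illustrative thresholds; a production breaker would track latency percentiles and queue depth in the same window, and would ramp traffic gradually rather than close on a single successful probe:

```python
import time
from collections import deque

class CircuitBreaker:
    """Error-rate breaker: trips above 5% errors over a 60 s window,
    goes half-open after a 30 s cooldown. Thresholds are illustrative."""

    def __init__(self, error_threshold=0.05, window_s=60.0, cooldown_s=30.0,
                 min_samples=20):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples    # avoid tripping on a handful of requests
        self.events = deque()             # (timestamp, ok) outcomes in the window
        self.opened_at = None             # None means closed (primary serves)

    def allow_primary(self):
        if self.opened_at is None:
            return True                   # closed: primary serves all traffic
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                   # half-open: let one probe through
        return False                      # open: route everything to the fallback

    def record(self, ok):
        now = time.monotonic()
        if self.opened_at is not None:
            # Half-open probe result: success closes the breaker, failure re-trips.
            if ok:
                self.events.clear()
                self.opened_at = None
            else:
                self.opened_at = now
            return
        self.events.append((now, ok))
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()         # drop outcomes older than the window
        errors = sum(1 for _, was_ok in self.events if not was_ok)
        if (len(self.events) >= self.min_samples
                and errors / len(self.events) > self.error_threshold):
            self.opened_at = now          # trip: route traffic to the fallback
```

Wrap each primary call with `record(ok=True)` on success and `record(ok=False)` on failure or timeout, and check `allow_primary()` before routing; while the breaker is open, requests go straight to the fallback chain.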
UX
Always degrade visibly to users (a response-metadata sketch follows the list):
- "Using a faster model right now" if quality dropped
- "This response is from cache" if served from cache
- "Slightly slower than usual" with status indicator
- Don't silently degrade quality; users notice and trust drops
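One way to keep degradation visible is to return explicit metadata with every response so the frontend can render a banner or status indicator. A sketch, assuming the result dict from the routing example above; the field names and notice strings are hypothetical:

```python
# Illustrative notice strings keyed by the fallback tier from the routing sketch.
DEGRADATION_NOTICES = {
    "small": "Using a faster model right now; answers may be briefer.",
    "cache": "This response was served from cache and may be slightly out of date.",
    "hosted": None,    # comparable quality: no banner needed, but still log the event
}

def to_api_response(result):
    """Attach an explicit degradation notice the frontend can render
    as a banner or status indicator. Never strip this field."""
    notice = DEGRADATION_NOTICES.get(result["tier"]) if result["degraded"] else None
    return {"text": result["text"], "degraded": result["degraded"], "notice": notice}
```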
Verdict
Graceful degradation patterns turn AI-tier issues from outages into speed bumps. Implement them before you need them; circuit breakers and fallback routing are standard SRE patterns applied to LLM serving. Always degrade visibly: user trust depends on it.
Bottom line
Degrade gracefully and visibly. See incident response.