Production AI APIs need careful error handling. LLM serving introduces error classes web apps don't have (context-window-exceeded, rate-limited by hosted fallback, malformed structured output). Standard error semantics + structured error codes + user-friendly messages prevent confusing failures.
Five error classes: 4xx user errors (rate limit, context-too-long, malformed input), 5xx infrastructure (GPU overloaded, model not loaded), quality errors (output validation failed), upstream errors (hosted-API fallback failed), quota errors (per-tenant budget exceeded). Return OpenAI-compatible error shapes; structured error codes for client-side handling.
Error classes
- 400 invalid_request: malformed JSON, missing required fields
- 401 unauthorized: missing / invalid API key
- 403 forbidden: per-tenant tier doesn't allow this model / context-length
- 413 context_length_exceeded: input + output exceeds max-model-len
- 429 rate_limited: per-tenant rate / quota exceeded
- 500 internal_error: vLLM crash, unhandled exception
- 503 service_unavailable: temporary capacity issue (GPU OOM, queue full)
- 504 gateway_timeout: request exceeded timeout (long generation)
- Custom: 422 output_validation_failed: structured output didn't parse (rare with guided decoding)
Use OpenAI's error response shape for compatibility:
```json
{
  "error": {
    "message": "Context length exceeded: 35K tokens, max 32K",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}
```
Retry semantics
- Retryable: 429 (after rate-limit window), 503, 504
- Not retryable: 400, 401, 403, 413, 422 (won't succeed without input change)
- Retry strategy: exponential backoff with jitter; max 3-5 retries
- Idempotency: client should pass an idempotency_key for safe retry
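The retry rules above can be sketched as a client-side wrapper: retry only 429/503/504, exponential backoff with full jitter, a capped attempt count, and one stable idempotency key per logical request (names and signature are hypothetical):

```python
import random
import time
import uuid

RETRYABLE = {429, 503, 504}  # 4xx input errors won't succeed on retry


def with_retries(send, max_retries: int = 4, base_delay: float = 0.5):
    """Call send(idempotency_key) until success or a non-retryable error.

    `send` is any callable returning (status_code, payload).
    """
    # Same key across all attempts, so the server can dedupe retries.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        status, payload = send(idempotency_key)
        if status not in RETRYABLE or attempt == max_retries:
            return status, payload
        # Exponential backoff with full jitter: uniform in [0, base * 2^attempt)
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return status, payload
```

For 429 specifically, prefer the server's Retry-After header over the computed backoff when it is present.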
User-facing messages
Translate technical errors to actionable user messages:
- 413 → "Your input is too long; try shortening it" (not "context_length_exceeded")
- 429 → "You've hit your usage limit; upgrade or wait"
- 503 → "Service is temporarily slow; retrying…" (often handled silently with backoff)
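Concretely, this translation can live in one lookup table keyed by the structured error code, with a generic fallback for anything unmapped (messages illustrative):

```python
USER_MESSAGES = {
    "context_length_exceeded": "Your input is too long; try shortening it.",
    "rate_limited": "You've hit your usage limit; upgrade or wait.",
    "service_unavailable": "Service is temporarily slow; retrying…",
}


def user_message(code: str) -> str:
    """Map a machine-readable error code to an actionable user message."""
    return USER_MESSAGES.get(code, "Something went wrong; please try again.")
```

Because the mapping keys on error codes rather than messages or status numbers, the server can reword technical messages without breaking client copy.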
Verdict
Production AI APIs need structured error responses, OpenAI-compatible shapes, sensible retry semantics, and user-friendly messages. Build them in from day one; clients depend on these conventions for resilient integration. Hosted-API conventions exist for a reason — copy them.
Bottom line
OpenAI-compatible errors; client-friendly retry. See OpenAI API guide.