
Graceful Error Handling for AI APIs

Production-grade error handling for LLM APIs — structured errors, retry semantics, user-friendly messages.

Production AI APIs need careful error handling. LLM serving introduces error classes web apps don't have (context-window-exceeded, rate-limiting by a hosted fallback, malformed structured output). Combining standard HTTP error semantics, structured error codes, and user-friendly messages prevents confusing failures.

TL;DR

Five error classes: 4xx user errors (rate limit, context-too-long, malformed input), 5xx infrastructure (GPU overloaded, model not loaded), quality errors (output validation failed), upstream errors (hosted-API fallback failed), quota errors (per-tenant budget exceeded). Return OpenAI-compatible error shapes; structured error codes for client-side handling.

Error classes

  • 400 invalid_request: malformed JSON, missing required fields
  • 401 unauthorised: missing / invalid API key
  • 403 forbidden: per-tenant tier doesn't allow this model / context-length
  • 413 context_length_exceeded: input + output exceeds max-model-len
  • 429 rate_limited: per-tenant rate / quota exceeded
  • 500 internal_error: vLLM crash, unhandled exception
  • 503 service_unavailable: temporary capacity issue (GPU OOM, queue full)
  • 504 gateway_timeout: request exceeded timeout (long generation)
  • Custom: 422 output_validation_failed: structured output didn't parse (rare with guided decoding)
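The classes above boil down to a small lookup table. A minimal sketch in Python (names are illustrative, not tied to any framework; retryability follows the rules in the next section, with 500 treated as non-retryable since the list below omits it):

```python
# Map structured error codes to (HTTP status, retryable?).
ERROR_CLASSES = {
    "invalid_request":          (400, False),
    "unauthorised":             (401, False),
    "forbidden":                (403, False),
    "context_length_exceeded":  (413, False),
    "output_validation_failed": (422, False),
    "rate_limited":             (429, True),
    "internal_error":           (500, False),
    "service_unavailable":      (503, True),
    "gateway_timeout":          (504, True),
}

def status_for(code: str) -> int:
    """HTTP status to return for a structured error code."""
    return ERROR_CLASSES[code][0]

def is_retryable(code: str) -> bool:
    """Whether a client may safely retry this error class."""
    return ERROR_CLASSES[code][1]
```

Keeping this table in one place means the gateway, the logs, and the client SDK all agree on what each code means.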

Use OpenAI's error response shape for compatibility:

{
  "error": {
    "message": "Context length exceeded: 35K tokens, max 32K",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}

Retry semantics

  • Retryable: 429 (after rate-limit window), 503, 504
  • Not retryable: 400, 401, 403, 413, 422 (won't succeed without input change)
  • Retry strategy: exponential backoff with jitter; max 3-5 retries
  • Idempotency: client should pass idempotency_key for safe retry
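Client-side, those rules translate into a loop like the following sketch (standard library only; `call` stands in for a hypothetical HTTP request function returning a status and body — in production it would also send the idempotency_key header and honour any Retry-After value on 429):

```python
import random
import time

RETRYABLE = {429, 503, 504}

def with_retries(call, max_retries: int = 4,
                 base: float = 0.5, cap: float = 8.0):
    """Retry `call` on retryable statuses with exponential
    backoff plus full jitter; return the final (status, body)."""
    for attempt in range(max_retries + 1):
        status, body = call()
        if status < 400:
            return status, body
        if status not in RETRYABLE or attempt == max_retries:
            # Non-retryable error, or retry budget exhausted.
            return status, body
        delay = min(cap, base * 2 ** attempt) * random.random()
        time.sleep(delay)
```

Full jitter (multiplying the whole backoff by a random factor) spreads retries out so clients don't stampede the server in lockstep after an outage.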

User-facing

Translate technical errors to actionable user messages:

  • 413 → "Your input is too long; try shortening it" (not "context_length_exceeded")
  • 429 → "You've hit your usage limit; upgrade or wait"
  • 503 → "Service is temporarily slow; retrying…" (often handled silently with backoff)
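A thin translation layer keeps raw codes out of the UI. A sketch, with illustrative copy and a deliberate fallback so internals never leak:

```python
# Illustrative copy; tune per product. Codes match the error classes above.
USER_MESSAGES = {
    "context_length_exceeded": "Your input is too long; try shortening it.",
    "rate_limited": "You've hit your usage limit; upgrade or wait.",
    "service_unavailable": "Service is temporarily slow; retrying…",
}

def user_message(code: str) -> str:
    """Translate a structured error code into user-facing text."""
    # Unknown codes fall back to a generic message rather than
    # exposing internal error strings to the user.
    return USER_MESSAGES.get(code, "Something went wrong. Please try again.")
```

The fallback branch matters: new error codes will appear as the backend evolves, and the UI should degrade to a generic message rather than display `output_validation_failed` verbatim.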

Verdict

Production AI APIs need structured error responses, OpenAI-compatible shapes, sensible retry semantics, and user-friendly messages. Build them in from day one; clients depend on these conventions for resilient integration. Hosted-API conventions exist for a reason — copy them.

Bottom line

OpenAI-compatible errors; client-friendly retry. See OpenAI API guide.

gigagpu

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
