
Health Check Endpoints for an LLM API

Liveness and readiness probes for a self-hosted LLM API - what each should check and how to configure them for load balancer integration.

Load balancers and orchestrators depend on health check endpoints to route traffic. A naive HTTP 200 response is insufficient for an LLM service – the server can be up while the model is not yet loaded. On our dedicated GPU hosting, the right pattern splits liveness from readiness.

Two Kinds of Health Check

  • Liveness: is the process alive? If no, restart.
  • Readiness: is the service ready to handle requests? If no, don’t route traffic.

Both fail during startup. Only readiness fails during a temporary overload.

Liveness

vLLM exposes /health, which returns 200 once the server has finished loading the model. For liveness, this is enough. Kubernetes config:

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # model load takes time
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
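Before wiring the probe into Kubernetes, it is worth hitting the endpoint by hand. A quick manual check (assuming the server listens on localhost:8000):

```shell
# Print only the HTTP status code; 200 means the server is live.
# A connection failure prints 000.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```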

Readiness

For readiness, you want a check that actually exercises inference. An approach: a /v1/models call plus a tiny generation:

curl -sf http://localhost:8000/v1/models && \
curl -sf -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"prompt\": \"hi\", \"max_tokens\": 1}"

Wrap this in a readiness script that returns 0 on success. If either fails, the replica is not ready.
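A sketch of such a readiness script (BASE_URL, MODEL, and the defaults below are assumptions; adjust to your deployment):

```shell
#!/bin/sh
# readiness.sh - succeed only if the API lists models AND can generate a token.

check_ready() {
    base="${BASE_URL:-http://localhost:8000}"
    model="${MODEL:-my-model}"

    # 1. The model registry must respond.
    curl -sf --max-time 5 "$base/v1/models" > /dev/null || return 1

    # 2. A one-token generation must succeed end to end.
    curl -sf --max-time 10 -X POST "$base/v1/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"$model\", \"prompt\": \"hi\", \"max_tokens\": 1}" \
        > /dev/null || return 1
}

check_ready && echo "ready"
```

The script's exit status follows the last command, so a Kubernetes exec probe or a plain shell caller gets 0 on success and non-zero otherwise.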

Load Balancer Integration

Open-source nginx supports only passive health checks out of the box (active checks require nginx Plus or the third-party ngx_http_upstream_check_module). Passive checks are configured with max_fails and fail_timeout:

upstream llm {
    server replica1:8000 max_fails=3 fail_timeout=30s;
    server replica2:8000 max_fails=3 fail_timeout=30s;
}

With passive checks, after max_fails failed responses within fail_timeout (here, 3 failures in 30 seconds), nginx stops sending traffic to the replica for another 30 seconds. Not as responsive as active probes, but zero-config.
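If you do run nginx Plus, an active check might look like the following sketch (the exact parameters are assumptions; check the nginx Plus documentation for your version):

```nginx
upstream llm {
    zone llm 64k;              # shared memory zone, required for health_check
    server replica1:8000;
    server replica2:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://llm;
        # Probe /health every 5s; mark down after 3 failures, up after 2 passes.
        health_check uri=/health interval=5 fails=3 passes=2;
    }
}
```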

For Kubernetes, readiness probes integrate with Service endpoints automatically.
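Wiring the readiness script into Kubernetes as an exec probe might look like this (the script path and timings below are assumptions; bake the script into your image):

```yaml
readinessProbe:
  exec:
    command: ["/bin/sh", "/usr/local/bin/readiness.sh"]
  initialDelaySeconds: 120   # wait for model load, as with liveness
  periodSeconds: 15
  timeoutSeconds: 20         # the one-token generation can be slow under load
  failureThreshold: 2
```

Once the probe fails, the pod is removed from the Service's endpoints and stops receiving traffic, without being restarted.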

See also: graceful shutdown and load balancer.
