Load balancers and orchestrators depend on health check endpoints to route traffic. A naive HTTP 200 response is insufficient for an LLM service: the server process can be up while the model is not yet loaded. On our dedicated GPU hosting, the right pattern splits liveness from readiness.
Two Kinds
- Liveness: is the process alive? If no, restart.
- Readiness: is the service ready to handle requests? If no, don’t route traffic.
Both fail during startup. During a temporary overload, only readiness should fail: restarting an overloaded but healthy process makes things worse.
Liveness
vLLM exposes /health, which returns 200 once the server has loaded. For liveness, this is enough. Kubernetes config:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # model load takes time
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
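A fixed initialDelaySeconds wastes time when the model loads quickly and fails when it loads slowly. Kubernetes also offers a startupProbe, which holds off the liveness probe until it first succeeds. A sketch, reusing the same /health endpoint (the timings are assumptions to tune per model):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30   # allows up to 300s of model loading before restart
```

Once the startup probe succeeds, the liveness probe above takes over and its initialDelaySeconds can be dropped.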
Readiness
For readiness, you want a check that actually exercises inference. An approach: a /v1/models call plus a tiny generation:
curl -sf http://localhost:8000/v1/models && \
curl -sf -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"max_tokens\":1}"
(The request body uses double quotes so $MODEL expands; inside single quotes it would be sent literally.)
Wrap this in a readiness script that exits 0 on success. If either call fails, the replica is not ready.
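A minimal sketch of such a script, assuming a vLLM replica on port 8000 (the function name and argument order are this example's own):

```shell
#!/bin/sh
# check_ready BASE_URL MODEL
# Exits 0 only if the model list responds AND a 1-token generation succeeds.
check_ready() {
    base="$1"    # e.g. http://localhost:8000
    model="$2"   # served model name
    curl -sf "$base/v1/models" > /dev/null || return 1
    curl -sf -X POST "$base/v1/completions" \
        -H 'Content-Type: application/json' \
        -d "{\"model\":\"$model\",\"prompt\":\"hi\",\"max_tokens\":1}" \
        > /dev/null || return 1
    return 0
}

# Usage: check_ready http://localhost:8000 "$MODEL"; exit $?
```

The tiny generation matters: /v1/models can answer while the engine is wedged, but a 1-token completion exercises the full inference path.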
LB Integration
Open-source nginx supports passive checks via max_fails/fail_timeout; active checks require nginx Plus (health_check) or the third-party ngx_http_upstream_check_module. The passive setup:
upstream llm {
    server replica1:8000 max_fails=3 fail_timeout=30s;
    server replica2:8000 max_fails=3 fail_timeout=30s;
}
With passive checks, after 3 failed responses within 30 seconds, nginx stops sending traffic to the replica for 30 seconds. Not as responsive as active probes, but it needs no extra module.
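For comparison, an active probe on nginx Plus might look like this (a sketch; health_check is Plus-only, the /health path is the vLLM endpoint from above, and the intervals are assumptions):

```nginx
upstream llm {
    zone llm 64k;          # shared memory zone required by health_check
    server replica1:8000;
    server replica2:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://llm;
        health_check uri=/health interval=5s fails=3 passes=2;
    }
}
```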
For Kubernetes, readiness probes integrate with Service endpoints automatically: a pod failing its readiness probe is removed from the Service's endpoint list until it passes again.
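Wiring the readiness script in as an exec probe might look like the following sketch (the /opt/probes/readiness.sh path and the timings are assumptions):

```yaml
readinessProbe:
  exec:
    command: ["/bin/sh", "/opt/probes/readiness.sh"]
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 10    # the 1-token generation can be slow under load
  failureThreshold: 2
```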
See also: graceful shutdown and load balancer.