Load balancers and orchestrators depend on health check endpoints to route traffic. A naive HTTP 200 response is insufficient for an LLM service: the server process can be up while the model is not yet loaded. On our dedicated GPU hosting, the right pattern splits liveness from readiness.
Two Kinds
- Liveness: is the process alive? If no, restart.
- Readiness: is the service ready to handle requests? If no, don’t route traffic.
Both fail during startup. During a temporary overload, only readiness should fail: restarting an overloaded but healthy process makes things worse.
Liveness
vLLM exposes /health, which returns 200 once the server has loaded. For liveness, this is enough. Kubernetes config:
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # model load takes time
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
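A fixed initialDelaySeconds wastes time when the model loads quickly and fails when it loads slowly. Kubernetes also offers a startupProbe, which holds off the liveness probe until it first succeeds. A sketch, reusing the same /health endpoint (the timings are assumptions to tune per model):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30   # allows up to 300s of model loading before restart
```

Once the startup probe succeeds, the liveness probe above takes over and its initialDelaySeconds can be dropped.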
Readiness
For readiness, you want a check that actually exercises inference. An approach: a /v1/models call plus a tiny generation:
curl -sf http://localhost:8000/v1/models && \
curl -sf -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"max_tokens\":1}"
(The request body uses double quotes so $MODEL expands; inside single quotes it would be sent literally.)
Wrap this in a readiness script that exits 0 on success. If either call fails, the replica is not ready.
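A minimal sketch of such a script, assuming a vLLM replica on port 8000 (the function name and argument order are this example's own):

```shell
#!/bin/sh
# check_ready BASE_URL MODEL
# Exits 0 only if the model list responds AND a 1-token generation succeeds.
check_ready() {
    base="$1"    # e.g. http://localhost:8000
    model="$2"   # served model name
    curl -sf "$base/v1/models" > /dev/null || return 1
    curl -sf -X POST "$base/v1/completions" \
        -H 'Content-Type: application/json' \
        -d "{\"model\":\"$model\",\"prompt\":\"hi\",\"max_tokens\":1}" \
        > /dev/null || return 1
    return 0
}

# Usage: check_ready http://localhost:8000 "$MODEL"; exit $?
```

The tiny generation matters: /v1/models can answer while the engine is wedged, but a 1-token completion exercises the full inference path.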
LB Integration
Open-source nginx supports passive checks via max_fails/fail_timeout; active checks require nginx Plus (health_check) or the third-party ngx_http_upstream_check_module. The passive setup:
upstream llm {
    server replica1:8000 max_fails=3 fail_timeout=30s;
    server replica2:8000 max_fails=3 fail_timeout=30s;
}
With passive checks, after 3 failed responses within 30 seconds, nginx stops sending traffic to the replica for 30 seconds. Not as responsive as active probes, but it needs no extra module.
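For comparison, an active probe on nginx Plus might look like this (a sketch; health_check is Plus-only, the /health path is the vLLM endpoint from above, and the intervals are assumptions):

```nginx
upstream llm {
    zone llm 64k;          # shared memory zone required by health_check
    server replica1:8000;
    server replica2:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://llm;
        health_check uri=/health interval=5s fails=3 passes=2;
    }
}
```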
For Kubernetes, readiness probes integrate with Service endpoints automatically: a pod failing its readiness probe is removed from the Service's endpoint list until it passes again.
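Wiring the readiness script in as an exec probe might look like the following sketch (the /opt/probes/readiness.sh path and the timings are assumptions):

```yaml
readinessProbe:
  exec:
    command: ["/bin/sh", "/opt/probes/readiness.sh"]
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 10    # the 1-token generation can be slow under load
  failureThreshold: 2
```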
See also: graceful shutdown and load balancer.