Your Inference API Died at 2 AM and Nobody Noticed
A GPU Xid error crashed the CUDA context at 2 AM. The vLLM process exited, the systemd service never restarted because of a misconfigured restart policy, and customers hit errors for six hours until someone checked Slack. Over long uptimes, GPU servers inevitably hit CUDA faults, OOM kills, driver hangs, and hardware glitches. The only real defense is automated recovery that brings inference back online in minutes, with no human in the loop. Every dedicated GPU server running production AI needs a layered recovery strategy.
Systemd Restart Policies for Inference
Proper systemd configuration handles most crash scenarios automatically:
# /etc/systemd/system/vllm-inference.service
[Unit]
Description=vLLM Inference Server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target
# If it crashes 5 times in 300 seconds, stop trying;
# a separate watchdog can reset-failed and restart it.
# On systemd 230 and later these settings belong in [Unit].
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=inference
ExecStart=/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b --port 8000
# Restart on ANY exit (crash, signal, OOM, etc.)
Restart=always
RestartSec=10
# OOM protection: kill this service last
OOMScoreAdjust=-900
# Environment
Environment=CUDA_VISIBLE_DEVICES=0,1
Environment=CUDA_HOME=/usr/local/cuda
# Resource limits
LimitNOFILE=65535
LimitMEMLOCK=infinity
[Install]
WantedBy=multi-user.target
# Reload and enable
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-inference
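That StartLimitBurst cap means systemd deliberately gives up after five rapid crashes. A small helper script, sketched here with an illustrative path, lets an external watchdog clear the lockout and retry once the underlying fault has been dealt with:
#!/bin/bash
# /opt/scripts/clear-start-limit.sh (illustrative path)
# Clears systemd's rate-limit lockout and starts the service again
SERVICE="vllm-inference"
if systemctl is-failed --quiet "$SERVICE"; then
    # reset-failed clears both the failed state and the restart counter
    sudo systemctl reset-failed "$SERVICE"
    sudo systemctl start "$SERVICE"
fi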
Automated GPU Reset After Faults
Some GPU failures require a device reset before processes can use the GPU again:
#!/bin/bash
# /opt/scripts/gpu-reset-recovery.sh
# Attempts to recover GPUs from fault states
LOG="/var/log/gpu-recovery.log"

log() {
    # Timestamp every entry at write time, not at script start
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $*" | tee -a "$LOG"
}

check_gpu_health() {
    # Returns 0 if nvidia-smi can enumerate the GPUs, non-zero otherwise
    nvidia-smi --query-gpu=index,gpu_bus_id --format=csv,noheader > /dev/null 2>&1
}

reset_gpu() {
    local GPU_ID=$1
    log "Resetting GPU $GPU_ID"
    # Kill any processes using this GPU (a reset fails while it is busy)
    nvidia-smi --id="$GPU_ID" --query-compute-apps=pid --format=csv,noheader | \
        xargs -r kill -9 2>/dev/null
    sleep 2
    # Attempt GPU reset
    if nvidia-smi --id="$GPU_ID" -r > /dev/null 2>&1; then
        log "GPU $GPU_ID reset successful"
    else
        log "GPU $GPU_ID reset FAILED — may need server reboot"
        return 1
    fi
}

# Check if nvidia-smi itself is responsive
if ! check_gpu_health; then
    log "nvidia-smi unresponsive — GPU driver may be hung"
    # Attempt driver reload (last resort before reboot)
    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null
    sleep 3
    sudo modprobe nvidia
    sudo modprobe nvidia_uvm  # CUDA needs the UVM module too
    sleep 2
    if ! check_gpu_health; then
        log "Driver reload failed — initiating reboot"
        sudo reboot
    fi
fi

# Check recent kernel messages for Xid errors. The kernel logs the PCI
# address (e.g. "NVRM: Xid (PCI:0000:3b:00.0): 79, ..."), so map each
# faulting bus ID back to an nvidia-smi index before resetting.
dmesg | grep -i "xid" | tail -5 | \
    grep -oP 'PCI:[0-9a-fA-F]+:\K[0-9a-fA-F:.]+' | sort -u | \
    while read -r BUS_ID; do
        GPU_ID=$(nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader | \
            grep -i "$BUS_ID" | head -1 | cut -d',' -f1 | tr -d ' ')
        GPU_ID=${GPU_ID:-0}  # fall back to GPU 0 if the mapping fails
        log "Xid error detected on GPU $GPU_ID (bus $BUS_ID)"
        reset_gpu "$GPU_ID"
    done
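The recovery script is a one-shot sweep, so something must schedule it. A minimal systemd timer, with unit names assumed here to match the paths above, runs it every five minutes:
# /etc/systemd/system/gpu-reset-recovery.service
[Unit]
Description=GPU fault recovery sweep
[Service]
Type=oneshot
ExecStart=/opt/scripts/gpu-reset-recovery.sh

# /etc/systemd/system/gpu-reset-recovery.timer
[Unit]
Description=Run GPU fault recovery every 5 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
[Install]
WantedBy=timers.target

# Activate the timer
sudo systemctl enable --now gpu-reset-recovery.timer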
Health Check Watchdog
Monitor the inference endpoint and trigger recovery when it stops responding:
#!/bin/bash
# /opt/scripts/inference-watchdog.sh
ENDPOINT="http://127.0.0.1:8000/v1/models"
SERVICE="vllm-inference"
MAX_FAILURES=3
FAIL_COUNT=0
CHECK_INTERVAL=30
STARTUP_GRACE=180  # seconds to allow for (re)loading a large model

sleep "$STARTUP_GRACE"  # don't probe while the model is still loading
while true; do
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        --max-time 10 "$ENDPOINT" 2>/dev/null)
    if [ "$HTTP_CODE" -ne 200 ]; then
        FAIL_COUNT=$((FAIL_COUNT + 1))
        echo "$(date): Health check failed ($HTTP_CODE) — failure $FAIL_COUNT/$MAX_FAILURES"
        if [ "$FAIL_COUNT" -ge "$MAX_FAILURES" ]; then
            echo "$(date): Max failures reached. Restarting $SERVICE..."
            logger -p user.err -t "inference-watchdog" \
                "Restarting $SERVICE after $MAX_FAILURES consecutive failures"
            # Check GPU health first
            if ! nvidia-smi > /dev/null 2>&1; then
                /opt/scripts/gpu-reset-recovery.sh
                sleep 10
            fi
            sudo systemctl restart "$SERVICE"
            FAIL_COUNT=0
            sleep "$STARTUP_GRACE"  # grace period for model loading
        fi
    else
        if [ "$FAIL_COUNT" -gt 0 ]; then
            echo "$(date): Service recovered after $FAIL_COUNT failures"
        fi
        FAIL_COUNT=0
    fi
    sleep "$CHECK_INTERVAL"
done
# Run watchdog as a service
# /etc/systemd/system/inference-watchdog.service
[Unit]
Description=Inference Health Watchdog
After=vllm-inference.service
[Service]
Type=simple
ExecStart=/opt/scripts/inference-watchdog.sh
Restart=always
[Install]
WantedBy=multi-user.target
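Enable the watchdog like any other unit:
sudo chmod +x /opt/scripts/inference-watchdog.sh
sudo systemctl daemon-reload
sudo systemctl enable --now inference-watchdog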
OOM Kill Recovery
# Monitor OOM events
journalctl -k | grep -i "oom\|killed process" | tail -10
# Prevent inference from being OOM-killed
# Add to the service file
OOMScoreAdjust=-900
# Set up cgroup memory limits instead of relying on OOM killer
# /etc/systemd/system/vllm-inference.service.d/memory.conf
[Service]
MemoryMax=60G
MemoryHigh=55G
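After creating the drop-in, reload and verify that systemd actually applied the limits:
sudo systemctl daemon-reload
sudo systemctl restart vllm-inference
systemctl show vllm-inference -p MemoryMax -p MemoryHigh -p OOMScoreAdjust
# Expect the byte equivalents of 60G and 55G, plus OOMScoreAdjust=-900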
# Auto-cleanup GPU memory after OOM
cat <<'EOF' > /opt/scripts/gpu-mem-cleanup.py
import os
import subprocess

# Ask nvidia-smi which PIDs it believes are holding GPU memory
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True
)
for line in result.stdout.strip().split("\n"):
    if not line:
        continue
    pid, mem = [field.strip() for field in line.split(",")]
    try:
        os.kill(int(pid), 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        # The PID is dead but its allocation is still listed: the memory
        # is orphaned, and only a GPU reset or driver reload reclaims it
        print(f"Orphaned GPU memory: dead PID {pid} held {mem} MiB; GPU reset needed")
        continue
    except PermissionError:
        pass  # process exists but belongs to another user; run as root to kill it
    # A live leftover process (e.g. a worker that outlived an OOM-killed
    # parent) still holds GPU memory; kill it so the restarted service
    # can allocate that memory again
    print(f"Killing stale GPU process {pid} ({mem} MiB)")
    os.kill(int(pid), 9)
EOF
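Run the cleanup only while the inference service is stopped, so a healthy server never appears on the kill list, for example:
sudo systemctl stop vllm-inference
sudo python3 /opt/scripts/gpu-mem-cleanup.py
sudo systemctl start vllm-inference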
Test Your Recovery Pipeline
# Simulate an inference crash
sudo systemctl kill --signal=SIGKILL vllm-inference
# Watch automatic recovery
journalctl -u vllm-inference -f
# Verify recovery time
START=$(date +%s)
sudo systemctl kill --signal=SIGKILL vllm-inference
while ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; do
    sleep 1
done
END=$(date +%s)
echo "Recovery time: $((END - START)) seconds"
Automated recovery keeps inference on your GPU server online through GPU faults, OOM events, and process crashes. For hardening the service itself, start from our vLLM production deployment guide; the same watchdog pattern works for Ollama. Pair recovery with our monitoring setup so failures are recorded, not just healed.
Always-On AI Infrastructure
GigaGPU dedicated GPU servers with full root access and IPMI. Build self-healing inference pipelines that recover automatically.
Browse GPU Servers