
Auto GPU Recovery After Crashes

Automate GPU server recovery after crashes. Covers systemd restart policies, GPU reset procedures, watchdog scripts, health checks, OOM recovery, and ensuring AI inference stays online on dedicated servers.

Your Inference API Died at 2 AM and Nobody Noticed

A GPU Xid error crashed the CUDA context at 2 AM. The vLLM process exited, the systemd service did not restart because of a misconfigured restart policy, and customers hit errors for six hours until someone checked Slack. GPU servers encounter CUDA faults, OOM kills, driver hangs, and hardware glitches that are unavoidable over long uptimes. The only defense is automated recovery that brings inference back online within seconds. Every dedicated GPU server running production AI needs a layered recovery strategy.

Systemd Restart Policies for Inference

Proper systemd configuration handles most crash scenarios automatically:

# /etc/systemd/system/vllm-inference.service
[Unit]
Description=vLLM Inference Server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target
# Rate-limit restarts: 5 attempts in 300 s, then stop
# (StartLimit* settings belong in [Unit] on current systemd)
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=inference
ExecStart=/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
    --model /opt/models/llama-3-70b --port 8000

# Restart on ANY exit (crash, signal, OOM, etc.)
Restart=always
RestartSec=10

# If it crashes 5 times in 300 seconds, stop trying
# but a separate watchdog can reset and re-enable

# OOM protection: kill this service last
OOMScoreAdjust=-900

# Environment
Environment=CUDA_VISIBLE_DEVICES=0,1
Environment=CUDA_HOME=/usr/local/cuda

# Resource limits
LimitNOFILE=65535
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target

# Reload and enable
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-inference
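Before enabling the unit, it is worth a quick lint that the restart directives actually made it into the file, since a silent typo here is exactly the failure mode from the 2 AM story. A minimal sketch (`check_restart_policy` is a hypothetical helper, not a systemd tool):

```shell
# Sketch: fail fast if a unit file lacks the restart directives above.
# check_restart_policy is a hypothetical helper, not part of systemd.
check_restart_policy() {
    grep -q '^Restart=always' "$1" && grep -q '^RestartSec=' "$1"
}

# Demo against a throwaway file; point it at your real unit in practice
tmp=$(mktemp)
printf '[Service]\nRestart=always\nRestartSec=10\n' > "$tmp"
check_restart_policy "$tmp" && echo "restart policy OK"
rm -f "$tmp"
```

Run it against `/etc/systemd/system/vllm-inference.service` in a deploy script so a unit that will not auto-restart never reaches production.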

Automated GPU Reset After Faults

Some GPU failures require a device reset before processes can use the GPU again:

#!/bin/bash
# /opt/scripts/gpu-reset-recovery.sh
# Attempts to recover GPUs from fault states

LOG="/var/log/gpu-recovery.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

check_gpu_health() {
    # Returns 0 if the driver can enumerate GPUs, non-zero otherwise
    # (output is discarded; only the exit status matters)
    nvidia-smi --query-gpu=index,gpu_bus_id --format=csv,noheader \
        > /dev/null 2>&1
}

reset_gpu() {
    local GPU_ID=$1
    echo "$TIMESTAMP: Resetting GPU $GPU_ID" | tee -a "$LOG"

    # Kill any processes using this GPU
    nvidia-smi --id=$GPU_ID --query-compute-apps=pid --format=csv,noheader | \
        xargs -r kill -9 2>/dev/null

    sleep 2

    # Attempt GPU reset (requires root, fails if any process still holds
    # the GPU, and is unsupported on some consumer cards)
    nvidia-smi --id=$GPU_ID -r 2>/dev/null
    local RESULT=$?

    if [ $RESULT -eq 0 ]; then
        echo "$TIMESTAMP: GPU $GPU_ID reset successful" | tee -a "$LOG"
    else
        echo "$TIMESTAMP: GPU $GPU_ID reset FAILED — may need server reboot" | tee -a "$LOG"
    fi
    return $RESULT
}

# Check if nvidia-smi itself is responsive
if ! nvidia-smi > /dev/null 2>&1; then
    echo "$TIMESTAMP: nvidia-smi unresponsive — GPU driver may be hung" | tee -a "$LOG"
    # Attempt driver reload (last resort before reboot; rmmod will fail
    # if any process still has the module open)
    sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia 2>/dev/null
    sleep 3
    sudo modprobe nvidia
    sleep 2

    if ! nvidia-smi > /dev/null 2>&1; then
        echo "$TIMESTAMP: Driver reload failed — initiating reboot" | tee -a "$LOG"
        sudo reboot
    fi
fi

# Check recent kernel messages for Xid errors (note: this re-reads the
# same lines on every run; in production, track the last-seen timestamp
# or watch `dmesg --follow` from a persistent process)
dmesg | grep -i "xid" | tail -5 | while read -r line; do
    # NVRM Xid lines report a PCI bus ID rather than a "GPU N" index,
    # so fall back to GPU 0 when no index can be extracted
    GPU_ID=$(echo "$line" | grep -oP 'GPU \K[0-9]+' || echo "0")
    echo "$TIMESTAMP: Xid error detected on GPU $GPU_ID" | tee -a "$LOG"
    reset_gpu "$GPU_ID"
done
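One failure mode this script can create itself is a reset loop: a GPU with a genuine hardware fault resets cleanly, faults again, and resets forever. A small guard caps how often resets fire. This is a sketch; `allow_reset` and the `$GPU_RESET_STATE` state file are assumptions, not standard tools:

```shell
# Sketch: cap GPU resets at 3 per hour so a dying card escalates to a
# human instead of looping. State file path is an assumption.
allow_reset() {
    state="${GPU_RESET_STATE:-/var/run/gpu-reset-count}"
    now=$(date +%s)
    # Drop reset timestamps older than one hour
    if [ -f "$state" ]; then
        awk -v now="$now" 'now - $1 < 3600' "$state" > "$state.tmp" &&
            mv "$state.tmp" "$state"
    fi
    count=0
    [ -f "$state" ] && count=$(wc -l < "$state")
    if [ "$count" -ge 3 ]; then
        return 1    # 3 resets in the past hour: stop and page a human
    fi
    echo "$now" >> "$state"
}

# Usage inside reset_gpu, before the nvidia-smi -r call:
# allow_reset || { echo "$TIMESTAMP: reset budget exhausted" | tee -a "$LOG"; return 1; }
```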

Health Check Watchdog

Monitor the inference endpoint and trigger recovery when it stops responding:

#!/bin/bash
# /opt/scripts/inference-watchdog.sh
ENDPOINT="http://127.0.0.1:8000/v1/models"
SERVICE="vllm-inference"
MAX_FAILURES=3
FAIL_COUNT=0
CHECK_INTERVAL=30

while true; do
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        --max-time 10 "$ENDPOINT" 2>/dev/null)

    if [ "$HTTP_CODE" -ne 200 ]; then
        FAIL_COUNT=$((FAIL_COUNT + 1))
        echo "$(date): Health check failed ($HTTP_CODE) — failure $FAIL_COUNT/$MAX_FAILURES"

        if [ "$FAIL_COUNT" -ge "$MAX_FAILURES" ]; then
            echo "$(date): Max failures reached. Restarting $SERVICE..."
            logger -p user.err -t "inference-watchdog" \
                "Restarting $SERVICE after $MAX_FAILURES consecutive failures"

            # Check GPU health first
            if ! nvidia-smi > /dev/null 2>&1; then
                /opt/scripts/gpu-reset-recovery.sh
                sleep 10
            fi

            sudo systemctl restart "$SERVICE"
            FAIL_COUNT=0
            sleep 60  # Grace period for model loading
        fi
    else
        if [ "$FAIL_COUNT" -gt 0 ]; then
            echo "$(date): Service recovered after $FAIL_COUNT failures"
        fi
        FAIL_COUNT=0
    fi
    sleep "$CHECK_INTERVAL"
done

# Run watchdog as a service
# /etc/systemd/system/inference-watchdog.service
[Unit]
Description=Inference Health Watchdog
After=vllm-inference.service

[Service]
Type=simple
ExecStart=/opt/scripts/inference-watchdog.sh
Restart=always

[Install]
WantedBy=multi-user.target
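An HTTP 200 from /v1/models proves the API process is alive, but not that it is serving anything. A stricter check also inspects the response body; `healthy_body` below is a hypothetical helper that expects OpenAI-style model-list JSON:

```shell
# Sketch: validate the health-check body, not just the status code.
# A model listing with no entries means the server is up but serving
# nothing. healthy_body is a hypothetical helper.
healthy_body() {
    # Require at least one model id in the OpenAI-style listing
    echo "$1" | grep -q '"id"'
}

# Usage in the watchdog loop:
# BODY=$(curl -s --max-time 10 "$ENDPOINT")
# healthy_body "$BODY" || FAIL_COUNT=$((FAIL_COUNT + 1))
```

A `jq`-based check (`jq -e '.data | length > 0'`) is more robust if `jq` is installed; the `grep` version avoids the dependency.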

OOM Kill Recovery

# Monitor OOM events
journalctl -k | grep -i "oom\|killed process" | tail -10

# Prevent inference from being OOM-killed
# Add to the service file
OOMScoreAdjust=-900

# Set up cgroup memory limits instead of relying on OOM killer
# /etc/systemd/system/vllm-inference.service.d/memory.conf
[Service]
MemoryMax=60G
MemoryHigh=55G

# Report leaked GPU memory after an OOM kill or crash
cat <<'EOF' > /opt/scripts/gpu-mem-cleanup.py
import os
import subprocess

# Ask the driver which PIDs it thinks are holding GPU memory
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True
)
for line in result.stdout.strip().split("\n"):
    if not line:
        continue
    pid, mem = [field.strip() for field in line.split(",")]
    try:
        os.kill(int(pid), 0)  # signal 0 checks existence without killing
    except ProcessLookupError:
        # The driver still charges memory to a dead PID: a leak that
        # only a GPU reset (nvidia-smi -r) or a reboot will reclaim
        print(f"Leaked GPU memory: dead PID {pid} held {mem} MiB")
    except PermissionError:
        pass  # PID exists but belongs to another user, so it is alive
EOF
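To run the cleanup automatically rather than by hand, a systemd OnFailure hook can trigger it whenever the inference service fails. The unit name and paths here are assumptions:

```ini
# /etc/systemd/system/gpu-mem-cleanup.service  (hypothetical unit)
[Unit]
Description=Report GPU memory held by dead processes

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/scripts/gpu-mem-cleanup.py
```

Then add `OnFailure=gpu-mem-cleanup.service` to the `[Unit]` section of `vllm-inference.service` and run `sudo systemctl daemon-reload`.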

Test Your Recovery Pipeline

# Simulate an inference crash
sudo systemctl kill --signal=SIGKILL vllm-inference

# Watch automatic recovery
journalctl -u vllm-inference -f

# Verify recovery time
START=$(date +%s)
sudo systemctl kill --signal=SIGKILL vllm-inference
while ! curl -s http://localhost:8000/v1/models > /dev/null 2>&1; do
    sleep 1
done
END=$(date +%s)
echo "Recovery time: $((END - START)) seconds"
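The timing loop above can become a pass/fail gate for regular chaos drills. This sketch assumes a `recovery_within` helper and a recovery budget in seconds; both are illustrative, not part of any tool:

```shell
# Sketch: return 0 if the endpoint answers within $1 seconds, 1 if the
# budget is exceeded. Endpoint defaults to the vLLM port used above.
recovery_within() {
    budget=$1
    url=${2:-http://localhost:8000/v1/models}
    start=$(date +%s)
    until curl -s --max-time 2 "$url" > /dev/null 2>&1; do
        if [ $(( $(date +%s) - start )) -ge "$budget" ]; then
            return 1    # recovery budget exceeded
        fi
        sleep 1
    done
}

# Drill usage (120 s budget is an assumption; tune to your model's load time):
# sudo systemctl kill --signal=SIGKILL vllm-inference
# recovery_within 120 || echo "recovery SLA missed"
```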

Automated recovery keeps your GPU server inference online through GPU faults, OOM events, and process crashes. Pair it with our vLLM production deployment guide for robust service configuration, apply the same pattern to Ollama services, and track failures with our monitoring setup. For more, browse our infrastructure guides, tutorials, and benchmarks.

Always-On AI Infrastructure

GigaGPU dedicated GPU servers with full root access and IPMI. Build self-healing inference pipelines that recover automatically.

Browse GPU Servers
