Signal handling in vLLM matters during deployments on dedicated GPU hosting. SIGKILL ends requests instantly with client-visible errors. SIGTERM with proper handling lets in-flight requests finish before the process exits. A few systemd settings make the difference.
Signals
- SIGTERM (default systemd stop signal): vLLM stops accepting new requests, waits for in-flight requests to finish, then exits
- SIGKILL: instant termination, in-flight requests error out
- SIGINT (Ctrl-C): behaves like SIGTERM
systemd's default stop timeout (TimeoutStopSec) is 90 seconds before it escalates from SIGTERM to SIGKILL. A 70B model generating long responses can exceed this.
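The graceful-shutdown pattern behind SIGTERM handling can be sketched in a few lines. This is an illustration of the general technique, not vLLM's actual handler:

```python
# Minimal sketch of SIGTERM-driven graceful shutdown: set a flag,
# stop accepting new work, let in-flight work finish, then exit.
import signal
import threading

shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # New requests are rejected once this is set; in-flight work
    # runs to completion before the process exits.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)
```

Sending SIGTERM (via `systemctl stop` or `kill -TERM <pid>`) trips the flag instead of killing the process outright, which is what gives SIGTERM its drain semantics.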
systemd Unit
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
User=vllm
WorkingDirectory=/opt/vllm
ExecStart=/opt/vllm/bin/python -m vllm.entrypoints.openai.api_server --model ...
Restart=on-failure
RestartSec=5s
KillSignal=SIGTERM
TimeoutStopSec=300
KillMode=mixed
[Install]
WantedBy=multi-user.target
TimeoutStopSec=300 gives in-flight requests 5 minutes to finish. KillMode=mixed sends SIGTERM to the main process only, then SIGKILL to any remaining processes in the unit's cgroup once the timeout expires, which matters when vLLM spawns worker subprocesses for tensor parallelism.
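If you need a longer timeout on one host without editing the packaged unit, a systemd drop-in is the idiomatic way; the path below assumes the unit is named vllm.service:

```
# /etc/systemd/system/vllm.service.d/override.conf
[Service]
TimeoutStopSec=600
```

Run `systemctl daemon-reload` afterwards, and confirm the effective value with `systemctl show vllm -p TimeoutStopUSec`.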
Drain
For a true zero-drop shutdown:
- Remove the replica from the load balancer upstream
- Wait 30-60 seconds for requests already routed to finish arriving
- Send SIGTERM
- Wait for process exit
Automate via a pre-stop script.
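A pre-stop script following the steps above might look like this. The load balancer admin endpoint and its drain path are assumptions; adapt them to your proxy's actual API:

```python
# Hedged sketch of the drain sequence: remove the replica from the LB,
# wait for routed requests to arrive, then let systemd send SIGTERM.
import subprocess
import time
import urllib.request

def drain_and_stop(lb_admin_url: str, replica: str, settle_seconds: int = 45) -> None:
    # 1. Remove the replica from the load balancer upstream
    #    (this drain endpoint is hypothetical).
    req = urllib.request.Request(
        f"{lb_admin_url}/upstreams/{replica}/drain", method="POST"
    )
    urllib.request.urlopen(req)
    # 2. Wait for requests already routed to this replica to finish arriving.
    time.sleep(settle_seconds)
    # 3. systemctl stop sends SIGTERM and waits up to TimeoutStopSec.
    subprocess.run(["systemctl", "stop", "vllm"], check=True)
```

Wiring this in as an `ExecStop=` prefix or a deploy-tool hook keeps the drain automatic rather than a manual runbook step.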
Verify
Run a load test while triggering a restart:
systemctl restart vllm
Check client logs: zero 5xx errors during the restart means the drain works and your timeout is long enough.
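The check can be scripted: probe the server in a loop while the restart runs and count client-visible failures. The URL is an assumption; point it at your replica's health endpoint or a lightweight completion request:

```python
# Count client-visible errors while restarting the unit. Connection
# refusals and 5xx responses both count as failures.
import urllib.error
import urllib.request

def probe(url: str) -> int:
    """Return the HTTP status for one request, or 0 on connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except OSError:
        return 0

def count_failures(statuses) -> int:
    return sum(1 for s in statuses if s == 0 or s >= 500)
```

Run the probe loop, trigger `systemctl restart vllm` from another shell, and expect `count_failures` to report zero when drain and timeout are configured correctly.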
Production-Grade vLLM Hosting
UK dedicated GPUs with systemd units, timeouts, and signal handling preconfigured.
Browse GPU Servers. See also: rolling upgrades and systemd for AI inference.