
Graceful Shutdown of vLLM in Production

Killing a vLLM process drops in-flight requests. Handling SIGTERM properly lets requests finish before the process exits.

Signal handling matters every time you redeploy vLLM on a dedicated GPU host. SIGKILL ends in-flight requests instantly with client-visible errors; SIGTERM, handled properly, lets them finish before the process exits. A few systemd settings make the difference.

Signals

  • SIGTERM (default systemd stop signal): vLLM stops accepting new requests, waits for in-flight requests to finish, then exits
  • SIGKILL: instant termination, in-flight requests error out
  • SIGINT (Ctrl-C): behaves like SIGTERM

The default systemd timeout is 90 seconds before escalating from SIGTERM to SIGKILL. A 70B model generating long responses can exceed this.

systemd Unit

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
User=vllm
WorkingDirectory=/opt/vllm
ExecStart=/opt/vllm/bin/python -m vllm.entrypoints.openai.api_server --model ...
Restart=on-failure
RestartSec=5s

KillSignal=SIGTERM
TimeoutStopSec=300
KillMode=mixed

[Install]
WantedBy=multi-user.target

TimeoutStopSec=300 gives in-flight requests 5 minutes to finish. KillMode=mixed sends SIGTERM to the main process only, then SIGKILL to every process still in the unit's cgroup once the timeout expires.
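If the unit file came from a package or config management, set the stop timeout in a drop-in rather than editing the unit itself; `systemctl edit vllm` creates one. A sketch, where the 600-second value is just an example for very long generations:

```ini
# /etc/systemd/system/vllm.service.d/override.conf
# created via: systemctl edit vllm (a hand-written file needs
# systemctl daemon-reload afterwards)
[Service]
TimeoutStopSec=600
```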

Drain

For a true zero-drop shutdown:

  1. Remove the replica from the load balancer upstream
  2. Wait 30-60 seconds so requests already routed to this replica finish arriving
  3. Send SIGTERM
  4. Wait for process exit

Automate via a pre-stop script.
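A sketch of such a script; the load-balancer admin endpoint and replica name are hypothetical stand-ins for whatever drain or deregister mechanism your LB actually exposes:

```shell
#!/bin/sh
# Pre-stop drain sketch. The LB endpoint below is hypothetical;
# substitute your load balancer's own drain/deregister API.
drain_and_stop() {
  replica=$1
  settle=${2:-45}    # seconds to let already-routed requests arrive
  # 1. Remove the replica from the upstream pool
  curl -fsS -X POST "http://lb.internal/upstreams/vllm/${replica}/drain" || return 1
  # 2. Give requests already routed to this replica time to land
  sleep "$settle"
  # 3. SIGTERM via systemd; `systemctl stop` blocks until the unit exits
  systemctl stop vllm
}
```

Run it from your deploy tooling; the curl-and-sleep portion could also live in the unit's ExecStop= instead.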

Verify

Run a load test while triggering a restart:

systemctl restart vllm

Check client logs: no 5xx or connection errors during the restart window means your timeout is long enough.
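One way to script the check; the URL, JSON payload, and model name are examples for a default OpenAI-compatible deployment on port 8000:

```shell
#!/bin/sh
# Hammer the completions endpoint for a fixed window while the service
# restarts in another shell, then report client-visible failures.
check_restart_window() {
  url=$1
  seconds=${2:-60}
  fails=0; total=0
  end=$(( $(date +%s) + seconds ))
  while [ "$(date +%s)" -lt "$end" ]; do
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' \
      -H 'Content-Type: application/json' \
      -d '{"model":"served-model","prompt":"ping","max_tokens":8}' \
      "$url")
    total=$((total + 1))
    case "$code" in 2*) ;; *) fails=$((fails + 1)) ;; esac
  done
  echo "$fails/$total requests failed during the restart window"
}

# check_restart_window http://localhost:8000/v1/completions 60
```

Start it, run `systemctl restart vllm` in another shell, and a `0/N` result means the drain window covered the restart.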

Production-Grade vLLM Hosting

UK dedicated GPUs with systemd units, timeouts, and signal handling preconfigured.

Browse GPU Servers

See also: rolling upgrade and systemd for AI inference.


