
Graceful Shutdown of vLLM in Production

Killing a vLLM process drops in-flight requests. Handling SIGTERM properly lets requests finish before the process exits.

Signal handling matters every time you redeploy vLLM on a dedicated GPU host. SIGKILL ends in-flight requests instantly with client-visible errors; SIGTERM, handled properly, lets them finish before the process exits. A few systemd settings make the difference.

Signals

  • SIGTERM (default systemd stop signal): vLLM stops accepting new requests, waits for in-flight requests to finish, then exits
  • SIGKILL: instant termination, in-flight requests error out
  • SIGINT (Ctrl-C): behaves like SIGTERM

The default systemd timeout is 90 seconds before escalating from SIGTERM to SIGKILL. A 70B model generating long responses can exceed this.

systemd Unit

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
User=vllm
WorkingDirectory=/opt/vllm
ExecStart=/opt/vllm/bin/python -m vllm.entrypoints.openai.api_server --model ...
Restart=on-failure
RestartSec=5s

KillSignal=SIGTERM
TimeoutStopSec=300
KillMode=mixed

[Install]
WantedBy=multi-user.target

TimeoutStopSec=300 gives in-flight requests 5 minutes to finish. KillMode=mixed sends SIGTERM to the main process only, then SIGKILL to every process still in the unit's cgroup once the timeout expires.
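If the unit file came from a package or config management, set the stop timeout in a drop-in rather than editing the unit itself; `systemctl edit vllm` creates one. A sketch, where the 600-second value is just an example for very long generations:

```ini
# /etc/systemd/system/vllm.service.d/override.conf
# created via: systemctl edit vllm (a hand-written file needs
# systemctl daemon-reload afterwards)
[Service]
TimeoutStopSec=600
```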

Drain

For a true zero-drop shutdown:

  1. Remove the replica from the load balancer upstream
  2. Wait 30-60 seconds so requests already routed to this replica finish arriving
  3. Send SIGTERM
  4. Wait for process exit

Automate via a pre-stop script.
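A sketch of such a script; the load-balancer admin endpoint and replica name are hypothetical stand-ins for whatever drain or deregister mechanism your LB actually exposes:

```shell
#!/bin/sh
# Pre-stop drain sketch. The LB endpoint below is hypothetical;
# substitute your load balancer's own drain/deregister API.
drain_and_stop() {
  replica=$1
  settle=${2:-45}    # seconds to let already-routed requests arrive
  # 1. Remove the replica from the upstream pool
  curl -fsS -X POST "http://lb.internal/upstreams/vllm/${replica}/drain" || return 1
  # 2. Give requests already routed to this replica time to land
  sleep "$settle"
  # 3. SIGTERM via systemd; `systemctl stop` blocks until the unit exits
  systemctl stop vllm
}
```

Run it from your deploy tooling; the curl-and-sleep portion could also live in the unit's ExecStop= instead.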

Verify

Run a load test while triggering a restart:

systemctl restart vllm

Check client logs: no 5xx or connection errors during the restart window means your timeout is long enough.
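One way to script the check; the URL, JSON payload, and model name are examples for a default OpenAI-compatible deployment on port 8000:

```shell
#!/bin/sh
# Hammer the completions endpoint for a fixed window while the service
# restarts in another shell, then report client-visible failures.
check_restart_window() {
  url=$1
  seconds=${2:-60}
  fails=0; total=0
  end=$(( $(date +%s) + seconds ))
  while [ "$(date +%s)" -lt "$end" ]; do
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' \
      -H 'Content-Type: application/json' \
      -d '{"model":"served-model","prompt":"ping","max_tokens":8}' \
      "$url")
    total=$((total + 1))
    case "$code" in 2*) ;; *) fails=$((fails + 1)) ;; esac
  done
  echo "$fails/$total requests failed during the restart window"
}

# check_restart_window http://localhost:8000/v1/completions 60
```

Start it, run `systemctl restart vllm` in another shell, and a `0/N` result means the drain window covered the restart.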

Production-Grade vLLM Hosting

UK dedicated GPUs with systemd units, timeouts, and signal handling preconfigured.

Browse GPU Servers

See also: rolling upgrade and systemd for AI inference.


