Print this. Tick the boxes before going live.
Five categories, roughly 30 checkboxes. Skipping any one of them creates an incident waiting to happen; most teams miss 3-5 items on first launch.
Hardware
- ☐ GPU sized for the largest model you'll run in 6 months
- ☐ Sufficient VRAM headroom for KV cache (2-8 GB depending on context)
- ☐ FP8 hardware path if running modern open-weight models
- ☐ Single-tenant bare-metal (not multi-tenant cloud)
- ☐ Datacenter-grade cooling (not consumer chassis in a closet)
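The VRAM headroom item is easy to estimate rather than guess. A minimal sketch of KV-cache sizing (the standard 2 × layers × KV-heads × head-dim × context × bytes formula; the model shape below is an illustrative Llama-3-70B-like configuration, not a measured figure):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in GiB.

    Factor of 2 covers keys and values.
    bytes_per_elem: 2 for FP16/BF16, 1 for an FP8 KV cache.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * batch_size * bytes_per_elem)
    return total_bytes / 2**30

# 70B-class shape with GQA: 80 layers, 8 KV heads, head_dim 128,
# FP16 KV cache at 32k context -> 10.0 GiB for a single sequence
print(round(kv_cache_gib(80, 8, 128, 32768), 1))
```

Halving `bytes_per_elem` is exactly why the FP8 KV cache checkbox appears under Software below: it doubles the context you can hold in the same headroom.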
Software
- ☐ Ubuntu 22.04 LTS pinned
- ☐ NVIDIA driver pinned (e.g., 555.42)
- ☐ CUDA toolkit pinned
- ☐ vLLM pinned (e.g., 0.6.3)
- ☐ Model commit SHA pinned (not tag)
- ☐ `--enable-prefix-caching` on
- ☐ FP8 quantisation enabled
- ☐ FP8 KV cache enabled if memory-tight
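Pulled together, the pins above translate into a launch command along these lines (a sketch: the model name, SHA placeholder, and port are illustrative; verify flag names against the docs for your pinned vLLM version):

```bash
# Pinned vLLM invocation — revision is a commit SHA, never a mutable tag.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --revision <commit-sha> \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-model-len 32768 \
  --port 8000
```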
Operations
- ☐ systemd unit for vLLM with Restart=on-failure
- ☐ Prometheus + DCGM exporter scraping
- ☐ Grafana dashboard (TTFT, queue depth, GPU mem)
- ☐ Alerts on p99 TTFT, queue depth, GPU mem util
- ☐ Structured request logs to SIEM
- ☐ On-call runbook documented
- ☐ Backup / restore tested
- ☐ LiteLLM in front for auth + rate limiting
- ☐ Caddy / Cloudflare for TLS
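The systemd item above can be sketched as a unit file like the following (paths, user, and the `ExecStart` line are assumptions to adapt to your host, not a reference implementation):

```ini
# /etc/systemd/system/vllm.service — minimal sketch
[Unit]
Description=vLLM inference server
After=network-online.target
Wants=network-online.target

[Service]
User=vllm
ExecStart=/opt/vllm/bin/vllm serve <model> --port 8000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` restarts the service on non-zero exits and crashes but not on clean stops, which is usually what you want for an inference daemon.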
Compliance
- ☐ DPA signed with hosting provider
- ☐ DPIA completed if processing personal data
- ☐ Sub-processor list documented
- ☐ Retention policy defined for prompts/responses
- ☐ Privacy notice updated to disclose AI processing
Evaluation
- ☐ Eval harness with 200-prompt gold set
- ☐ LLM-judge scoring set up
- ☐ Baseline scores recorded
- ☐ CI integration for model upgrades
- ☐ Regression alert threshold (e.g., >3%)
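The regression threshold reduces to a one-line comparison that CI can run after scoring a candidate model against the recorded baseline. A minimal sketch (function name and the example scores are illustrative):

```python
def regression_alert(baseline, current, threshold_pct=3.0):
    """Return True if the candidate's score dropped more than
    threshold_pct relative to the recorded baseline
    (e.g. mean LLM-judge score on the 200-prompt gold set)."""
    drop_pct = (baseline - current) / baseline * 100
    return drop_pct > threshold_pct

print(regression_alert(0.82, 0.78))  # ~4.9% drop -> True, block the upgrade
print(regression_alert(0.82, 0.81))  # ~1.2% drop -> False, within tolerance
```

Wiring this into the model-upgrade CI job means a silent quality regression fails the pipeline instead of reaching production.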
Bottom line
The boring items are the ones that bite. Tick every box. See "build a production AI inference server" and "enterprise AI architecture checklist".