For private or self-hosted LLM deployments, the gap between "works on my laptop" and "production-ready" is made up of dozens of small things. Use this checklist as the reference; missing items cause real production incidents.
Pre-deploy: model pinned to commit SHA; eval harness covers representative queries; load test passes. Deploy: vLLM + nginx + auth + structured logging + DCGM monitoring. Post-deploy: alerts configured; backup + recovery tested; eval harness runs on changes; security review. Compliance: audit logs, encryption at rest + in transit, RBAC, retention policies.
Pre-deploy
- Model checkpoint pinned to commit SHA (not tag)
- Eval harness with 200-500 representative queries built and passing
- Load test results verify capacity at target concurrency
- System prompt versioned; templates pulled from config not hardcoded
- VRAM budget verified at peak load (not just steady-state)
- FP8 / AWQ / quantisation choice validated against eval harness
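The peak-load VRAM item comes down to arithmetic: weights plus KV cache at the worst-case number of in-flight tokens, plus some headroom. Below is a rough Python sketch of that estimate, assuming a standard transformer KV-cache layout (one key and one value vector per layer per token). The model shape, weight size, concurrency, and overhead figures are illustrative placeholders, not measurements; the function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def peak_vram_gib(
    weights_gib: float,
    kv_per_token_bytes: int,
    peak_concurrency: int,
    max_tokens_per_request: int,
    overhead_gib: float = 2.0,  # assumed headroom for activations, CUDA context, etc.
) -> float:
    """Worst-case VRAM estimate: every concurrent request at its full context length."""
    kv_gib = kv_per_token_bytes * peak_concurrency * max_tokens_per_request / 2**30
    return weights_gib + kv_gib + overhead_gib


if __name__ == "__main__":
    # Hypothetical 8B-class model with grouped-query attention and a 1-byte (FP8) KV cache.
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=1)
    need = peak_vram_gib(
        weights_gib=8.0,
        kv_per_token_bytes=per_token,
        peak_concurrency=64,
        max_tokens_per_request=4096,
    )
    print(f"{per_token} bytes/token, {need:.1f} GiB at peak")
```

With these placeholder numbers the KV cache alone needs 16 GiB at peak, twice the assumed weight footprint, which is why a steady-state reading can look comfortable while peak load does not fit.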
Deploy
- vLLM with --enable-prefix-caching + --kv-cache-dtype fp8_e5m2
- nginx reverse proxy with TLS 1.2+ (1.3 preferred)
- API-key auth (per-tenant if multi-tenant)
- Per-key rate limiting
- Streaming SSE config: proxy_buffering off
- Timeouts: client > nginx > vLLM hierarchy
- systemd service with Restart=on-failure
- Graceful shutdown (drain in-flight requests)
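Per-key rate limiting can be enforced at the proxy or in an application layer in front of vLLM. As one possible illustration, here is a minimal in-process token-bucket limiter in Python. The token bucket is a common choice, not something this checklist prescribes, and the class name, parameters, and injectable clock are all hypothetical. A multi-process deployment would need shared state (for example in Redis) instead of a local dict.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float   # requests currently available to this key
    updated: float  # clock reading at the last refill


class PerKeyRateLimiter:
    """Token bucket per API key: `rate_per_sec` sustained, `burst` maximum."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: Dict[str, _Bucket] = {}

    def allow(self, api_key: str) -> bool:
        """Return True and consume one token if the key is under its limit."""
        now = self._clock()
        bucket = self._buckets.get(api_key)
        if bucket is None:
            # A new key starts with a full bucket.
            bucket = _Bucket(tokens=self._burst, updated=now)
            self._buckets[api_key] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self._burst, bucket.tokens + elapsed * self._rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    limiter = PerKeyRateLimiter(rate_per_sec=1.0, burst=2)
    print([limiter.allow("tenant-a") for _ in range(3)])
```

A request that gets False back would typically be answered with HTTP 429 so that well-behaved clients back off.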
Post-deploy
- DCGM Exporter + Prometheus + Grafana running
- Structured JSON logs shipped to Loki / Elasticsearch / Postgres
- Alerts configured: GPU temp > 82°C, p99 TTFT > SLO, queue depth > threshold, error rate
- Backup tested: model weights, vector store, configs
- Recovery procedure documented + practiced
- Eval harness runs in CI on every model / prompt change
- Security review: dependency scan, OS patches, network exposure audit
- Cost monitoring: track £/M tokens vs budget
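The alert conditions above would normally live in Prometheus alerting rules, but the logic is easy to check in isolation. The Python sketch below evaluates the four conditions against one window of metrics. Only the 82°C GPU limit comes from the checklist; the function names, the nearest-rank percentile method, and every other threshold are assumptions chosen for illustration.

```python
from typing import List, Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(n * pct / 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # integer ceiling division
    return ordered[max(rank, 1) - 1]


def evaluate_alerts(
    *,
    gpu_temp_c: float,
    ttft_samples_s: Sequence[float],
    queue_depth: int,
    errors: int,
    requests: int,
    ttft_slo_s: float,          # assumed SLO for time to first token
    queue_limit: int,           # assumed queue-depth threshold
    max_error_rate: float,      # assumed error-rate threshold (0.05 means 5%)
    gpu_temp_limit_c: float = 82.0,
) -> List[str]:
    """Return the names of the alert conditions that fire for this window."""
    fired: List[str] = []
    if gpu_temp_c > gpu_temp_limit_c:
        fired.append("gpu_temp")
    if ttft_samples_s and percentile_nearest_rank(ttft_samples_s, 99) > ttft_slo_s:
        fired.append("p99_ttft")
    if queue_depth > queue_limit:
        fired.append("queue_depth")
    if requests > 0 and errors / requests > max_error_rate:
        fired.append("error_rate")
    return fired


if __name__ == "__main__":
    window = [0.2] * 98 + [1.5, 3.0]  # synthetic TTFT samples in seconds
    print(evaluate_alerts(
        gpu_temp_c=85.0, ttft_samples_s=window, queue_depth=5,
        errors=1, requests=100,
        ttft_slo_s=1.0, queue_limit=10, max_error_rate=0.05,
    ))
```

Note that the synthetic window has a low median but still breaches the p99 threshold, which is the reason to alert on tail latency and not on the average.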
Compliance
- Audit logs retained per regulatory requirements (typically 90 days hot, 7 years cold for finance / health)
- Encryption at rest (filesystem-level or ZFS native)
- Encryption in transit (TLS 1.2+)
- RBAC enforced for admin access; MFA required (phishing-resistant where your compliance framework calls for it)
- Data retention + deletion policies (GDPR right to erasure)
- Vendor management: any external API providers reviewed and documented
- Incident response plan: runbook for security incidents, model drift, capacity issues
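The hot/cold audit-log split can be expressed as a small age-based rule that a retention job applies to each log segment. The Python sketch below uses the 90-day and 7-year figures from the list as defaults and approximates a year as 365 days; the function name and tier labels are hypothetical, and the actual periods depend on the regulation that applies to you.

```python
from datetime import datetime, timedelta, timezone


def retention_tier(
    created_at: datetime,
    now: datetime,
    hot_days: int = 90,
    cold_days: int = 7 * 365,  # approximation of 7 years; ignores leap days
) -> str:
    """Classify an audit-log segment as 'hot', 'cold', or 'expired' by age."""
    age = now - created_at
    if age <= timedelta(days=hot_days):
        return "hot"      # keep in the queryable store
    if age <= timedelta(days=cold_days):
        return "cold"     # move to archival storage
    return "expired"      # eligible for deletion under the policy


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (10, 400, 3000):
        print(days, retention_tier(now - timedelta(days=days), now))
```

Deletion requests under a right-to-erasure policy are a separate path and would not wait for a segment to reach the expired tier.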
Verdict
This checklist is the difference between a deployment that works and one that survives a production incident. Most items take roughly 30-60 minutes each; together they add up to a few days of focused work. Skipped checklist items are a common cause of preventable AI production incidents.
Bottom line
Use this checklist before going live. See OpenAI-compatible API guide and structured logging.