Soak testing (sustained load over 24-72 hours) catches issues that short load tests miss: slow memory leaks, gradual thermal throttling, KV cache fragmentation, and log volume bottlenecks. For production AI deployments, run a soak test before launch; if you skip it, the first week of production traffic becomes the soak test.
Run synthetic production-like traffic at 70% of capacity for 24-72 hours. Watch for: GPU memory drift, thermal throttling onset, p99 latency degradation over time, log volume / disk fill, error rate accumulation. Resolve any drift before user-facing launch. This is standard SRE practice, and it matters more for AI workloads because of GPU thermal characteristics.
Why soak
Issues that short load tests don't catch:
- Memory leaks: vLLM / Python / CUDA leaks over hours
- Thermal accumulation: GPU temp climbs over 30+ minutes; eventual throttling
- KV cache fragmentation: gradual buildup affects performance
- Log volume disk fill: structured logs at full volume can fill disks faster than expected
- Connection pool exhaustion: PostgreSQL / Redis connections leak under load
- Cron / scheduled job interaction: nightly jobs vs sustained load
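The log volume item above is simple arithmetic, and it is worth doing before the soak starts. The sketch below is a minimal estimate that assumes a constant request rate and a fixed average log size per request; the function name and the example numbers are illustrative, not measurements from any real deployment.

```python
def hours_until_disk_full(free_bytes: float,
                          bytes_per_request: float,
                          requests_per_second: float) -> float:
    """Hours until log writes consume free_bytes at a constant request rate."""
    bytes_per_hour = bytes_per_request * requests_per_second * 3600
    if bytes_per_hour <= 0:
        # No log traffic means the disk never fills from logs alone.
        return float("inf")
    return free_bytes / bytes_per_hour


# Illustrative inputs only: 200 GiB free, 2 KiB of structured logs
# per request, 50 requests per second.
hours = hours_until_disk_full(200 * 1024**3, 2 * 1024, 50)
print(f"disk full in about {hours / 24:.1f} days")
```

If the projected time is shorter than the log retention window, the soak run will likely hit the limit before production does, which is the point of running it.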
Setup
- Synthetic traffic generator: k6, Locust, or custom Python with realistic prompt distribution
- Target: 70% of expected peak production load
- Duration: at least 24 hours, ideally up to 72; a weekend run is convenient
- Log aggregation captures all metrics during run
- Alert on degradation thresholds during soak
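For the custom Python option, one way to get realistic pacing is to pre-compute a send schedule with Poisson arrivals at 70% of expected peak and a weighted mix of prompt types. This is a sketch under assumptions: `build_schedule`, the prompt class names, and their weights are hypothetical placeholders, and the code that actually sends requests to the inference server is left out.

```python
import random


def build_schedule(peak_rps, duration_s, prompt_mix, load_fraction=0.7, seed=0):
    """Return a list of (send_time_seconds, prompt_class) tuples.

    Arrivals follow a Poisson process at load_fraction * peak_rps, so
    inter-arrival gaps are exponentially distributed rather than evenly spaced.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    rate = peak_rps * load_fraction
    classes = list(prompt_mix)
    weights = [prompt_mix[c] for c in classes]

    schedule = []
    t = rng.expovariate(rate)
    while t < duration_s:
        prompt_class = rng.choices(classes, weights=weights)[0]
        schedule.append((t, prompt_class))
        t += rng.expovariate(rate)
    return schedule


# Hypothetical prompt mix; replace with the distribution seen in real traffic.
mix = {"short_chat": 0.6, "long_context": 0.3, "batch_summarise": 0.1}
schedule = build_schedule(peak_rps=10, duration_s=1000, prompt_mix=mix)
print(len(schedule), "requests scheduled")
```

For a multi-day run the schedule would be generated in chunks or streamed rather than held in memory; k6 and Locust handle pacing themselves, so this only applies to the custom route.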
What to watch
- GPU memory: should be stable; drift indicates leak
- GPU temperature: stable steady-state expected; climb indicates cooling issue
- p99 TTFT / TPOT: stable over time
- Error rate: 0% baseline expected; drift indicates accumulating issue
- vLLM queue depth: bounded; sustained growth indicates capacity issue
- Disk usage: log volume sustainable for retention window
- Connection pool sizes: stable
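"Stable" in the list above can be checked mechanically by fitting a line to each metric over the run and flagging any trend that moves the metric by more than a chosen fraction of its mean. The sketch below uses a plain least-squares slope; the function names and the 5% threshold are assumptions for illustration, not a recommended standard.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, value) samples, in value units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def is_drifting(samples, max_change_fraction=0.05):
    """True if the fitted trend moves the metric by more than
    max_change_fraction of its mean over the sampled window."""
    times = [t for t, _ in samples]
    window_hours = max(times) - min(times)
    mean_v = sum(v for _, v in samples) / len(samples)
    projected_change = abs(slope_per_hour(samples) * window_hours)
    return projected_change > max_change_fraction * abs(mean_v)


# Synthetic examples: GPU memory in MiB sampled hourly for 24 hours.
stable = [(h, 40000.0) for h in range(24)]
leaking = [(h, 40000.0 + 100.0 * h) for h in range(24)]
print(is_drifting(stable), is_drifting(leaking))
```

The same check applies to p99 TTFT, queue depth, or connection counts. GPU memory and temperature samples could come from polling something like `nvidia-smi --query-gpu=memory.used,temperature.gpu --format=csv,noheader,nounits` on an interval; that collection step is not shown here.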
Verdict
Soak testing before launch is cheap insurance against the kind of incident that takes down production a week after deploy. Roughly £20 of GPU time plus a weekend of synthetic traffic prevents the "everything was fine yesterday" class of failure. It is standard SRE practice, and particularly worthwhile for AI given GPU thermal and KV-cache dynamics.
Bottom line
Run a soak test before launch. See load test guide.