Teams transitioning from a hosted API to self-hosted AI hit recurring pitfalls. Some surface immediately (cold starts, capacity limits); others emerge weeks later (eval drift, cost creep). Knowing them in advance is most of the defense.
Common pitfalls: underestimating ops time, missing cold-start latency, no eval baseline before migration, no monitoring before going live, frontier-quality regression on hard cases, capacity surprise on launch traffic, cost creep from misconfigured caching, residency gaps. Each has an avoidance pattern; the surprise is usually preventable.
Pitfalls
- Underestimated ops time: "just run vLLM" vs the reality of monitoring, deploys, incident response
- Cold-start latency: 30-90s vLLM startup; users notice during deploys without blue-green
- No eval baseline before migration: can't prove quality didn't regress vs hosted
- No monitoring before going live: blind to production behaviour
- Frontier-quality regression on hard cases: open-weight covers 90%; the hard 10% needs hosted-API fallback
- Capacity surprise: load test passed; production surfaced patterns the test missed
- Cost creep: caching disabled by accident; KV cache pressure; over-provisioned headroom
- Residency gaps: discovered mid-enterprise-sale that some component still calls US-region service
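The frontier-quality pitfall above is the one with a clean structural fix: keep a hosted-API escape hatch in the routing layer. A minimal sketch, with `local_model` and `hosted_api` as hypothetical stand-ins for the real backends:

```python
def local_model(prompt: str) -> str:
    # Stand-in for the self-hosted open-weight model (hypothetical).
    # Raises when the request is outside what the local model handles well.
    if "hard" in prompt:
        raise RuntimeError("low-confidence case")
    return f"local:{prompt}"

def hosted_api(prompt: str) -> str:
    # Stand-in for the hosted frontier API (hypothetical).
    return f"hosted:{prompt}"

def route(prompt: str) -> str:
    """Prefer the self-hosted model; fall back to the hosted API when
    the local path fails or flags a case it can't handle."""
    try:
        return local_model(prompt)
    except RuntimeError:
        return hosted_api(prompt)
```

In practice the fallback trigger would be an error, a timeout, or a confidence/complexity classifier rather than a substring check; the shape is the same.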
Avoidance
- Budget realistic ops time (~0.5-1 FTE, pro-rated across the team)
- Blue-green deploys hide cold start
- Build eval harness BEFORE migrating; baseline on hosted API
- Observability stack live before traffic cutover
- Always include hosted-API fallback in routing
- Soak test pre-launch (24-72 hours sustained synthetic traffic)
- Verify caching enabled in production; track hit rates
- Audit data flows for residency early in design
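The eval-baseline item is the easiest to defer and the costliest to skip. The core is small: score the hosted API on a fixed set before migrating, then gate the cutover on the self-hosted model staying within tolerance. A sketch with an assumed exact-match metric and a hypothetical tolerance of 2 points:

```python
def exact_match(pred: str, gold: str) -> float:
    # Simplest possible metric; real harnesses use task-specific scoring.
    return 1.0 if pred.strip() == gold.strip() else 0.0

def score(outputs: list[str], golds: list[str]) -> float:
    # Mean exact-match over the eval set.
    return sum(exact_match(p, g) for p, g in zip(outputs, golds)) / len(golds)

def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    # True if the self-hosted candidate drops more than `tolerance`
    # below the hosted-API baseline captured before migration.
    return candidate < baseline - tolerance
```

Run `score` against hosted-API outputs first to pin the baseline; the same harness then answers "did quality regress?" with a number instead of a debate.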
Verdict
Self-hosted AI pitfalls are mostly preventable with honest planning. Budget ops time realistically; build observability + eval before traffic; always have fallback; soak test; track caching. The teams that transition smoothly do these consistently; the teams that struggle skip them.
Bottom line
Plan ops time honestly; build foundations first. See migration playbook.