After hundreds of customer deployments, the same mistakes recur. This is the consolidated list.
Most expensive mistakes: not enabling FP8 (50% throughput left on the table), not pinning model commits (silent regressions), over-spec'ing GPU (paying for capacity you don't use), skipping prefix caching (30-50% throughput forgone).
The mistakes
- Not enabling FP8 on Blackwell — leaves 50% throughput unclaimed
- Not pinning model commit SHAs — quality regresses silently when HF hub tags move
- Over-spec'ing GPU — running embeddings-only on a 5090
- Skipping prefix caching — 30-50% free throughput ignored
- Default vLLM `max-num-seqs` — 256 is too high for 16-24 GB cards and OOMs under load (all four flag-level fixes appear in the launch sketch after this list)
- Putting Ollama in front of paying users — production needs vLLM or TGI
- No eval harness — quality regressions ship silently (a minimal gate is sketched below)
- No fallback model — a 70B outage with no plan B is a bad afternoon (fallback routing sketched below)
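A launch that avoids the first four mistakes might look like the sketch below, using vLLM's offline Python API (the same options exist as flags on `vllm serve`). The model name and commit SHA are placeholders, and parameter names match recent vLLM releases, so check them against your installed version.

```python
from vllm import LLM, SamplingParams

# Sketch only -- model name and commit SHA are placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    revision="0123abc",            # pin an exact HF commit, not a movable tag
    quantization="fp8",            # claim the FP8 throughput on supporting GPUs
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
    max_num_seqs=64,               # the 256 default OOMs 16-24 GB cards under load
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM is allowed to claim
)

# Smoke test: one short generation to confirm the engine came up.
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```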
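For the eval harness, even a tiny deploy-blocking gate beats nothing. Here is a minimal sketch against vLLM's OpenAI-compatible endpoint; the URL, model name, golden cases, and threshold are placeholders, and a real harness needs a much larger suite and proper scoring.

```python
from openai import OpenAI

# Placeholder endpoint -- point at your vLLM server's OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Hypothetical golden cases: prompt plus a substring the answer must contain.
CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def eval_pass_rate(model: str) -> float:
    passed = 0
    for prompt, expected in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,   # keep runs comparable across deploys
            max_tokens=64,
        )
        passed += expected in resp.choices[0].message.content
    return passed / len(CASES)

if __name__ == "__main__":
    rate = eval_pass_rate("meta-llama/Llama-3.1-8B-Instruct")
    assert rate >= 0.9, f"pass rate {rate:.2f} below threshold -- block this deploy"
```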
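And for the fallback, a sketch of plan-B routing: try the primary endpoint, fall back to a smaller standby on error or timeout. Both endpoints and model names are placeholders; real code would add retries, logging, and an alert when the fallback fires.

```python
from openai import OpenAI

# Hypothetical backends, tried in order: the big model first, plan B second.
BACKENDS = [
    ("http://primary:8000/v1", "meta-llama/Llama-3.1-70B-Instruct"),
    ("http://standby:8000/v1", "meta-llama/Llama-3.1-8B-Instruct"),
]

def complete(prompt: str) -> str:
    for base_url, model in BACKENDS:
        try:
            client = OpenAI(base_url=base_url, api_key="unused", timeout=30.0)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )
            return resp.choices[0].message.content
        except Exception:
            continue  # backend down or timing out: try the next one
    raise RuntimeError("all inference backends unavailable")
```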
Verdict
Each mistake is fixable with a config change. Each one costs real money or quality while it stands.
Bottom line
Audit your deployment against this list. Most teams hit 3-4 of these on their first ship. See build a production AI inference server.