FAQ
Recurring questions from teams evaluating self-hosted AI, with honest answers. Most concerns are addressable; some genuinely point toward staying on a hosted API.
Common Qs: how much can we save (~50-90% at scale), how hard is it (~2-4 weeks setup + ongoing ops), is quality good enough (yes for ~90% of tasks; frontier API for hardest 5-10%), can we still scale (yes; ~2-3× capacity per tier), is it future-proof (open-weight ecosystem accelerating).
Cost questions
Q: How much can we actually save? A: 50-90% on production traffic vs frontier API at >30M tokens/month. Smaller savings at lower volume; savings compound with growth.
Q: What about ops cost? A: Real and worth budgeting. ~0.5-1 FTE pro-rated; ~£500-3,000/month depending on scale. Still net saving above ~50M tokens/month.
Q: What if we don't scale that big? A: Stay on hosted API; self-hosted economics dominate at scale. Below ~30M tokens/month, hosted API often simpler.
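The break-even logic above can be sketched as a simple cost model. All prices and fixed costs below are placeholder assumptions, not quotes; plug in your own contract rates and token mix. The shape of the result is what matters: hosted cost scales linearly with volume, self-hosted cost is mostly fixed, so savings appear only past a crossover volume.

```python
# Illustrative break-even sketch. Every number here is an assumption:
# blended $/M-token price, GPU amortisation, and ops cost all vary by deployment.

def hosted_cost(tokens_m: float, price_per_m: float = 20.0) -> float:
    """Monthly hosted-API cost: volume (millions of tokens) * blended price."""
    return tokens_m * price_per_m

def self_hosted_cost(tokens_m: float,
                     gpu_fixed: float = 800.0,    # assumed GPU rental/amortisation per month
                     ops_fixed: float = 1000.0,   # assumed pro-rated ops engineering per month
                     marginal_per_m: float = 0.5  # assumed power/overflow cost per M tokens
                     ) -> float:
    """Monthly self-hosted cost: fixed infra + ops, plus a small marginal cost."""
    return gpu_fixed + ops_fixed + tokens_m * marginal_per_m

def saving_pct(tokens_m: float) -> float:
    """Percentage saved by self-hosting at a given monthly volume (negative below break-even)."""
    h = hosted_cost(tokens_m)
    return 100.0 * (h - self_hosted_cost(tokens_m)) / h

for volume in (10, 30, 100, 300):  # millions of tokens per month
    print(f"{volume:>4}M tokens/month: saving {saving_pct(volume):+.0f}%")
```

With these placeholder numbers the crossover lands near 100M tokens/month; tighter ops costs or pricier frontier output tokens pull it lower, which is how the ~50M figure above arises.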
Ops questions
Q: How hard is it really? A: 2-4 weeks for production-grade initial setup. Ongoing: ~10 hours/week ops engineering for a moderate deployment.
Q: What if our GPU fails? A: Standard DR pattern: warm standby + DNS failover + replicated vector store. RTO < 5 minutes achievable.
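The failover path can also be exercised at the application layer. A minimal sketch, assuming the primary and standby endpoints are represented as callables so the example stays self-contained; in production these would be HTTP calls behind health-checked DNS, and the retry/backoff values are illustrative.

```python
import time

def call_with_failover(primary, standby, retries: int = 2, backoff_s: float = 0.1):
    """Try the primary inference endpoint; on repeated connection failure,
    fail over to the warm standby. The standby serves from the replicated
    vector store, so reads continue with no data loss."""
    for attempt in range(retries):
        try:
            return primary()
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # brief exponential backoff
    return standby()  # warm standby takes over
```

App-level fallback like this complements (rather than replaces) DNS failover: it covers the window before DNS TTLs expire, which is how an RTO under 5 minutes stays achievable.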
Q: What about model updates? A: Blue-green deploy + eval-gated canary rollout. ~2 weeks per major model upgrade with proper process.
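The "eval-gated" part of the canary rollout reduces to a simple promotion check. A sketch, assuming per-suite scores in [0, 1]; the suite names and the 2% regression tolerance are illustrative, not a recommendation.

```python
def gate_canary(canary_scores: dict[str, float],
                baseline_scores: dict[str, float],
                max_regression: float = 0.02) -> bool:
    """Promote the canary (green) model only if no eval suite regresses by
    more than `max_regression` (absolute) versus the production (blue)
    baseline. A suite missing from the canary run counts as a failure."""
    for suite, base in baseline_scores.items():
        if canary_scores.get(suite, 0.0) < base - max_regression:
            return False  # regression detected: keep traffic on blue
    return True  # safe to shift traffic to green
```

Running this gate at each traffic step (e.g. 5% → 25% → 100%) is what keeps a major model upgrade to a predictable ~2-week cycle rather than an emergency rollback.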
Quality questions
Q: Is quality really comparable to GPT-4? A: For ~90% of production tasks, yes (Llama 3.3 70B / Qwen 2.5 72B). For frontier-hardest 5-10%, hosted API still leads. Hybrid is the answer.
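The hybrid answer is ultimately a routing decision. A minimal sketch, assuming a `difficulty` score in [0, 1] produced upstream (e.g. by a cheap classifier over the prompt); the threshold and backend names are placeholders.

```python
def route(difficulty: float, threshold: float = 0.9) -> str:
    """Send routine traffic to the self-hosted model and only the hardest
    tail to the frontier API. The 0.9 threshold is illustrative; tune it
    so roughly 5-10% of requests escalate."""
    if difficulty >= threshold:
        return "frontier-api"   # hosted frontier model for the hardest tail
    return "self-hosted-70b"    # e.g. Llama 3.3 70B for the ~90% routine case
```

Because the escalated tail is small, the blended cost stays close to pure self-hosted while quality on the hardest tasks matches the frontier API.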
Q: What about new model capabilities? A: Open-weight ecosystem moves fast; new capabilities (reasoning, multimodal, long context) typically arrive in open-weight 3-6 months after frontier API.
Q: Can we fine-tune for our domain? A: Yes — QLoRA fine-tuning is mature; multi-LoRA serving allows per-tenant variants. Frontier APIs offer limited fine-tuning compared to self-hosted flexibility.
Verdict
Most concerns about self-hosted AI are addressable. Cost saving is real; ops cost is real but worth budgeting; quality is sufficient for ~90% of tasks; ecosystem is mature. The right answer for most production deployments above SMB scale is self-hosted with hosted-API fallback. For experimental / very low volume / frontier-quality-critical workloads, hosted API remains right.
Bottom line
Most concerns are addressable; honest answers help decision-making. See the self-hosted vs hosted-API comparison.