This is the field-guide summary of self-hosted AI as of April 2026, pulled together from 1,000 production-pattern posts on dedicated GPU AI infrastructure. It is the reference for teams committing to self-hosted as their production architecture.
Self-hosted dedicated GPU is the production default for AI deployments above ~30M tokens/month or with residency requirements. Stack: vLLM + Mistral / Llama / Qwen 7B-70B + BGE embeddings + Qdrant + LiteLLM router + frontier API fallback. Ops: DCGM + Prometheus + structured logs + eval harness + on-call. UK / EU residency simplifies compliance under most regulatory frameworks. Hybrid (self-hosted + frontier API) is the dominant production pattern.
When self-hosted
- Cost: above ~30M tokens/month, dedicated GPU dominates per-token API (break-even sketch below)
- Residency: UK / EU regulated data — self-hosted in region simplifies compliance
- Custom fine-tuning: per-tenant LoRAs / domain-specific behaviour
- Predictable cost: fixed monthly budget vs variable per-token
- Data sovereignty: avoid third-party AI vendor in data path
Stay on hosted API when: pre-Series-A experimentation, bursty workloads, frontier-model quality required for > 50% of traffic, no ops capacity.
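A back-of-envelope check on the ~30M figure, as a minimal Python sketch. The £4 per 1M tokens blended API rate is an illustrative assumption, not a quote (plug in your actual provider pricing); the £119/mo figure is the 5060 Ti tier from the stack below.

```python
# Break-even: fixed GPU rental vs per-token API pricing.
# The £4/1M-token blended rate is an illustrative assumption.

def breakeven_tokens(gpu_monthly_gbp: float, api_gbp_per_million: float) -> float:
    """Monthly token volume at which a dedicated GPU matches per-token API spend."""
    return gpu_monthly_gbp / (api_gbp_per_million / 1_000_000)

# 5060 Ti at £119/mo vs an assumed £4/1M blended API rate:
print(f"{breakeven_tokens(119, 4.0):,.0f} tokens/month")  # 29,750,000 — the ~30M threshold
```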
The stack
- Hardware: 5060 Ti (£119/mo) for SMB 7B; 4090 (£289/mo) for 13B; 5090 (£399/mo) for 14B+ premium; 6000 Pro (£899/mo) for 70B FP8
- Models: Llama 3.1 8B (general), Mistral 7B (English), Qwen 2.5 7B (multilingual), Llama 3.3 70B (frontier-class)
- Serving: vLLM (default), TensorRT-LLM (max throughput), SGLang (structured / agent); see the serving sketch after this list
- RAG: BGE-large + BGE-reranker-v2-m3 + Qdrant; hybrid search; contextual retrieval at indexing (retrieval sketch below)
- Routing: LiteLLM with self-hosted primary + frontier API fallback for hardest 5-10% (router sketch below)
- Custom: TRL + PEFT QLoRA for fine-tuning; LoRAX / vLLM `--enable-lora` for multi-tenant adapters (per-tenant LoRA appears in the serving sketch below)
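A minimal serving sketch using vLLM's offline Python API, assuming the Llama 3.1 8B base model from the list above with per-tenant LoRA adapters; the adapter path and tenant name are hypothetical placeholders. `enable_lora=True` is the Python-API counterpart of `--enable-lora` on `vllm serve`.

```python
# One base model, per-tenant LoRA adapters routed at request time.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,   # Python-API equivalent of `vllm serve --enable-lora`
    max_loras=4,        # max adapters active in a single batch
)

params = SamplingParams(temperature=0.2, max_tokens=256)

# Route this request through tenant A's adapter (hypothetical name and path);
# unadapted traffic simply omits lora_request.
out = llm.generate(
    ["Summarise this ticket: ..."],
    params,
    lora_request=LoRARequest("tenant-a", 1, "/adapters/tenant-a"),
)
print(out[0].outputs[0].text)
```

A hedged retrieval sketch of the RAG pattern above: BGE-large dense vectors in Qdrant, over-fetch, then cross-encoder rerank with BGE-reranker-v2-m3. The `docs` collection name and `text` payload field are hypothetical; hybrid search and contextual indexing are omitted for brevity.

```python
# Dense retrieval + rerank: over-fetch top_k candidates, rerank to final_k.
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
client = QdrantClient(url="http://localhost:6333")

def retrieve(query: str, top_k: int = 20, final_k: int = 5) -> list[str]:
    hits = client.search(
        collection_name="docs",                        # hypothetical collection
        query_vector=embedder.encode(query).tolist(),
        limit=top_k,
    )
    texts = [h.payload["text"] for h in hits]          # hypothetical payload field
    scores = reranker.predict([(query, t) for t in texts])
    ranked = sorted(zip(scores, texts), reverse=True)
    return [t for _, t in ranked[:final_k]]
```

And a minimal LiteLLM Router sketch of the primary-plus-fallback pattern; the endpoint URL and the choice of frontier model are placeholders.

```python
# Self-hosted vLLM endpoint as primary; frontier API as automatic fallback.
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "hosted_vllm/meta-llama/Llama-3.1-8B-Instruct",
                "api_base": "http://gpu-box:8000/v1",  # your vLLM server (placeholder)
            },
        },
        {
            "model_name": "frontier",
            "litellm_params": {
                "model": "gpt-4o",                     # any frontier API model
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"primary": ["frontier"]}],  # escalate when primary fails
)

resp = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Classify this support ticket."}],
)
print(resp.choices[0].message.content)
```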
Ops
- Observability: DCGM Exporter + Prometheus + Grafana + structured JSON logs + OpenTelemetry traces
- Eval: RAGAS + custom harness; CI gate on every change (gate sketch after this list)
- Deploy: blue-green with eval-gated canary; feature-flag rollback path
- Compliance: per-tenant collections; comprehensive audit logs; UK / EU residency
- Cost: per-tenant attribution; per-feature cost; semantic + prefix caching (attribution sketch below)
- Team: 4-5 roles (app, infra, ML/eval, data); 50-500-person org runs comfortably on a 4090
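A skeletal, framework-agnostic sketch of the CI eval gate: replay a pinned eval set against the candidate deployment and exit non-zero on regression so the pipeline blocks the release. `ask_model` and `score` stand in for your inference client and your RAGAS / custom metrics; the thresholds are illustrative.

```python
# CI eval gate: fail the pipeline if any metric mean drops below its floor.
import json
import statistics
import sys

THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}  # illustrative floors

def run_gate(eval_path: str, ask_model, score) -> None:
    """eval_path: JSONL of eval cases; ask_model/score are your client and metrics."""
    cases = [json.loads(line) for line in open(eval_path)]
    per_metric: dict[str, list[float]] = {m: [] for m in THRESHOLDS}
    for case in cases:
        answer = ask_model(case["question"])
        for metric, value in score(case, answer).items():
            per_metric[metric].append(value)
    for metric, floor in THRESHOLDS.items():
        mean = statistics.mean(per_metric[metric])
        print(f"{metric}: {mean:.3f} (floor {floor})")
        if mean < floor:
            sys.exit(f"eval gate failed on {metric}")  # non-zero exit blocks CI
```

And a toy per-tenant attribution sketch: amortise the fixed GPU rental over a budgeted monthly throughput to get an internal £/token rate, then meter usage per (tenant, feature). The 100M-token throughput figure is an assumption; in production this is a metrics pipeline, not an in-memory dict.

```python
# Per-tenant, per-feature cost attribution against an amortised GPU rate.
from collections import defaultdict

GPU_MONTHLY_GBP = 289.0        # e.g. the 4090 tier above
BUDGETED_TOKENS = 100_000_000  # assumed monthly throughput for amortisation
RATE = GPU_MONTHLY_GBP / BUDGETED_TOKENS  # internal £/token, not a vendor quote

usage: dict[tuple[str, str], int] = defaultdict(int)

def record(tenant: str, feature: str, prompt_tokens: int, completion_tokens: int) -> None:
    usage[(tenant, feature)] += prompt_tokens + completion_tokens

def monthly_report() -> dict[tuple[str, str], float]:
    return {key: tokens * RATE for key, tokens in usage.items()}

record("acme", "rag-chat", 1_200, 300)    # hypothetical tenant and feature
print(monthly_report())                   # {('acme', 'rag-chat'): 0.004335}
```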
Verdict
Self-hosted dedicated GPU AI is the production default in 2026 for any deployment above SMB scale. The economics, model quality, operational tooling, and compliance fit have all matured. The remaining hosted-API role is fallback for the hardest 5-10% of queries, plus prototyping and experimentation. For most production teams, hybrid is the answer.
Bottom line
Hybrid is the 2026 production default. See market state and stack blueprint.