By 2026, MLOps for self-hosted AI has stabilised around a recognisable stack: less hype than in 2022, more focus on production essentials. The right stack is mostly a composition of mature open-source primitives.
Stack: vLLM or TGI for serving, Hugging Face TRL + PEFT for fine-tuning, DVC or HF datasets for data versioning, MLflow or W&B for experiment tracking, RAGAS or a custom harness for eval, Prometheus + Grafana for metrics, structured JSON logs for traces, LiteLLM for routing, and feature flags for rollout. Most of it is open source; few specialist platforms are genuinely needed.
The stack
- Serving: vLLM (default), TGI (HF-aligned), TensorRT-LLM (max throughput)
- Fine-tuning: TRL (SFT/DPO/ORPO), PEFT (LoRA/QLoRA), bitsandbytes, Unsloth (faster on consumer GPUs)
- Data versioning: DVC, HF datasets with commit pinning, lakeFS for very large datasets
- Experiment tracking: MLflow (self-hosted), W&B (SaaS), Aim (lightweight self-hosted)
- Eval: RAGAS (RAG-specific), DeepEval, custom harness with LLM-as-judge
- Vector store: Qdrant, Weaviate, pgvector, Milvus
- Embeddings serving: TEI (HF), Sentence Transformers
- Reranker: BGE-reranker via TEI
- Orchestration: LangChain, LlamaIndex, native Python
- Prompt management: in-repo YAML (simple), PromptLayer / Braintrust (specialist)
- Routing: LiteLLM
- Observability: Prometheus + Grafana + Loki + OpenTelemetry
- Feature flags: GrowthBook (open-source), LaunchDarkly (SaaS)
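The in-repo prompt management entry above deserves a concrete shape. A minimal sketch of the pattern, using `string.Template` with prompts inlined for brevity (a real repo would keep the templates in YAML files under version control; the prompt names and variables here are illustrative, not a library API):

```python
# Prompts are addressed by (name, version); changing a prompt means adding
# a new version, so experiments and rollbacks stay reproducible.
from string import Template

PROMPTS = {
    ("summarise_ticket", "v2"): Template(
        "Summarise the following support ticket in at most ${max_words} words:\n${ticket}"
    ),
}

def render_prompt(name: str, version: str, **variables: str) -> str:
    """Look up a prompt by (name, version) and fill in its variables."""
    return PROMPTS[(name, version)].substitute(**variables)
```

Pinning the version string in config (rather than always using "latest") is what lets a feature flag roll a prompt change out gradually.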
Components
For an SMB or mid-market self-hosted AI deployment, a reasonable stack is:
- vLLM + Llama 3.1 8B FP8 + LoRAX for multi-tenant fine-tunes
- TRL + PEFT for periodic fine-tuning
- DVC for dataset versioning; W&B or MLflow for experiment tracking
- RAGAS in CI; custom harness for app-specific eval
- Qdrant + TEI BGE-large + reranker for RAG
- LangChain or LlamaIndex for orchestration
- Prompts in YAML in repo; feature flags via GrowthBook
- LiteLLM for routing + hosted-API fallback
- Prometheus + Grafana + Loki + OTel for observability
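The feature-flag rollout piece of this stack usually reduces to deterministic bucketing. A hedged sketch of the standard hash-based approach (a flag service like GrowthBook would supply `rollout_pct`; the model names are illustrative, and this is the generic technique rather than GrowthBook's SDK):

```python
# Deterministic percentage rollout: each user hashes to a stable bucket
# 0-99, and the low buckets get the fine-tuned candidate model. The same
# user always lands in the same bucket, so sessions stay consistent.
import hashlib

def pick_model(user_id: str, rollout_pct: int,
               candidate: str = "llama-3.1-8b-lora-v3",
               baseline: str = "llama-3.1-8b") -> str:
    """Route rollout_pct% of users to the candidate model, the rest to baseline."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return candidate if bucket < rollout_pct else baseline
```

The returned model name can be passed straight to the LiteLLM routing layer, so flag evaluation and routing stay decoupled.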
Integrations
The integrations that matter:
- Logs ↔ experiments: feed production logs into eval datasets via MLflow
- Eval → CI: every PR runs eval harness, gates merge
- Feature flag ↔ routing: LiteLLM reads feature flag for traffic split
- Observability ↔ alerting: Prometheus → Alertmanager → Slack/PagerDuty
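The eval → CI integration is the one most worth making concrete. A hedged sketch of a merge gate: the eval harness (RAGAS or custom) produces per-metric scores, and a small check fails the build on regression. Metric names and thresholds below are illustrative:

```python
# Eval gate for CI: return the metrics that fall below their floor.
# An empty list means the PR passes; missing metrics count as failures.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}

def gate(scores: dict[str, float],
         thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the list of metrics below threshold; empty list means pass."""
    return [m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor]
```

In the CI job, `sys.exit(1 if gate(scores) else 0)` turns the result into a merge block.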
Verdict
The 2026 MLOps stack is mature open-source primitives composed thoughtfully. Few problems genuinely need specialist platforms; most teams over-buy. Start with the open-source primitives; add platforms when specific gaps emerge.
Bottom line
Compose open-source primitives; buy platforms only for real gaps. See stack blueprint.