The open-source LLM hosting stack has matured in 2026 around a small number of well-tested patterns. This page is the consolidated architectural overview we hand to teams starting fresh.
The default stack: vLLM + LiteLLM + Caddy on dedicated GPU hardware, Prometheus + Grafana for observability, and Qdrant for RAG vector storage. systemd manages processes, and Ubuntu 22.04 LTS with NVIDIA driver 555+ is the baseline platform.
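Seen from application code, the whole stack presents as a single OpenAI-compatible endpoint behind Caddy and LiteLLM. A minimal client sketch, assuming a hypothetical internal hostname, model alias, and per-team key:

```python
# Minimal sketch of an application client talking to the stack through the
# LiteLLM gateway's OpenAI-compatible endpoint. Hostname, model alias, and
# API key are placeholders for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # Caddy -> LiteLLM -> vLLM
    api_key="sk-team-key",                           # per-key budget enforced by LiteLLM
)

response = client.chat.completions.create(
    model="qwen-2.5-14b",  # model alias defined in the LiteLLM config
    messages=[{"role": "user", "content": "Summarise this incident report in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the gateway speaks the OpenAI API, swapping the underlying model or engine is a config change in LiteLLM, not an application change.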
Architecture layers
- Hardware: dedicated GPU server (RTX 5060 Ti / 5090 / 6000 Pro tier)
- OS: Ubuntu 22.04 LTS, NVIDIA driver pinned
- Inference engine: vLLM (default), TGI (alt), Ollama (dev)
- Router / auth: LiteLLM with per-key budgets and rate limits
- Reverse proxy: Caddy with TLS / mTLS
- Vector store: Qdrant (preferred) or pgvector (see the TEI + Qdrant retrieval sketch after this list)
- Embeddings: Text Embeddings Inference (TEI)
- Observability: Prometheus + Grafana + DCGM exporter
- Process management: systemd (single server) or K3s (small cluster)
- Logging: structured JSON to your SIEM
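The embeddings and vector-store layers meet in the retrieval path. A minimal sketch, assuming hypothetical internal URLs, a `docs` collection in Qdrant, and the `requests` and `qdrant-client` packages:

```python
# Minimal RAG retrieval sketch: TEI serves embeddings over HTTP, Qdrant
# stores and searches the vectors. URLs, collection name, and payload
# contents are assumptions for illustration.
import requests
from qdrant_client import QdrantClient

TEI_URL = "http://tei.internal:8080"        # Text Embeddings Inference
QDRANT_URL = "http://qdrant.internal:6333"  # Qdrant HTTP API

def embed(text: str) -> list[float]:
    # TEI's /embed endpoint returns one vector per input string.
    resp = requests.post(f"{TEI_URL}/embed", json={"inputs": text}, timeout=10)
    resp.raise_for_status()
    return resp.json()[0]

def retrieve(query: str, top_k: int = 5):
    client = QdrantClient(url=QDRANT_URL)
    hits = client.search(
        collection_name="docs",        # hypothetical collection
        query_vector=embed(query),
        limit=top_k,
    )
    return [(hit.score, hit.payload) for hit in hits]

if __name__ == "__main__":
    for score, payload in retrieve("How do I rotate API keys?"):
        print(f"{score:.3f}  {payload}")
```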
Major decisions
- Single server or cluster? Start single. Scale out only when traffic forces it.
- Quantisation strategy? FP8 by default on Blackwell, AWQ-INT4 when memory is tight.
- Open-weight model selection? Mistral 7B, Llama 3.1 8B, Qwen 2.5 14B for chat. Specialised models per workload.
- Hosted-API fallback? Yes — keep one wired up for overflow and frontier-quality queries (see the fallback sketch after this list).
- Self-host embeddings or call API? Self-host above ~10K embeddings/day.
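The fallback decision reduces to one routing function: try the local stack first, fall back to a hosted provider on errors or overload. A minimal sketch, assuming the same hypothetical gateway endpoint and placeholder keys and model names:

```python
# Minimal sketch of a hosted-API fallback: prefer the local vLLM stack,
# fall back to a hosted provider on timeouts or API errors. Endpoints,
# keys, and model names are placeholders.
from openai import OpenAI, APIError

local = OpenAI(base_url="https://llm.internal.example.com/v1", api_key="sk-local")
hosted = OpenAI(api_key="sk-hosted-provider")  # hosted provider's default base_url

def complete(messages: list[dict], prefer_local: bool = True):
    if prefer_local:
        try:
            return local.chat.completions.create(
                model="qwen-2.5-14b", messages=messages, timeout=30
            )
        except APIError:
            pass  # overflow, timeout, or outage: fall through to the hosted API
    return hosted.chat.completions.create(model="hosted-frontier-model", messages=messages)

# Usage: complete([{"role": "user", "content": "Draft a status update."}])
```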
Anti-patterns
- Running LLM serving on multi-tenant cloud GPUs (preemption kills you)
- Skipping observability until production breaks (it always breaks)
- Over-spec’ing GPU for the wrong reason (50K embeddings/day doesn’t need a 5090)
- Not pinning model commit SHAs (silent quality regressions; see the pinning sketch after this list)
- Putting Ollama in front of paying users
- Hosting on consumer hardware without proper datacenter cooling
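One way to make the pinning anti-pattern concrete: resolve and download an exact commit with huggingface_hub, then point the serving layer at that snapshot. A minimal sketch; the repo id is an example and the revision is a placeholder for the SHA you actually validated:

```python
# Minimal sketch of pinning a model to an exact commit SHA so upstream repo
# updates cannot silently change what you serve. Repo id is an example;
# the revision is a placeholder for the commit you validated.
from huggingface_hub import snapshot_download

MODEL_REPO = "Qwen/Qwen2.5-14B-Instruct"
PINNED_REVISION = "<commit-sha-you-validated>"  # never a moving branch like "main"

local_path = snapshot_download(repo_id=MODEL_REPO, revision=PINNED_REVISION)

# Point the inference engine at local_path (or pass the same revision to it)
# instead of letting it resolve the repo's default branch at startup.
print(local_path)
```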
Verdict
The 2026 open-source LLM stack is well-trodden. Pick the defaults, pin the versions, and ship the boring infrastructure first, the model second.
Bottom line
For a complete deployment runbook, see build a production AI inference server.