Self-hosting an LLM is roughly a one-day project once you know what you're doing. This is the consolidated walk-through.
Pick a dedicated GPU server (5060 Ti for budget, 5090 for production). Install vLLM. Front it with LiteLLM for auth. Add Prometheus + Grafana. Run everything under systemd. Total time: under one day.
Decision: should you even self-host?
Self-hosting beats hosted APIs when:
- Token volume >1B/month
- Data residency / compliance
- Custom fine-tunes
- Sub-500ms latency in your region
- Predictable monthly cost
Stay on hosted APIs when traffic is spiky, you have no ops capacity, or you need frontier-quality reasoning.
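To sanity-check the token-volume threshold above, a back-of-envelope comparison helps. Every number below is a placeholder, not a quote; substitute your actual API pricing and server cost.
TOKENS_PER_MONTH=1000000000           # assumed volume: 1B tokens/month
API_CENTS_PER_MTOK=200                # hosted API price, cents per 1M tokens (placeholder)
SERVER_COST_PER_MONTH=900             # amortized GPU box + power, $/month (placeholder)
HOSTED=$(( TOKENS_PER_MONTH / 1000000 * API_CENTS_PER_MTOK / 100 ))
echo "hosted: ~\$${HOSTED}/mo  vs  self-hosted: ~\$${SERVER_COST_PER_MONTH}/mo"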
Pick the hardware
Match the workload to the GPU — see cheapest GPU for AI inference.
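As a rough sanity check before buying (a rule of thumb, not from the linked guide): weight memory is roughly parameters times bytes per parameter, plus headroom for KV cache and activations.
PARAMS_B=7          # model size in billions of parameters (Mistral 7B)
BYTES_PER_PARAM=1   # fp8 quantization; use 2 for fp16/bf16
echo "weights ~ $(( PARAMS_B * BYTES_PER_PARAM )) GB; leave roughly 2x headroom for KV cache at long context"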
Install
# Python 3.10 venv keeps vLLM's dependencies isolated
sudo apt install -y python3.10-venv
python3.10 -m venv ~/llm && source ~/llm/bin/activate

# Pin the vLLM version so upgrades are deliberate
pip install vllm==0.6.3

# Serve Mistral 7B fp8-quantized with prefix caching on port 8000
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization fp8 --enable-prefix-caching \
  --max-model-len 16384 --port 8000
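Once the server is up, hit the OpenAI-compatible endpoint it exposes. The model name must match the one you served.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
        "max_tokens": 16
      }'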
Auth and routing
Put LiteLLM in front for per-key auth, rate limiting, and fallbacks; terminate TLS with Caddy.
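A minimal sketch of that front end, assuming the LiteLLM proxy listens on port 4000 and llm.example.com is a placeholder domain; generate your own master key.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: mistral-7b
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://localhost:8000/v1
      api_key: "unused"             # vLLM does not check keys by default
general_settings:
  master_key: sk-REPLACE-ME         # admin key for minting per-team keys
EOF
pip install 'litellm[proxy]'
litellm --config litellm_config.yaml --port 4000

sudo tee /etc/caddy/Caddyfile >/dev/null <<'EOF'
llm.example.com {
    reverse_proxy localhost:4000    # Caddy obtains and renews TLS certs automatically
}
EOF
sudo systemctl reload caddy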
Monitor
Prometheus + Grafana + DCGM exporter + vLLM's built-in metrics. Alert on p99 TTFT (time to first token) and queue depth.
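A minimal scrape-config sketch, assuming vLLM exposes /metrics on port 8000 and the DCGM exporter runs on its default 9400. Metric names such as vllm:time_to_first_token_seconds and vllm:num_requests_waiting can shift between vLLM versions, so check /metrics before writing alert rules.
# Add these jobs under scrape_configs: in /etc/prometheus/prometheus.yml
  - job_name: vllm                   # TTFT histograms, queue depth, KV-cache usage
    static_configs:
      - targets: ['localhost:8000']
  - job_name: dcgm                   # GPU utilization, memory, temperature
    static_configs:
      - targets: ['localhost:9400']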
Production hardening
- systemd unit with Restart=on-failure (unit sketch after this list)
- Pin all versions (driver, vLLM, model commit SHA)
- Structured logs to your SIEM
- Backup model checkpoints + LoRA adapters
- Documented runbook
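A minimal unit sketch, assuming the venv from the install step lives at /home/llm/llm and the service runs as an llm user; adjust paths and user to your layout.
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM inference server
After=network-online.target
Wants=network-online.target

[Service]
User=llm
ExecStart=/home/llm/llm/bin/vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization fp8 --enable-prefix-caching \
  --max-model-len 16384 --port 8000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now vllm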
Bottom line
Self-hosting an LLM is no longer hard. The hard parts moved to data and evaluation. See build a production AI inference server.