Day one of a new RTX 5060 Ti 16GB server on our UK dedicated GPU hosting should leave you with a secured, monitored box running its first model. Work through this checklist in order – the whole thing takes roughly an hour including the first benchmark.
Contents
- Verify hardware
- Secure the box
- Install runtime stack
- Monitoring
- Performance tuning
- First serve and benchmark
Verify Hardware
nvidia-smi # Expect: RTX 5060 Ti 16GB, driver 560+
lspci | grep -i nvidia # Should list GB206
sudo dmesg | grep -i nvidia # No errors expected
sudo nvidia-smi -pm 1 # Enable persistence mode
If the driver is older than 560, reinstall it (see Ubuntu driver install). Persistence mode keeps the driver loaded between jobs, which shaves cold-start time.
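The driver check can be scripted rather than read off the screen. The sketch below is one way to do it, assuming the driver reports a dotted version string such as 570.86.16; the function names `driver_major` and `driver_at_least` are illustrative, and the minimum of 560 is the figure from this checklist.

```shell
#!/bin/sh
# Sketch: compare the installed driver's major version with a minimum.
# MIN_DRIVER_MAJOR comes from the checklist; function names are illustrative.
MIN_DRIVER_MAJOR=560

# Print the major component of a version string such as "570.86.16".
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed when the version's major component is >= the given minimum.
driver_at_least() {
  [ "$(driver_major "$1")" -ge "$2" ]
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_at_least "$version" "$MIN_DRIVER_MAJOR"; then
    echo "driver $version OK"
  else
    echo "driver $version is older than $MIN_DRIVER_MAJOR: reinstall" >&2
  fi
fi
```

On a box without `nvidia-smi` the script only defines the functions and prints nothing, so it is safe to keep in a provisioning script.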
Secure the Box
- Disable password SSH auth, keys only: edit /etc/ssh/sshd_config, set PasswordAuthentication no
- UFW allow-list: 22 (SSH), 80/443 (public apps only), deny everything else inbound
- sudo apt update && sudo apt full-upgrade -y && sudo reboot
- Install fail2ban for SSH brute-force protection
- Create a non-root user for all AI services; never serve vLLM as root
- Enable unattended-upgrades for security patches
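The first two items can be sketched as a pair of shell functions. This is an illustration under assumptions, not a hardening script: `password_auth_disabled` and `apply_firewall` are hypothetical names, the check reads a single file and does not follow `Include` directives, and the firewall function assumes `ufw` is installed and that 80/443 really should be open on this box.

```shell
#!/bin/sh
# Succeed when the given sshd_config has an active (uncommented)
# "PasswordAuthentication no" line. Does not follow Include directives.
password_auth_disabled() {
  grep -Eiq '^[[:space:]]*PasswordAuthentication[[:space:]]+no([[:space:]]|$)' "$1"
}

# Allow-list from the checklist. Defined only; call it by hand once you
# are sure key-based SSH login works, or you may lock yourself out.
apply_firewall() {
  sudo ufw default deny incoming
  sudo ufw allow 22/tcp
  sudo ufw allow 80/tcp
  sudo ufw allow 443/tcp
  sudo ufw --force enable
}

# Example: password_auth_disabled /etc/ssh/sshd_config && echo "keys only"
```

To see the configuration sshd actually applies, including any drop-in files, `sudo sshd -T | grep -i passwordauthentication` prints the effective value.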
Install Runtime Stack
| Layer | Install |
|---|---|
| CUDA toolkit 12.6 | sudo apt install cuda-toolkit-12-6 |
| Docker + NVIDIA Container Toolkit | See Docker CUDA setup |
| Python 3.12 + uv | curl -LsSf https://astral.sh/uv/install.sh \| sh |
| vLLM venv | uv venv ~/.venvs/vllm && source ~/.venvs/vllm/bin/activate && uv pip install vllm |
| Reverse proxy | Caddy (simplest TLS) or nginx |
Monitoring
Ship three signals to a dashboard from day one: GPU utilisation, VRAM usage, p99 request latency.
- DCGM Exporter on port 9400 for GPU metrics
- Node Exporter on 9100 for CPU/disk/network
- Prometheus scraping both, Grafana for dashboards
- Alert rules: p99 latency > 2s, GPU temp > 80°C, VRAM > 95%
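Before the Prometheus stack is wired up, the two GPU-side alert rules can be spot-checked from the shell. The sketch below is a stopgap under assumptions, not a replacement for DCGM Exporter: `check_gpu_line` is an illustrative name, the thresholds are the 80°C and 95% figures from the alert rules, and p99 latency is not covered because it has to come from the serving layer.

```shell
#!/bin/sh
# Thresholds taken from the alert rules in this checklist.
MAX_TEMP_C=80
MAX_VRAM_PCT=95

# Input: one CSV line "temp_c, mem_used_mib, mem_total_mib".
# Output: "ok", or the names of the breached rules ("temp", "vram").
check_gpu_line() {
  printf '%s\n' "$1" | awk -F', *' -v t="$MAX_TEMP_C" -v v="$MAX_VRAM_PCT" '{
    out = ""
    if ($1 + 0 > t) out = out "temp "
    if ($3 > 0 && ($2 / $3) * 100 > v) out = out "vram "
    sub(/ $/, "", out)
    print (out == "" ? "ok" : out)
  }'
}

# Live check, one line per GPU, only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=temperature.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | while IFS= read -r line; do
    check_gpu_line "$line"
  done
fi
```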
Performance Tuning
- Persistence mode on: sudo nvidia-smi -pm 1
- CPU governor to performance: sudo cpupower frequency-set -g performance
- Disable transparent huge pages for latency workloads: echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
- Move HuggingFace cache to fastest NVMe: export HF_HOME=/fast-nvme/hf
- Ensure PCIe is negotiated at Gen 5 x8: check with sudo lspci -vv | grep LnkSta
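The link check can be reduced to a yes/no answer by parsing the LnkSta line. The sketch below assumes the usual `lspci -vv` wording ("Speed 32GT/s, Width x8", optionally followed by "(ok)" or "(downgraded)") and relies on Gen 5 signalling at 32 GT/s per lane; `pcie_link_ok` is an illustrative name.

```shell
#!/bin/sh
# Succeed when an lspci "LnkSta:" line reports 32GT/s (Gen 5) at width x8.
pcie_link_ok() {
  speed="$(printf '%s\n' "$1" | sed -n 's/.*Speed \([0-9.]*\)GT\/s.*/\1/p')"
  width="$(printf '%s\n' "$1" | sed -n 's/.*Width x\([0-9]*\).*/\1/p')"
  [ "$speed" = "32" ] && [ "$width" = "8" ]
}

# Example, run by hand (10de is NVIDIA's PCI vendor ID):
#   sudo lspci -vv -d 10de: | grep 'LnkSta:' | while IFS= read -r line; do
#     pcie_link_ok "$line" && echo "Gen 5 x8" || echo "check link: $line"
#   done
```

GPUs commonly drop the link to a lower speed when idle to save power, so a reading below 32GT/s is worth re-checking while a model is loaded and serving before treating it as a fault.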
First Serve and Benchmark
Kick off Llama 3.1 8B FP8 with the standard config from our vLLM setup guide, then run the sanity test script and the benchmark script. Expected numbers:
| Metric | Pass threshold |
|---|---|
| TTFT p99 at batch 8 | < 500 ms |
| Decode t/s at batch 1 | > 100 |
| GPU temp under load | < 78°C |
| Aggregate throughput batch 32 | > 650 t/s |
If everything hits the marks you’re ready for your first real traffic.
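To make the pass/fail decision repeatable, the four thresholds can be encoded in a small helper. This is a sketch that assumes you copy the measured numbers out of the benchmark output by hand; `check_benchmark` and its argument order are illustrative and are not part of the benchmark script.

```shell
#!/bin/sh
# Arguments: ttft_p99_ms decode_tps gpu_temp_c aggregate_tps
# Prints "PASS", or one "FAIL <metric> <value>" line per missed threshold.
# Exit status is 0 on pass and 1 on any failure.
check_benchmark() {
  awk -v ttft="$1" -v decode="$2" -v temp="$3" -v agg="$4" 'BEGIN {
    bad = 0
    if (!(ttft < 500))   { print "FAIL ttft_p99_ms " ttft;  bad = 1 }
    if (!(decode > 100)) { print "FAIL decode_tps " decode; bad = 1 }
    if (!(temp < 78))    { print "FAIL gpu_temp_c " temp;   bad = 1 }
    if (!(agg > 650))    { print "FAIL aggregate_tps " agg; bad = 1 }
    if (!bad) print "PASS"
    exit bad
  }'
}

# Example: check_benchmark 420 112 71 700
```

Because the function returns a non-zero status on failure, it can gate a deploy step directly, for example `check_benchmark 420 112 71 700 && echo "ready for traffic"`.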
Production-Ready in an Hour
UK dedicated hosting with drivers preinstalled.
Order the RTX 5060 Ti 16GB
See also: sanity test script, benchmark script, driver install, Docker CUDA setup, vLLM setup.