vLLM is a production-grade LLM inference server. This is the full setup guide for the RTX 5060 Ti 16GB on our hosting, starting from a fresh Ubuntu 22.04/24.04 box.
Prerequisites
- Ubuntu 22.04 or 24.04
- NVIDIA driver 570+ (the first branch with Blackwell support)
- CUDA 12.8+ (Blackwell/sm_120 support landed in 12.8)
- Python 3.10-3.12
Verify: nvidia-smi should show the 5060 Ti and driver 570+. See the driver install guide if it is not present.
Install
# uv for fast package management
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create Python env
uv venv --python 3.12 ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate
# Install vLLM
uv pip install vllm
# Verify CUDA compat
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
Launch Llama 3.1 8B FP8
huggingface-cli login # required: the Llama weights are gated on HuggingFace
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 16 \
--enable-chunked-prefill \
--enable-prefix-caching \
--gpu-memory-utilization 0.90 \
--port 8000
First launch downloads weights from HuggingFace (~16 GB). Subsequent launches are fast.
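Once the server is up, any OpenAI-compatible client can talk to it. A minimal smoke test, assuming the server is listening on localhost:8000 as configured above (the payload below is the standard OpenAI chat-completions format; `build_chat_request` is just a helper name for this sketch):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=64):
    """Build an OpenAI-compatible /v1/chat/completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000",
                         "meta-llama/Llama-3.1-8B-Instruct",
                         "Say hello in one word.")
# Send only after the server has finished starting up:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

The model name in the request must match the `--model` flag the server was launched with.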
systemd Service
Create /etc/systemd/system/vllm.service:
[Unit]
Description=vLLM OpenAI server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
ExecStart=/home/ubuntu/.venvs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--enable-chunked-prefill \
--enable-prefix-caching \
--port 8000
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo systemctl status vllm
journalctl -u vllm -f
Tuning
- --max-num-seqs 16 for chat SLAs – see batch tuning
- --kv-cache-dtype fp8 doubles your context – see FP8 KV cache
- --enable-prefix-caching for massive TTFT win on chat – see prefix caching
- --enable-chunked-prefill for smooth concurrency – see chunked prefill
- Optional: --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5 – see speculative decoding
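To see why the FP8 KV cache matters on a 16 GB card, here is the back-of-the-envelope arithmetic, using Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128):

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim = 32, 8, 128  # Llama 3.1 8B (GQA)

def kv_bytes(tokens, dtype_bytes):
    """Total KV cache size in bytes for one sequence of `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

ctx = 32768
fp16 = kv_bytes(ctx, 2) / 2**30  # 4.0 GiB for one full-length sequence
fp8 = kv_bytes(ctx, 1) / 2**30   # 2.0 GiB -- same context, half the memory
print(f"fp16: {fp16:.1f} GiB, fp8: {fp8:.1f} GiB per 32k-token sequence")
```

With the FP8 weights taking roughly 8 GB, the fp8 KV cache approximately doubles how many concurrent long sequences fit in the remaining VRAM.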
Order the RTX 5060 Ti 16GB
See also: FP8 Llama deployment, Ollama setup, TGI setup.