
RTX 5060 Ti 16GB vLLM Setup

Full vLLM setup on Blackwell 16GB - from fresh Ubuntu to tuned Llama 3 8B FP8 serving.

vLLM is a production-grade LLM inference server. This guide walks through a full setup on one of our RTX 5060 Ti 16GB servers, starting from a fresh Ubuntu 22.04/24.04 install.

Prerequisites

  • Ubuntu 22.04 or 24.04
  • NVIDIA driver 570+ (Blackwell requires the R570 driver series or newer)
  • CUDA 12.8+ (the first release with sm_120 support for Blackwell)
  • Python 3.10-3.12

Verify: nvidia-smi should show the 5060 Ti and driver 570+. See driver install if not present.

Install

# uv for fast package management
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create Python env
uv venv --python 3.12 ~/.venvs/vllm
source ~/.venvs/vllm/bin/activate

# Install vLLM
uv pip install vllm

# Verify CUDA compat
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

Launch Llama 3.1 8B FP8

huggingface-cli login   # required: Llama weights are gated on Hugging Face
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --port 8000

First launch downloads weights from HuggingFace (~16 GB). Subsequent launches are fast.
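The server exposes an OpenAI-compatible API on the port set above. A minimal stdlib-only client sketch (the endpoint and model name match the launch command; the actual request is left commented out so the snippet runs without a live server):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the server launched above.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Any OpenAI SDK pointed at http://localhost:8000/v1 works the same way; no API key is required unless you pass --api-key at launch.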

systemd Service

Create /etc/systemd/system/vllm.service:

[Unit]
Description=vLLM OpenAI server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
ExecStart=/home/ubuntu/.venvs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 8000
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
Reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo systemctl status vllm
journalctl -u vllm -f
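Because the unit runs non-interactively, the service also needs credentials for the gated Llama weights. One way to supply them (paths and the placeholder token here are assumptions — adjust to your setup) is a systemd drop-in at /etc/systemd/system/vllm.service.d/override.conf:

```ini
[Service]
# HF_TOKEN replaces an interactive huggingface-cli login; use your own token.
Environment=HF_TOKEN=hf_xxx
# Keep the model cache in a predictable location.
Environment=HF_HOME=/home/ubuntu/.cache/huggingface
```

After editing the drop-in, run sudo systemctl daemon-reload && sudo systemctl restart vllm.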

Tuning

  • --max-num-seqs 16 for chat SLAs – see batch tuning
  • --kv-cache-dtype fp8 halves KV-cache memory per token, roughly doubling usable context – see FP8 KV cache
  • --enable-prefix-caching for massive TTFT win on chat – see prefix caching
  • --enable-chunked-prefill for smooth concurrency – see chunked prefill
  • Optional: --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5 – see speculative decoding
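To see where the fp8 KV-cache saving comes from, here is the back-of-the-envelope arithmetic for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128 — figures from the model config; the script is purely illustrative):

```python
# KV-cache footprint per token: K and V tensors, per layer, per KV head, per head dim.
layers, kv_heads, head_dim = 32, 8, 128
elems_per_token = 2 * layers * kv_heads * head_dim  # 65,536 elements/token

ctx = 32768  # matches --max-model-len above

fp16_gib = elems_per_token * 2 * ctx / 2**30  # 2 bytes per element
fp8_gib = elems_per_token * 1 * ctx / 2**30   # 1 byte per element

print(f"fp16 KV cache @ {ctx} tokens: {fp16_gib:.1f} GiB")  # 4.0 GiB
print(f"fp8  KV cache @ {ctx} tokens: {fp8_gib:.1f} GiB")   # 2.0 GiB
```

With FP8 weights taking roughly 8 GB of the card's 16 GB, halving the KV cache from ~4 GiB to ~2 GiB at 32k context is what leaves headroom for activations and concurrent sequences.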


See also: FP8 Llama deployment, Ollama setup, TGI setup.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
