This tutorial takes a freshly provisioned RTX 4090 24GB dedicated server on Ubuntu 22.04 LTS to a production-ready vLLM endpoint serving Llama 3.1 8B at FP8 in roughly 25 minutes of attended time. Every command has been validated on UK GigaGPU stock; every flag has a stated reason for existing rather than being copied from a forum post. The end state is a systemd-managed service with Prometheus metrics, restart-on-failure, the Hugging Face fast downloader, FP8 weights, FP8 KV cache, prefix caching and chunked prefill — the full production posture rather than a “hello world” demo. For the wider hardware menu see dedicated GPU hosting.
Contents
- Prerequisites and base OS sanity
- NVIDIA driver, CUDA toolkit and power policy
- Python environment and dependency hygiene
- Installing vLLM and launching Llama 3.1 8B FP8
- systemd unit for production restart semantics
- Tuning flags table with the WHY of each
- Post-deploy verification and monitoring hooks
- Troubleshooting matrix and verdict
Prerequisites and base OS sanity
You should have SSH access as a user with sudo, an HF token from huggingface.co with the Llama 3.1 community licence accepted, and at least 80 GB of free disk on the system volume (model weights, vLLM JIT cache, Hugging Face cache). Confirm the GPU is visible to the kernel:
lspci | grep -i nvidia
uname -r
cat /etc/os-release
Why each line. lspci proves the card is enumerated on the PCIe bus before any driver work; if you see no NVIDIA line here the issue is BIOS, slot or cabling, not software. Expected output contains NVIDIA Corporation AD102 [GeForce RTX 4090] on a PCIe Gen 4 x16 slot — the spec is documented in our spec breakdown and PCIe Gen 4 x16 piece. uname -r returns the kernel version; you want 5.15+ or 6.x for stable driver 550 binding. cat /etc/os-release confirms Ubuntu 22.04 LTS — the recommended baseline for current vLLM. Update the base packages and install the build prerequisites:
sudo apt update && sudo apt -y upgrade
sudo apt -y install build-essential git curl ca-certificates gnupg lsb-release pkg-config
Reboot if the kernel was updated. build-essential is needed because some vLLM transitive dependencies (notably flash-attn) compile from source on first install. pkg-config is needed for some bindings. Skipping these will produce an obscure linker error 12 minutes into your pip install.
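The 80 GB free-disk prerequisite is worth checking before anything downloads. A convenience sketch using the stock Ubuntu python3 (the 80 GB figure and the cache breakdown come from the prerequisites above; adjust the path if your HF cache lives on another volume):

```python
import shutil

# Free space on the volume that will hold the weights and the HF cache
free_gb = shutil.disk_usage("/").free / 1e9
needed_gb = 80  # model weights + Hugging Face cache + vLLM JIT cache
print(f"free: {free_gb:.0f} GB -> {'OK' if free_gb >= needed_gb else 'insufficient'}")
```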
NVIDIA driver, CUDA toolkit and power policy
Two driver families currently support the 4090’s full feature set: the 550-series (stable, recommended) and the 555-series (newer, occasional regressions on unrelated workloads). FP8 native paths and FP8 KV cache require 550 or above, full stop. Anything earlier loads the model but silently falls back to FP16 KV, which will OOM mid-decode on long contexts.
distro=ubuntu2204
arch=x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-drivers-550 cuda-toolkit-12-4
sudo reboot
Why the keyring deb rather than the runfile installer: the apt path keeps the driver in sync with kernel updates via DKMS, so when Ubuntu auto-updates the kernel your GPU does not vanish on next boot. The prebuilt vLLM wheels bundle their own CUDA runtime via pip dependencies, so the system toolkit version matters mainly when something (flash-attn, a custom kernel) compiles from source; CUDA 12.4 ships the FP8 GEMM headers and is a safe match for the 550-series driver. After reboot, verify:
nvidia-smi
nvcc --version
Expected: an RTX 4090 line, 24,564 MiB total memory, driver 550.x or higher, CUDA Version: 12.4. Set the persistence daemon and pin the power limit to 400 W:
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 400
-pm 1 keeps the driver loaded between processes — without it, the first inference request after idle pays a 2-3 second cold-start penalty as the driver re-initialises. -pl 400 trims the default 450 W power limit to 400 W, costing roughly 3-4% throughput in exchange for noticeably steadier latency under sustained load and a 10-12% reduction in the electricity bill. Detail at power draw and efficiency and tokens per watt.
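Because FP8 KV silently degrades on pre-550 drivers, it pays to assert the driver version in any provisioning script rather than eyeball nvidia-smi. A minimal sketch (the helper functions are ours; the nvidia-smi query flags are standard):

```python
import subprocess

def driver_major(version: str) -> int:
    """'550.54.15' -> 550."""
    return int(version.split(".")[0])

def assert_fp8_capable(minimum: int = 550) -> None:
    """Raise if the installed driver predates the FP8 KV cache requirement."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout.strip()
    if driver_major(out) < minimum:
        raise RuntimeError(f"driver {out} < {minimum}: FP8 KV will fall back to FP16")
```

Call `assert_fp8_capable()` at the top of your deploy script so a stale driver fails loudly at provision time instead of OOMing mid-decode later.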
Python environment and dependency hygiene
vLLM 0.6.x officially supports Python 3.10 to 3.12. We use 3.11 because it has the best wheel coverage across the dependency graph and is the version we have hammered the most in production. Never install vLLM into the system Python; doing so guarantees an upgrade collision the first time apt touches python3.
sudo apt -y install python3.11 python3.11-venv python3.11-dev
python3.11 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip wheel
The python3.11-dev package is required because flash-attn compiles a CUDA extension against Python.h. Upgrading pip and wheel first means binary wheels are preferred over source builds for the rest of the install — the difference is roughly 8 minutes of saved compile time.
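To fail fast on an unsupported interpreter in CI or a provisioning script, a one-function check reflecting the 3.10-3.12 support window stated above (the helper is ours, not a vLLM API):

```python
import sys

def vllm_python_ok(info=sys.version_info) -> bool:
    # vLLM 0.6.x supports CPython 3.10, 3.11 and 3.12
    return (info[0], info[1]) in {(3, 10), (3, 11), (3, 12)}

print(vllm_python_ok())
```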
Installing vLLM and launching Llama 3.1 8B FP8
pip install vllm==0.6.3
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=hf_yourtoken
Pinning vLLM to 0.6.3 rather than floating to latest gives reproducible behaviour. hf_transfer activates the parallel multipart Rust downloader, which saturates fast links that the single-stream Python downloader cannot; on the ~16 GB Llama 3.1 8B checkpoint this typically cuts download time by a factor of two or more. The HF_TOKEN must come from an HF account that has accepted the Llama 3.1 community licence, otherwise the download returns 403 with no warning that licensing is the cause. Now launch the server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--max-model-len 65536 \
--max-num-seqs 32 \
--enable-chunked-prefill \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--port 8000
Each flag explained.
- --quantization fp8 activates Ada's E4M3 FP8 GEMM path, doubling tensor-core throughput versus FP16 and halving weight memory; vLLM quantises the checkpoint at load time with dynamic activation scaling, so no offline calibration step is needed.
- --kv-cache-dtype fp8 halves the KV cache footprint, doubling the concurrent token budget; the quality cost is below 0.05 perplexity. See FP8 tensor cores on Ada and the FP8 deployment guide for the underlying numbers.
- --max-model-len 65536 is generous; Llama 3.1 8B supports 131k natively, but 64k is the sweet spot for memory-budget reasons.
- --max-num-seqs 32 caps continuous batching at 32 concurrent sequences. Note that 32 sequences all at full 64k context would far exceed the KV pool; the cap is sized for typical mixed-length traffic, and vLLM preempts sequences when KV blocks run out rather than crashing.
- --enable-chunked-prefill interleaves prefill chunks with decode steps so a long-context request does not stall a fast-reply request; essential whenever your prompt-length distribution is bimodal.
- --enable-prefix-caching hashes the token prefix of incoming requests and reuses computed KV blocks on a hit; a RAG system with shared system prompts often sees 30-70% prefill savings.
- --gpu-memory-utilization 0.92 tells vLLM to size its KV pool so total usage reaches 92% of VRAM, leaving roughly 2 GB of headroom for spikes and other CUDA processes.
- --port 8000 is the default; expose it through your firewall only behind a reverse proxy with TLS and auth.
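To see how the KV budget actually divides up, here is a back-of-envelope calculation assuming Llama 3.1 8B's architecture (32 layers, 8 KV heads via GQA, head dim 128) and an ~8 GB FP8 weight footprint. It ignores activation and workspace memory, so treat vLLM's own startup log as the ground truth:

```python
# Back-of-envelope KV-cache budget for Llama 3.1 8B with FP8 KV on a 24 GB card.
layers, kv_heads, head_dim = 32, 8, 128   # GQA: 8 KV heads of dimension 128
bytes_per_elem = 1                        # FP8 E4M3
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
vram = 24 * 1024**3
weights = 8 * 1024**3                     # ~8 GB of FP8 weights (assumption)
kv_pool = int(0.92 * vram) - weights      # ignores activations and workspace
budget_tokens = kv_pool // kv_per_token
print(f"{kv_per_token} B/token of KV, ~{budget_tokens:,} token budget")
# -> 65536 B/token of KV, ~230,686 token budget
print(f"full 65,536-token contexts that fit concurrently: {budget_tokens // 65536}")
# -> full 65,536-token contexts that fit concurrently: 3
```

Only a handful of full-context sequences fit at once; the 32-sequence cap works because most real requests are far shorter than 64k.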
Smoke test from another shell:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
"messages":[{"role":"user","content":"Say hello in 5 languages."}],
"max_tokens":120}'
Expected: a JSON response in roughly 700-800 ms wall-clock (including network roundtrip on localhost). The usage field will report up to 120 output tokens (the max_tokens cap); at 195 t/s decode that is ~620 ms of pure decode plus ~50 ms of prefill. If the response is empty or contains finish_reason: length with junk content, the chat template did not load — check your tokenizer config and re-run.
systemd unit for production restart semantics
Create /etc/systemd/system/vllm.service:
[Unit]
Description=vLLM OpenAI API server (Llama 3.1 8B FP8)
After=network.target
[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment=HF_TOKEN=hf_yourtoken
Environment=HF_HUB_ENABLE_HF_TRANSFER=1
Environment=VLLM_LOGGING_LEVEL=INFO
ExecStart=/home/ubuntu/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 --kv-cache-dtype fp8 \
--max-model-len 65536 --max-num-seqs 32 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.92 --port 8000
Restart=always
RestartSec=10
LimitNOFILE=1048576
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -u vllm -f
Why each non-obvious choice. Restart=always with RestartSec=10 handles GPU hangs and OOM kills gracefully — the 10-second back-off prevents tight crash loops while still recovering inside a typical health-check window. LimitNOFILE=1048576 raises the file descriptor ceiling because vLLM under heavy concurrent load opens many sockets and shared-memory descriptors. TimeoutStartSec=300 is critical: the first boot of a new model can JIT-compile Marlin or FP8 kernels for several minutes, and the systemd default of 90 seconds will kill a still-healthy startup. VLLM_LOGGING_LEVEL=INFO ensures the boot log shows the quantisation and KV dtype confirmation lines you need for verification. Tail with journalctl -u vllm -f until you see Application startup complete.
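A deploy script can block on readiness before flipping traffic to the unit. A minimal probe against the OpenAI-compatible /v1/models route (the URL and timeouts are illustrative defaults, not vLLM settings):

```python
import json
import time
import urllib.request

def wait_ready(url: str = "http://localhost:8000/v1/models",
               timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll the OpenAI-compatible model list until the server answers."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return bool(json.load(resp).get("data"))
        except OSError:
            time.sleep(interval)  # server still booting or compiling kernels
    return False
```

Call `wait_ready()` after `systemctl start vllm`; the 300-second default mirrors TimeoutStartSec so the probe and systemd give up together.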
Tuning flags table with the WHY of each
| Flag | Default | Suggested for 4090 | Why |
|---|---|---|---|
| --gpu-memory-utilization | 0.90 | 0.92 | Squeezes ~500 MB extra KV cache safely |
| --max-num-seqs | 256 | 32 (8B), 16 (14B), 4 (70B) | Prevents KV thrash; tune per model size |
| --max-model-len | model max | 65536 (8B), 32768 (14B), 16384 (70B) | Bounds per-sequence KV allocation |
| --enable-chunked-prefill | off | on | Smooths long-prompt latency under mixed traffic |
| --enable-prefix-caching | off | on | 30-70% prefill saved on hot prefixes |
| --kv-cache-dtype | auto (model dtype) | fp8 | Halves KV footprint, <0.05 perplexity cost |
| --swap-space | 4 GiB | 8 GiB | Allows larger CPU spillover under bursts |
| --block-size | 16 | 16 | Default optimal for Ada |
| --disable-log-requests | off | on (after burn-in) | Reduces journal volume in steady state |
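The per-model-size suggestions can be captured in a small launcher helper, a convenience sketch around the table above (the preset dict and function are ours, not a vLLM API):

```python
# Suggested caps from the tuning table, keyed by model size (illustrative).
PRESETS = {"8B": (32, 65536), "14B": (16, 32768), "70B": (4, 16384)}

def tuning_flags(size: str) -> list[str]:
    """Build the CLI fragment for a given model size on a 24 GB card."""
    seqs, ctx = PRESETS[size]
    return ["--max-num-seqs", str(seqs), "--max-model-len", str(ctx),
            "--gpu-memory-utilization", "0.92", "--kv-cache-dtype", "fp8",
            "--enable-chunked-prefill", "--enable-prefix-caching"]

print(" ".join(tuning_flags("8B")))
```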
Post-deploy verification and monitoring hooks
Verify FP8 is actually engaged. The startup log should contain both quantization: fp8 and KV cache dtype: fp8_e4m3. If you see fp16 for either, your driver is too old or your vLLM build was compiled without the FP8 path. nvidia-smi after warm-up should show roughly 22.0 GB used out of 24.5 GB, GPU utilisation 95-99% during decode bursts, and 70-78 degrees C steady temperature.
vLLM exposes Prometheus metrics on /metrics. The four to alert on:
| Metric | Alert threshold | Meaning |
|---|---|---|
| vllm:gpu_cache_usage_perc | > 90 sustained 60 s | KV thrash imminent; lower --max-num-seqs |
| vllm:num_requests_waiting | > 4 sustained 30 s | Capacity bottleneck; scale out |
| vllm:time_to_first_token_seconds (p95) | > 1.0 s | Prefill saturated; chunked prefill or trim prompts |
| vllm:time_per_output_token_seconds (p95) | > 0.012 s | Decode below 80 t/s; thermal or power throttling |
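The four rules can be sanity-checked against a single scrape with a few lines of Python. This sketch omits the sustained-duration logic (in production, express that with `for:` clauses in Prometheus alert rules):

```python
# Evaluate the four alert rules above against one metrics sample (ours, illustrative).
THRESHOLDS = {
    "vllm:gpu_cache_usage_perc": 90.0,
    "vllm:num_requests_waiting": 4.0,
    "vllm:time_to_first_token_seconds": 1.0,      # p95
    "vllm:time_per_output_token_seconds": 0.012,  # p95
}

def triggered(sample: dict) -> list[str]:
    """Return the metric names whose sampled value exceeds its threshold."""
    return [m for m, limit in THRESHOLDS.items() if sample.get(m, 0.0) > limit]

print(triggered({"vllm:gpu_cache_usage_perc": 93.2,
                 "vllm:time_per_output_token_seconds": 0.009}))
# -> ['vllm:gpu_cache_usage_perc']
```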
Expected steady-state numbers at this configuration: ~195 t/s decode at batch 1, ~620 t/s at batch 8, ~1,100 t/s aggregate at batch 32. Time-to-first-token at 4k context: ~210 ms; at 32k context: ~1.2 s. Cross-reference the Llama 8B benchmark, the prefill/decode benchmark and the concurrent users page. Run a quick load test:
git clone --depth 1 https://github.com/vllm-project/vllm vllm-src
python vllm-src/benchmarks/benchmark_serving.py \
--backend vllm --base-url http://localhost:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --num-prompts 200 --request-rate 8
The serving benchmark ships in the vLLM repository rather than in the wheel, hence the clone; --dataset-name random generates synthetic prompts so no ShareGPT dump is needed.
Troubleshooting matrix and verdict
| Symptom | Cause | Fix |
|---|---|---|
| OOM at startup | gpu-memory-utilization too high or stale CUDA process | nvidia-smi, kill stragglers; drop to 0.90 |
| “unsupported quantization fp8” | vLLM < 0.5.4 | Pin vllm==0.6.3 |
| FP8 KV silently disabled | Driver < 550 | Upgrade driver, reboot, recheck startup log |
| Slow prefill on long prompts | chunked-prefill off | Add --enable-chunked-prefill |
| HF auth fails | Token missing or licence not accepted | Set HF_TOKEN; accept licence on huggingface.co |
| Garbled output | Wrong chat template | Check tokenizer_config.json has chat_template set |
| systemd kills startup at 90s | Default TimeoutStartSec | Raise to TimeoutStartSec=300 |
| p99 latency spike at 30 minutes | Thermal throttle | Cap power at 400 W, see thermal performance |
Verdict. A single-card vLLM deployment on the RTX 4090 24GB is the right starting posture for any team self-hosting an open-weight LLM in 2026. The setup above takes 25 minutes attended, costs nothing beyond the server rental, and delivers production-grade throughput, observability and restart semantics. The first upgrade is to add a second card via multi-card pairing when concurrent load exceeds ~30 active sessions, or to step up to the 5090 32GB when 14B at full context becomes the new baseline. For specific deployment patterns next, see 70B INT4 deployment, the AWQ guide, the LoRA fine-tune guide and the first day checklist.
See also: FP8 Llama deployment, AWQ guide, 70B INT4 deployment, LoRA tutorial, first day checklist, spec breakdown, monthly hosting cost.