RTX 4090 24GB Full vLLM Setup From Fresh Ubuntu to Production Endpoint

Step-by-step driver, CUDA, Python and vLLM install with WHY behind every flag, plus systemd, monitoring and post-deploy verification on the RTX 4090 24GB.

This tutorial takes a freshly provisioned RTX 4090 24GB dedicated server on Ubuntu 22.04 LTS to a production-ready vLLM endpoint serving Llama 3.1 8B at FP8 in roughly 25 minutes of attended time. Every command has been validated on UK GigaGPU stock; every flag has a stated reason for existing rather than being copied from a forum post. The end state is a systemd-managed service with Prometheus metrics, restart-on-failure, the Hugging Face fast downloader, FP8 weights, FP8 KV cache, prefix caching and chunked prefill — the full production posture rather than a “hello world” demo. For the wider hardware menu see dedicated GPU hosting.

Contents

Prerequisites and base OS sanity
NVIDIA driver, CUDA toolkit and power policy
Python environment and dependency hygiene
Installing vLLM and launching Llama 3.1 8B FP8
systemd unit for production restart semantics
Tuning flags table with the WHY of each
Post-deploy verification and monitoring hooks
Troubleshooting matrix and verdict

Prerequisites and base OS sanity

You should have SSH access as a user with sudo, an HF token from huggingface.co with the Llama 3.1 community licence accepted, and at least 80 GB of free disk on the system volume (model weights, vLLM JIT cache, Hugging Face cache). Confirm the GPU is visible to the kernel:

lspci | grep -i nvidia
uname -r
cat /etc/os-release

Why each line. lspci proves the card is enumerated on the PCIe bus before any driver work; if you see no NVIDIA line here the issue is BIOS, slot or cabling, not software. Expected output contains NVIDIA Corporation AD102 [GeForce RTX 4090] on a PCIe Gen 4 x16 slot — the spec is documented in our spec breakdown and PCIe Gen 4 x16 piece. uname -r returns the kernel version; you want 5.15+ or 6.x for stable driver 550 binding. cat /etc/os-release confirms Ubuntu 22.04 LTS — the recommended baseline for current vLLM. Update the base packages and install the build prerequisites:

sudo apt update && sudo apt -y upgrade
sudo apt -y install build-essential git curl ca-certificates gnupg lsb-release pkg-config

Reboot if the kernel was updated. build-essential is needed because some vLLM transitive dependencies (notably flash-attn) compile from source on first install. pkg-config is needed for some bindings. Skipping these will produce an obscure linker error 12 minutes into your pip install.
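
Before moving on, confirm the disk headroom called out in the prerequisites is actually there; a quick check on the system volume:

# ~80 GB free is the working minimum: model weights, the Hugging Face cache and vLLM's JIT cache all land here
df -h /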

NVIDIA driver, CUDA toolkit and power policy

Two driver families currently support the 4090’s full feature set: the 550-series (stable, recommended) and the 555-series (newer, occasional regressions on unrelated workloads). FP8 native paths and FP8 KV cache require 550 or above, full stop. Anything earlier loads the model but silently falls back to FP16 KV, which will OOM mid-decode on long contexts.
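
If the server arrived pre-imaged with an NVIDIA driver, check whether it already clears the 550 floor before reinstalling anything; a quick query, assuming nvidia-smi is present:

# prints only the driver version, e.g. 550.90.07; anything below 550 needs the upgrade below
nvidia-smi --query-gpu=driver_version --format=csv,noheader

The fresh-install path from NVIDIA's repository: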

distro=ubuntu2204
arch=x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-drivers-550 cuda-toolkit-12-4
sudo reboot

Why the keyring deb rather than the runfile installer: the apt path keeps the driver in sync with kernel updates via DKMS, so when Ubuntu auto-updates the kernel your GPU does not vanish on next boot. CUDA 12.4 is the toolkit that ships the FP8 GEMM headers and matches the vLLM 0.6.x prebuilt wheels exactly. CUDA 12.5+ also works but requires building vLLM from source against your toolkit. After reboot, verify:

nvidia-smi
nvcc --version
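
nvcc --version can come back "command not found" even after a clean install, because the toolkit deb places everything under /usr/local/cuda-12.4 without touching your shell PATH; a minimal fix, assuming the default install prefix:

# expose the CUDA 12.4 toolchain to the current shell and to future logins
export PATH=/usr/local/cuda-12.4/bin:$PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc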

Expected: an RTX 4090 line, 24,564 MiB total memory, driver 550.x or higher, CUDA Version: 12.4. Set the persistence daemon and pin the power limit to 400 W:

sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 400

-pm 1 keeps the driver loaded between processes — without it, the first inference request after idle pays a 2-3 second cold-start penalty as the driver re-initialises. -pl 400 trims the default 450 W power limit to 400 W, costing roughly 3-4% throughput in exchange for noticeably steadier latency under sustained load and a 10-12% reduction in the electricity bill. Detail at power draw and efficiency and tokens per watt.
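
Neither setting survives a reboot; persistence mode and the power cap are runtime driver state. One way to re-apply them at boot, sketched here as a small oneshot unit (the nvidia-power.service name is our own choice, not something the driver provides):

sudo tee /etc/systemd/system/nvidia-power.service >/dev/null <<'EOF'
[Unit]
Description=Reapply NVIDIA persistence mode and 400 W power cap
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 400

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable nvidia-power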

Python environment and dependency hygiene

vLLM 0.6.x officially supports Python 3.10 to 3.12. We use 3.11 because it has the best wheel coverage across the dependency graph and is the version we have hammered the most in production. Never install vLLM into the system Python; doing so guarantees an upgrade collision the first time apt touches python3.

sudo apt -y install python3.11 python3.11-venv python3.11-dev
python3.11 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip wheel

The python3.11-dev package is required because flash-attn compiles a CUDA extension against Python.h. Upgrading pip and wheel first means binary wheels are preferred over source builds for the rest of the install — the difference is roughly 8 minutes of saved compile time.
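
A ten-second sanity check that the virtual environment is the interpreter actually in use before anything heavier goes in:

# both should point into ~/vllm-env and report Python 3.11.x
which python
python --version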

Installing vLLM and launching Llama 3.1 8B FP8

pip install vllm==0.6.3
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=hf_yourtoken

Pinning vLLM to 0.6.3 rather than floating to latest gives reproducible behaviour. hf_transfer activates the parallel multipart Rust downloader; on a 1 Gbps link it cuts an 8 GB Llama 3.1 8B download from ~80 seconds to ~25 seconds. The HF_TOKEN must come from an HF account that has accepted the Llama 3.1 community licence, otherwise the download returns 403 with no warning that licensing is the cause.
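
If you would rather separate the download from the first server start (useful later, so the systemd start-up timeout is not spent fetching weights), the model can be pulled ahead of time with the CLI that ships with huggingface_hub; a minimal sketch:

# pre-populate the Hugging Face cache using the same token and fast downloader
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct

Either way, launch the server: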

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 32 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --port 8000

Each flag explained.

--quantization fp8 activates Ada's E4M3 FP8 GEMM path, doubling tensor-core throughput versus FP16 and halving weight memory; the model is FP8-quantised at load time without an offline calibration step (Llama 3.1 ships with the activation statistics needed).
--kv-cache-dtype fp8 halves the KV cache footprint, doubling the concurrent token budget; the quality cost is below 0.05 perplexity. See FP8 tensor cores on Ada and the FP8 deployment guide for the underlying numbers.
--max-model-len 65536 is generous; Llama 3.1 8B supports 131k natively, but 64k is the sweet spot for memory-budget reasons.
--max-num-seqs 32 caps continuous batching at 32 concurrent sequences, sized so aggregate KV demand at realistic context lengths stays inside the 24 GB envelope instead of thrashing the cache.
--enable-chunked-prefill interleaves prefill chunks with decode steps so a long-context request does not stall a fast-reply request; essential whenever your prompt-length distribution is bimodal.
--enable-prefix-caching hashes the token prefix of incoming requests and reuses computed KV blocks on a hit; a RAG system with shared system prompts often sees 30-70% prefill savings.
--gpu-memory-utilization 0.92 tells vLLM to size its KV pool to fill 92% of VRAM, leaving roughly 2 GB for spikes and other CUDA processes.
--port 8000 is the default; expose it through your firewall only behind a reverse proxy with TLS and auth.
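
To see why these numbers hang together, here is back-of-envelope KV arithmetic using the published Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128); the free-memory figures are rough estimates rather than measured values:

KV per token (FP8)  = 2 (K and V) × 32 layers × 8 KV heads × 128 dims × 1 byte ≈ 64 KiB
KV per token (FP16) = the same shape at 2 bytes per element ≈ 128 KiB
KV pool ≈ 0.92 × 24.5 GB − ~8.5 GB FP8 weights − runtime overhead ≈ 12-13 GB
Token budget ≈ 12.5 GiB ÷ 64 KiB ≈ 200,000 cached tokens at FP8, roughly half that at FP16

That budget covers one 64k-context request or a few dozen ordinary chat sessions comfortably, but not 32 simultaneous full-context sequences; that is the scenario chunked prefill and the scheduler's preemption are there to absorb.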

Smoke test from another shell:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
       "messages":[{"role":"user","content":"Say hello in 5 languages."}],
       "max_tokens":120}'

Expected: a JSON response in roughly 700-800 ms wall-clock (including the network round trip on localhost). The usage field will report output tokens at or near the 120-token cap set by max_tokens; at ~195 t/s decode that is roughly 620 ms of pure decode plus ~50 ms of prefill. If the response is empty or contains finish_reason: length with junk content, the chat template did not load — check your tokenizer config and re-run.
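
If the request instead 404s or complains about an unknown model, it is usually a mismatch between the model field and the id the server registered at startup; listing the served models through the standard OpenAI-compatible endpoint settles it:

# the returned id must match the "model" value used in requests exactly
curl -s http://localhost:8000/v1/models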

systemd unit for production restart semantics

Create /etc/systemd/system/vllm.service:

[Unit]
Description=vLLM OpenAI API server (Llama 3.1 8B FP8)
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment=HF_TOKEN=hf_yourtoken
Environment=HF_HUB_ENABLE_HF_TRANSFER=1
Environment=VLLM_LOGGING_LEVEL=INFO
ExecStart=/home/ubuntu/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 65536 --max-num-seqs 32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92 --port 8000
Restart=always
RestartSec=10
LimitNOFILE=1048576
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target

Then reload systemd, enable the service and follow the boot log:

sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -u vllm -f

Why each non-obvious choice. Restart=always with RestartSec=10 handles GPU hangs and OOM kills gracefully — the 10-second back-off prevents tight crash loops while still recovering inside a typical health-check window. LimitNOFILE=1048576 raises the file descriptor ceiling because vLLM under heavy concurrent load opens many sockets and shared-memory descriptors. TimeoutStartSec=300 is critical: the first boot of a new model can JIT-compile Marlin or FP8 kernels for several minutes, and the systemd default of 90 seconds will kill a still-healthy startup. VLLM_LOGGING_LEVEL=INFO ensures the boot log shows the quantisation and KV dtype confirmation lines you need for verification. Tail with journalctl -u vllm -f until you see Application startup complete.
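
For the reverse proxy or load balancer placed in front of this service, the simplest readiness gate is vLLM's health endpoint; a minimal probe, assuming nothing beyond the stock server:

# prints 200 once the engine is up; connection refused or non-200 during model load and JIT warm-up
curl -fsS -o /dev/null -w "%{http_code}\n" http://localhost:8000/health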

Tuning flags table with the WHY of each

| Flag | Default | Suggested for 4090 | Why |
|---|---|---|---|
| --gpu-memory-utilization | 0.90 | 0.92 | Squeezes ~500 MB extra KV cache safely |
| --max-num-seqs | 256 | 32 (8B), 16 (14B), 4 (70B) | Prevents KV thrash; tune per model size |
| --max-model-len | model max | 65536 (8B), 32768 (14B), 16384 (70B) | Bounds per-sequence KV allocation |
| --enable-chunked-prefill | off | on | Smooths long-prompt latency under mixed traffic |
| --enable-prefix-caching | off | on | 30-70% prefill saved on hot prefixes |
| --kv-cache-dtype | auto (= fp16) | fp8 | Halves KV footprint, <0.05 perplexity cost |
| --swap-space | 4 GiB | 8 GiB | Allows larger CPU spillover under bursts |
| --block-size | 16 | 16 | Default is optimal for Ada |
| --disable-log-requests | off | on (after burn-in) | Reduces journal volume in steady state |

Post-deploy verification and monitoring hooks

Verify FP8 is actually engaged. The startup log should contain both quantization: fp8 and KV cache dtype: fp8_e4m3. If you see fp16 for either, your driver is too old or your vLLM build was compiled without the FP8 path. nvidia-smi after warm-up should show roughly 22.0 GB used out of 24.5 GB, GPU utilisation 95-99% during decode bursts, and 70-78 degrees C steady temperature.

vLLM exposes Prometheus metrics on /metrics. The four to alert on:

| Metric | Alert threshold | Meaning |
|---|---|---|
| vllm:gpu_cache_usage_perc | > 90 sustained for 60 s | KV thrash imminent; lower max-num-seqs |
| vllm:num_requests_waiting | > 4 sustained for 30 s | Capacity bottleneck; scale out |
| vllm:time_to_first_token_seconds p95 | > 1.0 s | Prefill saturated; enable chunked prefill or trim prompts |
| vllm:time_per_output_token_seconds p95 | > 0.012 s | Decode below ~80 t/s; thermal or power throttling |
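
Before wiring these into an alerting stack, eyeball the raw endpoint to confirm the metric names match your vLLM version, since they have shifted between releases; for example:

# dump the cache- and queue-related gauges from the running server
curl -s http://localhost:8000/metrics | grep -E "vllm:(gpu_cache_usage_perc|num_requests_waiting)"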

Expected steady-state numbers at this configuration: ~195 t/s decode at batch 1, ~620 t/s at batch 8, ~1,100 t/s aggregate at batch 32. Time-to-first-token at 4k context: ~210 ms; at 32k context: ~1.2 s. Cross-reference the Llama 8B benchmark, the prefill/decode benchmark and the concurrent users page. Run a quick load test:

The benchmark_serving.py script ships in the vLLM source tree rather than inside the installed package, so grab the matching tag first:

git clone --depth 1 --branch v0.6.3 https://github.com/vllm-project/vllm.git ~/vllm-src
python ~/vllm-src/benchmarks/benchmark_serving.py \
  --backend vllm --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --request-rate 8

Troubleshooting matrix and verdict

| Symptom | Cause | Fix |
|---|---|---|
| OOM at startup | gpu-memory-utilization too high, or a stale CUDA process | Check nvidia-smi, kill stragglers; drop to 0.90 |
| "unsupported quantization fp8" | vLLM < 0.5.4 | Pin vllm==0.6.3 |
| FP8 KV silently disabled | Driver < 550 | Upgrade driver, reboot, recheck the startup log |
| Slow prefill on long prompts | Chunked prefill off | Add --enable-chunked-prefill |
| HF auth fails | Token missing or licence not accepted | Set HF_TOKEN; accept the licence on huggingface.co |
| Garbled output | Wrong chat template | Check tokenizer_config.json has chat_template set |
| systemd kills startup at 90 s | Default TimeoutStartSec | Raise to TimeoutStartSec=300 |
| p99 latency spike after 30 minutes | Thermal throttle | Cap power at 400 W; see thermal performance |

Verdict. A single-card vLLM deployment on the RTX 4090 24GB is the right starting posture for any team self-hosting an open-weight LLM in 2026. The setup above takes 25 minutes attended, costs nothing beyond the server rental, and delivers production-grade throughput, observability and restart semantics. The first upgrade is to add a second card via multi-card pairing when concurrent load exceeds ~30 active sessions, or to step up to the 5090 32GB when 14B at full context becomes the new baseline. For specific deployment patterns next, see 70B INT4 deployment, the AWQ guide, the LoRA fine-tune guide and the first day checklist.

Bring up vLLM in 25 minutes

Fresh Ubuntu to production endpoint with FP8, prefix caching and systemd restart semantics. UK dedicated hosting.

Order the RTX 4090 24GB

See also: FP8 Llama deployment, AWQ guide, 70B INT4 deployment, LoRA tutorial, first day checklist, spec breakdown, monthly hosting cost.
