Tutorials

vLLM on ROCm: Setup Guide for AMD GPUs (MI300X, RX 7900 XTX)

End-to-end guide to running vLLM on AMD ROCm GPUs, with install steps, model compatibility and CUDA performance comparisons.

vLLM is no longer a CUDA-only project. Since the ROCm backend matured in 2024 and Triton-ROCm caught up in 2025, AMD GPUs from the MI250 through the MI300X and the consumer RX 7900 XTX and RX 9070 XT can run vLLM with respectable throughput. This tutorial covers the install, the vLLM ROCm build, which models actually work, and how performance compares to equivalent NVIDIA silicon. If you are weighing AMD hardware for your next workload, our dedicated GPU hosting fleet includes both ROCm and CUDA-ready boxes.

ROCm stack overview

ROCm is AMD’s equivalent of the CUDA toolkit: it ships HIP (the CUDA-like runtime), rocBLAS, MIOpen, Composable Kernel (CK), and hipBLASLt. vLLM on ROCm uses CK attention kernels on CDNA3 (MI300) and Triton-based attention on RDNA3/4 (RX 7900 XTX, RX 9070 XT). Support levels differ sharply between data-centre and consumer parts.

| GPU | Arch | VRAM | Bandwidth | vLLM status |
| --- | --- | --- | --- | --- |
| MI300X | CDNA3 | 192 GB HBM3 | 5.3 TB/s | Production |
| MI250X | CDNA2 | 128 GB HBM2e | 3.2 TB/s | Production |
| RX 7900 XTX | RDNA3 | 24 GB | 960 GB/s | Supported |
| RX 9070 XT | RDNA4 | 16 GB | 640 GB/s | Supported (Triton) |

Installing ROCm 6.3 on Ubuntu 22.04

Pick ROCm 6.3 or newer. Earlier versions lack FP8 support on MI300 and have broken paged-attention kernels on RDNA3. The install sequence is three steps: add the repo, install the meta-package, then reboot so the amdgpu kernel module loads cleanly.

wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk
sudo usermod -a -G render,video $USER
sudo reboot

Verify with rocminfo and rocm-smi. On a fresh MI300X you should see 192 GB of HBM and 304 compute units. On RDNA consumer parts you will need HSA_OVERRIDE_GFX_VERSION=11.0.0 for RX 7900 XTX or 12.0.0 for RX 9070 XT.
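A quick sanity check after the reboot, assuming the default /opt/rocm install prefix:

```shell
# List GPU agents and their gfx targets (an MI300X reports gfx942).
/opt/rocm/bin/rocminfo | grep -i gfx

# Confirm the expected VRAM is visible to the driver.
/opt/rocm/bin/rocm-smi --showmeminfo vram

# On RDNA consumer parts, export the override before launching vLLM:
export HSA_OVERRIDE_GFX_VERSION=11.0.0   # RX 7900 XTX (gfx1100)
```

If rocminfo shows no GPU agents, check that your user picked up the render and video group membership (log out and back in if needed).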

Building vLLM for ROCm

vLLM ships a dedicated ROCm Dockerfile (Dockerfile.rocm). The cleanest path is to use it rather than fighting pip dependencies. Clone the repo, build the image, then run with GPU passthrough.

git clone https://github.com/vllm-project/vllm
cd vllm
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
docker run -it --rm --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size=16g \
  vllm-rocm python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 1

On a single MI300X, Llama 3.1 70B fits without sharding thanks to the 192 GB of HBM. On the 24 GB RX 7900 XTX you are limited to 7B-8B models in FP16; with INT4 quantisation, models up to roughly the 32B class fit, though KV-cache headroom gets tight.
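The capacity arithmetic behind those claims can be sanity-checked with a weights-plus-overhead estimate. This is a back-of-the-envelope sketch, not vLLM's actual accounting (vLLM pre-reserves KV-cache space via its gpu_memory_utilization setting):

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float,
                    overhead: float = 1.2) -> float:
    """Rough VRAM to serve a model: weights plus ~20% headroom for
    activations and a minimal KV cache. A rule of thumb, not vLLM's math."""
    return params_billion * bytes_per_param * overhead

# Llama 3.1 70B in FP16: ~140 GB of weights alone, ~168 GB with headroom,
# which is why it fits a single 192 GB MI300X without tensor parallelism.
print(f"70B FP16: ~{weights_vram_gb(70, 2.0):.0f} GB")

# Llama 3.1 8B in FP16 comfortably fits a 24 GB RX 7900 XTX.
print(f"8B FP16:  ~{weights_vram_gb(8, 2.0):.1f} GB")
```

The same arithmetic shows why 70B is out of reach for 24 GB cards even quantised: at 4 bits the weights alone are roughly 35 GB.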

Model compatibility matrix

| Model | MI300X FP16 | RX 7900 XTX 24GB | RX 9070 XT 16GB |
| --- | --- | --- | --- |
| Llama 3.1 8B | Yes | Yes | Yes (FP8) |
| Llama 3.1 70B | Yes | No (INT4 is still ~35 GB) | No |
| Mixtral 8x7B | Yes | INT4 only | No |
| Qwen2.5 32B | Yes | INT4 only | INT4 (tight) |
| DeepSeek-V3 671B | 8x MI300X | No | No |

ROCm vs CUDA throughput

On equivalent batch sizes, MI300X is faster than a single H100 80GB on Llama 70B thanks to the HBM capacity (no tensor-parallel overhead). On small consumer parts the story flips: a 24 GB RX 7900 XTX is roughly 15-20% behind a 24 GB RTX 3090 on Llama 3.1 8B, mostly because of less mature paged-attention kernels. The RTX 5090 is materially ahead of any RDNA part for LLM inference; compare it directly in our 3090 vs 5090 guide.
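A roofline estimate makes the consumer-card comparison concrete. Single-stream decode is memory-bound (every generated token streams all the weights from VRAM once), so throughput is capped at bandwidth divided by model size; real numbers land well below this ceiling. Bandwidth figures are the cards' spec-sheet values:

```python
def peak_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Roofline upper bound on single-stream decode throughput:
    memory bandwidth / bytes of weights read per token."""
    return bandwidth_gb_s / weights_gb

llama8b_fp16_gb = 16  # ~8B params * 2 bytes/param

# RX 7900 XTX: 960 GB/s; RTX 3090: 936 GB/s
print(f"RX 7900 XTX ceiling: {peak_decode_tps(960, llama8b_fp16_gb):.0f} tok/s")
print(f"RTX 3090 ceiling:    {peak_decode_tps(936, llama8b_fp16_gb):.1f} tok/s")
```

The two ceilings are within a few percent of each other, which is why the observed 15-20% real-world gap points at kernel maturity rather than hardware.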

Deploy vLLM on an AMD or NVIDIA GPU

ROCm-ready MI300X and RX 7900 XTX, plus Blackwell NVIDIA cards. UK dedicated hosting.

Browse GPU Servers

Known gotchas

  • Flash-Attention on RDNA is Triton-based and slower than CK; benchmark before committing.
  • AWQ quantisation works; GPTQ is hit-and-miss on ROCm 6.3.
  • RCCL (AMD's NCCL equivalent) needs Docker's --network=host flag for multi-GPU runs.
  • HIP graphs are still flaky on some RDNA SKUs; disable them with --enforce-eager if you see hangs.
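Putting the RDNA workarounds together, a launch on an RX 7900 XTX might look like the following (the model name is an example; flags otherwise match the earlier docker run):

```shell
docker run -it --rm --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --network=host --shm-size=16g \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  vllm-rocm python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager
```

Drop --enforce-eager once you have confirmed your SKU runs HIP graphs without hangs; eager mode costs some throughput.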

See also: ROCm vs CUDA in production, vLLM on RTX 5060 Ti, FP8 Llama deployment, 5060 Ti vs 3090, tokens per watt.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
