Tutorials

vLLM on ROCm: Setup Guide for AMD GPUs (MI300X, RX 7900 XTX)

End-to-end guide to running vLLM on AMD ROCm GPUs, with install steps, model compatibility and CUDA performance comparisons.

vLLM is no longer a CUDA-only project. Since the ROCm backend matured in 2024 and Triton-ROCm caught up in 2025, AMD GPUs from the MI250 through the MI300X and the consumer RX 7900 XTX and RX 9070 XT can run vLLM with respectable throughput. This tutorial covers the install, the vLLM ROCm build, which models actually work, and how performance compares to equivalent NVIDIA silicon. If you are weighing AMD hardware for your next workload, our dedicated GPU hosting fleet includes both ROCm and CUDA-ready boxes.

ROCm stack overview

ROCm is AMD’s equivalent of the CUDA toolkit: it ships HIP (the CUDA-like runtime), rocBLAS, MIOpen, Composable Kernel (CK), and hipBLASLt. vLLM on ROCm uses CK attention kernels on CDNA3 (MI300) and Triton-based attention on RDNA3/4 (RX 7900 XTX, RX 9070 XT). Support levels differ sharply between data-centre and consumer parts.

| GPU | Arch | VRAM | Bandwidth | vLLM status |
| --- | --- | --- | --- | --- |
| MI300X | CDNA3 | 192 GB HBM3 | 5.3 TB/s | Production |
| MI250X | CDNA2 | 128 GB HBM2e | 3.2 TB/s | Production |
| RX 7900 XTX | RDNA3 | 24 GB | 960 GB/s | Supported |
| RX 9070 XT | RDNA4 | 16 GB | 640 GB/s | Supported (Triton) |

Installing ROCm 6.3 on Ubuntu 22.04

Pick ROCm 6.3 or newer. Earlier versions lack FP8 support on MI300 and have broken paged-attention kernels on RDNA3. The install sequence is three steps: add the repo, install the meta-package, then reboot so the amdgpu kernel module loads cleanly.

wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk
sudo usermod -a -G render,video $USER
sudo reboot

Verify with rocminfo and rocm-smi. On a fresh MI300X you should see 192 GB of HBM and 304 compute units. On RDNA consumer parts you will need HSA_OVERRIDE_GFX_VERSION=11.0.0 for RX 7900 XTX or 12.0.0 for RX 9070 XT.
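A quick sanity check after the reboot, assuming the default /opt/rocm install prefix:

```shell
# List GPU agents and their gfx targets (an MI300X reports gfx942).
/opt/rocm/bin/rocminfo | grep -i gfx

# Confirm the expected VRAM is visible to the driver.
/opt/rocm/bin/rocm-smi --showmeminfo vram

# On RDNA consumer parts, export the override before launching vLLM:
export HSA_OVERRIDE_GFX_VERSION=11.0.0   # RX 7900 XTX (gfx1100)
```

If rocminfo shows no GPU agents, check that your user picked up the render and video group membership (log out and back in if needed).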

Building vLLM for ROCm

vLLM ships a dedicated ROCm Dockerfile (Dockerfile.rocm). The cleanest path is to use it rather than fighting pip dependencies. Clone the repo, build the image, then run with GPU passthrough.

git clone https://github.com/vllm-project/vllm
cd vllm
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
docker run -it --rm --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size=16g \
  vllm-rocm python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 1

On a single MI300X, Llama 3.1 70B fits without sharding thanks to the 192 GB of HBM. On the 24 GB RX 7900 XTX you are limited to 7B-8B models in FP16; with INT4 quantisation, models up to roughly the 32B class fit, though KV-cache headroom gets tight.
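The capacity arithmetic behind those claims can be sanity-checked with a weights-plus-overhead estimate. This is a back-of-the-envelope sketch, not vLLM's actual accounting (vLLM pre-reserves KV-cache space via its gpu_memory_utilization setting):

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float,
                    overhead: float = 1.2) -> float:
    """Rough VRAM to serve a model: weights plus ~20% headroom for
    activations and a minimal KV cache. A rule of thumb, not vLLM's math."""
    return params_billion * bytes_per_param * overhead

# Llama 3.1 70B in FP16: ~140 GB of weights alone, ~168 GB with headroom,
# which is why it fits a single 192 GB MI300X without tensor parallelism.
print(f"70B FP16: ~{weights_vram_gb(70, 2.0):.0f} GB")

# Llama 3.1 8B in FP16 comfortably fits a 24 GB RX 7900 XTX.
print(f"8B FP16:  ~{weights_vram_gb(8, 2.0):.1f} GB")
```

The same arithmetic shows why 70B is out of reach for 24 GB cards even quantised: at 4 bits the weights alone are roughly 35 GB.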

Model compatibility matrix

| Model | MI300X FP16 | RX 7900 XTX 24GB | RX 9070 XT 16GB |
| --- | --- | --- | --- |
| Llama 3.1 8B | Yes | Yes | Yes (FP8) |
| Llama 3.1 70B | Yes | No (INT4 is still ~35 GB) | No |
| Mixtral 8x7B | Yes | INT4 only | No |
| Qwen2.5 32B | Yes | INT4 only | INT4 (tight) |
| DeepSeek-V3 671B | 8x MI300X | No | No |

ROCm vs CUDA throughput

On equivalent batch sizes, MI300X is faster than a single H100 80GB on Llama 70B thanks to the HBM capacity (no tensor-parallel overhead). On small consumer parts the story flips: a 24 GB RX 7900 XTX is roughly 15-20% behind a 24 GB RTX 3090 on Llama 3.1 8B, mostly because of less mature paged-attention kernels. The RTX 5090 is materially ahead of any RDNA part for LLM inference; compare it directly in our 3090 vs 5090 guide.
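A roofline estimate makes the consumer-card comparison concrete. Single-stream decode is memory-bound (every generated token streams all the weights from VRAM once), so throughput is capped at bandwidth divided by model size; real numbers land well below this ceiling. Bandwidth figures are the cards' spec-sheet values:

```python
def peak_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Roofline upper bound on single-stream decode throughput:
    memory bandwidth / bytes of weights read per token."""
    return bandwidth_gb_s / weights_gb

llama8b_fp16_gb = 16  # ~8B params * 2 bytes/param

# RX 7900 XTX: 960 GB/s; RTX 3090: 936 GB/s
print(f"RX 7900 XTX ceiling: {peak_decode_tps(960, llama8b_fp16_gb):.0f} tok/s")
print(f"RTX 3090 ceiling:    {peak_decode_tps(936, llama8b_fp16_gb):.1f} tok/s")
```

The two ceilings are within a few percent of each other, which is why the observed 15-20% real-world gap points at kernel maturity rather than hardware.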

Deploy vLLM on an AMD or NVIDIA GPU

ROCm-ready MI300X and RX 7900 XTX, plus Blackwell NVIDIA cards. UK dedicated hosting.

Browse GPU Servers

Known gotchas

  • Flash-Attention on RDNA is Triton-based and slower than CK; benchmark before committing.
  • AWQ quantisation works; GPTQ is hit-and-miss on ROCm 6.3.
  • RCCL (AMD's NCCL equivalent) needs Docker's --network=host flag for multi-GPU runs.
  • HIP graphs are still flaky on some RDNA SKUs; disable them with --enforce-eager if you see hangs.
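Putting the RDNA workarounds together, a launch on an RX 7900 XTX might look like the following (the model name is an example; flags otherwise match the earlier docker run):

```shell
docker run -it --rm --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --network=host --shm-size=16g \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  vllm-rocm python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager
```

Drop --enforce-eager once you have confirmed your SKU runs HIP graphs without hangs; eager mode costs some throughput.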

See also: ROCm vs CUDA in production, vLLM on RTX 5060 Ti, FP8 Llama deployment, 5060 Ti vs 3090, tokens per watt.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
