vLLM is no longer a CUDA-only project. Since the ROCm backend matured in 2024 and Triton-ROCm caught up in 2025, AMD GPUs from the MI250 through the MI300X and the consumer RX 7900 XTX and RX 9070 XT can run vLLM with respectable throughput. This tutorial covers the install, the vLLM ROCm build, which models actually work, and how performance compares to equivalent NVIDIA silicon. If you are weighing AMD hardware for your next workload, our dedicated GPU hosting fleet includes both ROCm and CUDA-ready boxes.
Contents
- ROCm stack overview
- Installing ROCm 6.3 on Ubuntu 22.04
- Building vLLM for ROCm
- Model compatibility matrix
- ROCm vs CUDA throughput
- Known gotchas
ROCm stack overview
ROCm is AMD’s equivalent of the CUDA toolkit: it ships HIP (the CUDA-like runtime), rocBLAS, MIOpen, Composable Kernel (CK), and hipBLASLt. vLLM on ROCm uses CK attention kernels on CDNA3 (MI300) and Triton-based attention on RDNA3/4 (RX 7900 XTX, RX 9070 XT). Support levels differ sharply between data-centre and consumer parts.
| GPU | Arch | VRAM | Bandwidth | vLLM status |
|---|---|---|---|---|
| MI300X | CDNA3 | 192 GB HBM3 | 5.3 TB/s | Production |
| MI250X | CDNA2 | 128 GB HBM2e | 3.2 TB/s | Production |
| RX 7900 XTX | RDNA3 | 24 GB | 960 GB/s | Supported |
| RX 9070 XT | RDNA4 | 16 GB | 640 GB/s | Supported (Triton) |
Installing ROCm 6.3 on Ubuntu 22.04
Pick ROCm 6.3 or newer. Earlier versions lack FP8 support on MI300 and have broken paged-attention kernels on RDNA3. The install sequence is three steps: add the repo, install the meta-package, then reboot so the amdgpu kernel module loads cleanly.
```bash
# Fetch AMD's installer helper package (ROCm 6.3, Ubuntu 22.04 "jammy")
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install_6.3.60300-1_all.deb
# Install the ROCm runtime plus the HIP SDK
sudo amdgpu-install --usecase=rocm,hiplibsdk
# Grant the current user access to the GPU device nodes
sudo usermod -a -G render,video $USER
# Reboot so the amdgpu kernel module loads cleanly
sudo reboot
```
Verify with rocminfo and rocm-smi. On a fresh MI300X you should see 192 GB of HBM and 304 compute units. On RDNA consumer parts you will need HSA_OVERRIDE_GFX_VERSION=11.0.0 for RX 7900 XTX or 12.0.0 for RX 9070 XT.
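To keep the override values straight, here is a minimal sketch of a lookup helper. The helper name `hsa_override` and the approach are ours, not part of ROCm or vLLM; the mapping only covers the two consumer cards quoted above, and data-centre parts need no override at all.

```python
# Override values for the consumer cards covered in this guide.
# Mapping mirrors the versions quoted above; anything else returns None.
OVERRIDES = {
    "RX 7900 XTX": "11.0.0",  # RDNA3 (gfx11)
    "RX 9070 XT": "12.0.0",   # RDNA4 (gfx12)
}


def hsa_override(gpu_name: str):
    """Return the HSA_OVERRIDE_GFX_VERSION string for a known consumer
    GPU name, or None for parts that do not need the override."""
    for name, version in OVERRIDES.items():
        if name in gpu_name:
            return version
    return None  # e.g. MI250X / MI300X run without an override
```

Export the result before launching vLLM, e.g. `export HSA_OVERRIDE_GFX_VERSION=11.0.0` on an RX 7900 XTX.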
Building vLLM for ROCm
vLLM ships a dedicated ROCm Dockerfile (Dockerfile.rocm). The cleanest path is to use it rather than fighting pip dependencies. Clone the repo, build the image, then run with GPU passthrough.
```bash
git clone https://github.com/vllm-project/vllm
cd vllm
# Build the ROCm image (BuildKit is required by the Dockerfile)
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
# Pass through the KFD compute interface and DRI render nodes,
# then start the OpenAI-compatible API server
docker run -it --rm --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size=16g \
  vllm-rocm python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 1
```
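Once the server is up, it speaks the OpenAI chat-completions protocol on port 8000 (vLLM's default bind). A minimal sketch of the request body, assuming the model name from the launch command above and the default localhost endpoint:

```python
import json

# Chat-completion request for the OpenAI-compatible endpoint.
# The model field must match the --model flag used at launch.
body = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
}
payload = json.dumps(body)

# POST this payload to http://localhost:8000/v1/chat/completions
# with Content-Type: application/json (curl, requests, or the openai client).
```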
On a single MI300X, Llama 3.1 70B fits without sharding thanks to the 192 GB of HBM. On the 24 GB RX 7900 XTX you are restricted to 7B or 8B models in FP16, or 13B in INT4.
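The capacity claims above follow from simple arithmetic: weight footprint is parameter count times bytes per parameter. A back-of-envelope helper (our own sketch, ignoring KV cache and activation overhead, so treat the result as a floor):

```python
def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Rough weight footprint in GiB: parameters (in billions) times
    bytes per parameter. Excludes KV cache and activations."""
    return params_b * 1e9 * bytes_per_param / 2**30

# Llama 3.1 70B in FP16: ~130 GiB of weights -> fits in MI300X's 192 GB
# Llama 3.1 8B in FP16:  ~15 GiB -> fits on a 24 GB RX 7900 XTX with
#                        headroom left for the KV cache
```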
Model compatibility matrix
| Model | MI300X FP16 | RX 7900 XTX | RX 9070 XT 16GB |
|---|---|---|---|
| Llama 3.1 8B | Yes | Yes | Yes (FP8) |
| Llama 3.1 70B | Yes | INT4 only | No |
| Mixtral 8x7B | Yes | INT4 only | No |
| Qwen2.5 32B | Yes | INT4 only | INT4 (tight fit) |
| DeepSeek-V3 671B | 8x MI300X | No | No |
ROCm vs CUDA throughput
At equivalent batch sizes, a single MI300X is faster than a single H100 80GB on Llama 3.1 70B: the full model fits in its 192 GB of HBM, so there is no tensor-parallel communication overhead. On consumer parts the story flips: a 24 GB RX 7900 XTX runs roughly 15-20% behind a 24 GB RTX 3090 on Llama 3.1 8B, mostly because of less mature paged-attention kernels. The RTX 5090 is materially ahead of any RDNA part for LLM inference; compare it directly in our 3090 vs 5090 guide.
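A first-order roofline makes the bandwidth argument concrete: single-stream decode must stream the full weight set from memory for every generated token, so throughput is capped at bandwidth divided by model size. This is a back-of-envelope model of our own, not a benchmark; batching amortises the cost and real kernels fall short of the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_bytes_s: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode throughput: each token reads
    every weight once, so tokens/s <= memory bandwidth / model size."""
    return bandwidth_bytes_s / weight_bytes

# MI300X (5.3 TB/s) on Llama 70B FP16 (~140 GB): ceiling ~38 tok/s
mi300x = decode_ceiling_tok_s(5.3e12, 70e9 * 2)

# RX 7900 XTX (960 GB/s) on Llama 8B FP16 (~16 GB): ceiling ~60 tok/s.
# Its ceiling is similar to a 3090's, so the observed 15-20% gap comes
# from kernel maturity, not raw bandwidth.
rx7900 = decode_ceiling_tok_s(960e9, 8e9 * 2)
```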
Deploy vLLM on an AMD or NVIDIA GPU
ROCm-ready MI300X and RX 7900 XTX, plus Blackwell NVIDIA cards. UK dedicated hosting.
Browse GPU Servers
Known gotchas
- Flash-Attention on RDNA is Triton-based and slower than CK; benchmark before committing.
- AWQ quantisation works; GPTQ is hit-and-miss on ROCm 6.3.
- NCCL's ROCm counterpart (RCCL) needs the --network=host flag in Docker for multi-GPU runs.
- HIP graphs are still flaky on some RDNA SKUs; disable them with --enforce-eager if you see hangs.
See also: ROCm vs CUDA in production, vLLM on RTX 5060 Ti, FP8 Llama deployment, 5060 Ti vs 3090, tokens per watt.