Every 12 months somebody declares ROCm “finally ready”. In 2026 that is, for the first time, partially true: the large-model training and inference story on MI300X is competitive with H100, and consumer RDNA cards can run most mainstream models. But CUDA still wins on breadth, tooling depth and day-one model support. This article is an honest parity check across PyTorch, vLLM, FlashAttention, Triton and price, so you can decide whether to deploy on AMD, NVIDIA, or a mix. If you want to test both, we stock ROCm and CUDA hardware on dedicated GPU hosting.
Contents
- PyTorch feature parity
- Inference stacks: vLLM, SGLang, TGI
- Performance delta
- Price per GB of VRAM
- When ROCm wins and when it loses
- 2026 verdict
PyTorch feature parity
PyTorch 2.6 treats ROCm as a first-class backend, but individual features ship at different rates. The table below reflects the state as of Q2 2026; a quick backend check follows it.
| Feature | CUDA | ROCm 6.3 | Notes |
|---|---|---|---|
| torch.compile | Full | Full | Parity since PT 2.4 |
| FlashAttention-3 | Yes | Partial | CK on CDNA3, Triton on RDNA |
| Triton | Native | Native | ROCm Triton backend stable |
| FP8 (E4M3/E5M2) | Hopper+ | MI300 only | RDNA3 lacks hardware FP8 |
| FSDP2 | Yes | Yes | Via RCCL |
| CUDA Graphs / HIP Graphs | Stable | Mostly stable | Occasional hangs on RDNA |
| bitsandbytes 4-bit | Yes | Beta | Works for inference, training iffy |
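A quick way to confirm which backend a given PyTorch build targets is the torch.version module. The sketch below assumes only a working CUDA or ROCm build of PyTorch (nothing specific to ROCm 6.3) and also exercises torch.compile, which behaves the same on both.

```python
import torch

# ROCm builds of PyTorch expose torch.version.hip; CUDA builds leave it as None.
is_rocm = torch.version.hip is not None
backend = f"ROCm {torch.version.hip}" if is_rocm else f"CUDA {torch.version.cuda}"
print(f"PyTorch {torch.__version__} built for {backend}")
print(f"GPU visible: {torch.cuda.is_available()}")  # True on both stacks

# torch.compile has been at parity since PyTorch 2.4 (see table above).
@torch.compile
def gated_gelu(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x) * x

x = torch.randn(4096, 4096, device="cuda")  # "cuda" also addresses AMD GPUs under ROCm
print(gated_gelu(x).shape)
```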
Inference stacks
vLLM has the best ROCm coverage of the three major inference servers. SGLang added ROCm support in mid-2025 and is now usable. Hugging Face TGI supports ROCm on MI250 and MI300 but not on consumer parts. On NVIDIA, all three run natively on Blackwell out of the box. The serving code itself does not change between stacks; a minimal example follows the table.
| Stack | CUDA | ROCm MI300 | ROCm RDNA |
|---|---|---|---|
| vLLM | Full | Full | Core features |
| SGLang | Full | Full | Beta |
| TGI | Full | Full | No |
| TensorRT-LLM | Full | N/A | N/A |
| llama.cpp | Full | Full | Full |
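As an illustration, the offline vLLM entrypoint below is unchanged between a CUDA install (H100/H200) and a ROCm install (MI300X); the backend is picked when you install the wheel or container, not in the serving code. The model name is just an example.

```python
from vllm import LLM, SamplingParams

# Identical code on a CUDA build of vLLM and a ROCm build.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float16")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarise the trade-off between HBM capacity and bandwidth."], sampling
)
for request in outputs:
    print(request.outputs[0].text)
```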
Performance delta
On Llama 3.1 70B FP16 at batch 32, a single MI300X 192GB delivers around 1.05-1.10x the per-GPU throughput of an H100 80GB deployment, largely because the 70B FP16 weights fit on one MI300X while the H100 needs tensor parallelism across two cards. Against an H200 141GB the MI300X is roughly at parity. On Llama 3.1 8B at batch 16, an RX 7900 XTX lands about 15-20% behind an RTX 3090 and 35% behind an RTX 4090. At the low end the RTX 5060 Ti pulls ahead of any 16 GB RDNA card thanks to FP8 hardware; see the 5060 Ti vLLM setup for numbers.
Price per GB of VRAM
| GPU | VRAM | Street price | £ per GB |
|---|---|---|---|
| MI300X | 192 GB | £13,500 | £70 |
| H100 80GB SXM | 80 GB | £22,000 | £275 |
| H200 141GB | 141 GB | £24,000 | £170 |
| RX 7900 XTX | 24 GB | £780 | £33 |
| RTX 3090 (used) | 24 GB | £720 | £30 |
| RTX 5090 | 32 GB | £2,200 | £69 |
| RTX 6000 Pro 96GB | 96 GB | £8,400 | £88 |
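The £ per GB column is straight division of street price by capacity; a few lines of Python reproduce it. Prices are the illustrative street figures from the table, not quotes.

```python
# Recompute the £/GB column from the street prices quoted above.
gpus = {
    "MI300X":            (192, 13_500),
    "H100 80GB SXM":     (80,  22_000),
    "H200 141GB":        (141, 24_000),
    "RX 7900 XTX":       (24,     780),
    "RTX 3090 (used)":   (24,     720),
    "RTX 5090":          (32,   2_200),
    "RTX 6000 Pro 96GB": (96,   8_400),
}
for name, (vram_gb, price) in gpus.items():
    print(f"{name:<18} £{price / vram_gb:>5.0f} per GB")
```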
When ROCm wins
- Large open-weight models where HBM capacity matters. A single MI300X fits Llama 70B in FP16 with roughly 50 GB of headroom left for KV cache; matching that on NVIDIA takes an H200 or 2x H100 (see the back-of-envelope sketch below).
- Training runs where memory bandwidth and capacity dominate.
- Long-context inference (128k+) where the KV cache dwarfs the weights.
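The capacity argument is plain arithmetic. A rough sketch, assuming FP16 weights and KV cache and the published Llama 3.1 70B shape (80 layers, 8 KV heads under GQA, head dimension 128):

```python
# Approximate memory footprint for Llama 3.1 70B serving (all figures FP16).
PARAMS = 70.6e9                            # parameter count
BYTES = 2                                  # bytes per FP16 value
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128    # published model shape (GQA)

weights_gb = PARAMS * BYTES / 1e9                        # ~141 GB of weights
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K + V, ~0.33 MB per token
kv_128k_gb = kv_per_token * 128_000 / 1e9                # ~42 GB for one 128k sequence

print(f"weights:        {weights_gb:.0f} GB")
print(f"KV cache @128k: {kv_128k_gb:.0f} GB")
# ~141 + ~42 = ~183 GB: fits a single 192 GB MI300X, but not one 80 GB H100.
```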
When CUDA still wins
- Day-one model support: new architectures land on CUDA first.
- Complex multimodal stacks with custom kernels.
- Small-GPU deployments, where consumer NVIDIA has richer driver and FP8 support.
- Windows-based workflows.
- Third-party SaaS and managed products.
Run your workload on ROCm or CUDA
MI300X, RTX 5090, RTX 6000 Pro and more. UK dedicated hosting.
Browse GPU Servers
See also: vLLM on ROCm, 3090 vs 5090, 5060 Ti vs 5080, upgrading to RTX 6000 Pro.