The AMD Radeon RX 9070 XT is RDNA 4's flagship: 16GB GDDR6 at 640 GB/s, 304W TDP, and roughly £550 in the UK in 2026, less than half the price of an RTX 4090 24GB. For gaming the contest is genuinely close; for AI inference on UK GPU hosting the gap is wider, and it is entirely about software. ROCm has improved dramatically since the MI300X launch, but vLLM, FlashInfer, AWQ-Marlin, FP8 Transformer Engine and the wider CUDA ecosystem remain decisively NVIDIA-first. This post quantifies the gap and tells you when AMD's price advantage actually pays back.
Contents
- Spec sheet side by side
- CUDA vs ROCm — the real bottleneck
- FP8, AWQ and the quantisation kernel gap
- Throughput across eight workloads
- Power, price and value
- Per-workload winner table
- Production gotchas with AMD
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | RX 9070 XT (RDNA 4) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC N4P | Similar |
| Compute units | 128 SMs | 64 CUs (4096 stream procs) | Hard to compare |
| Tensor / matrix throughput | 660 TFLOPS FP8 (dense) | ~390 TFLOPS FP8 (dense) | 1.7x NVIDIA |
| VRAM | 24 GB GDDR6X (21 Gbps) | 16 GB GDDR6 (20 Gbps) | +50% NVIDIA |
| Memory bandwidth | 1008 GB/s | 640 GB/s | +58% NVIDIA |
| Memory bus | 384-bit | 256-bit | +50% NVIDIA |
| L2 / Infinity cache | 72 MB L2 | 64 MB Infinity Cache | Similar |
| FP8 native | E4M3 + E5M2 (Transformer Engine) | E4M3 + E5M2 (newer) | NVIDIA more mature |
| TDP | 450W | 304W | AMD draws 32% less |
| PCIe | Gen 4 x16 | Gen 5 x16 | Effectively even |
| vLLM support | Native, day-1 | Via ROCm fork, lagging | NVIDIA decisive |
On silicon the 9070 XT is competitive. On software it is not. RDNA 4 added FP8 matrix instructions, but the kernel ecosystem to exploit them is still catching up. ROCm 6.3+ supports vLLM, but the ROCm build ships several months behind upstream features, and AWQ-Marlin, FlashInfer paged attention and the latest GPTQ kernels lag or are missing entirely.
CUDA vs ROCm — the real bottleneck
NVIDIA’s CUDA ecosystem is the moat. vLLM ships first on CUDA. FlashAttention-3 ships first on CUDA. Diffusers’ optimal paths assume CUDA. AWQ Marlin, GPTQ Marlin, EXL2 and the bleeding edge of LLM-serving kernels all start as CUDA implementations. ROCm catches up — sometimes within weeks, sometimes within months — but the gap means a production deployment on AMD lives 3-6 months behind NVIDIA on average. For a research lab that can tolerate this, the price advantage is real. For a production inference service that must ship today, the lag costs more than the hardware saves.
RDNA 4 is the first AMD consumer architecture with serious AI matrix instructions. The MI300X has had them for two years, but the consumer line lagged; the 9070 XT closes that gap on paper. Whether it closes the gap in practice depends on how quickly hipBLASLt, MIOpen and the AMD-side kernel libraries catch up to cuBLAS, cuDNN and the CUDA Math Libraries.
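Much of the day-to-day difference is invisible at the Python level: PyTorch's ROCm builds expose HIP devices through the same `torch.cuda` namespace, so most application code runs unmodified and only the kernels underneath differ. A quick probe to confirm which stack a given environment actually targets, using only standard PyTorch attributes:

```python
# Probe which accelerator stack this PyTorch build targets.
# On ROCm wheels torch.version.hip is set and torch.version.cuda is None;
# on CUDA wheels it is the reverse. torch.cuda.* works on both, because
# ROCm builds route HIP devices through the torch.cuda namespace.
import torch

def backend() -> str:
    if getattr(torch.version, "hip", None):
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"

print(backend())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```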
FP8, AWQ and the quantisation kernel gap
FP8 inference quality on RDNA 4 is comparable to Ada when both run identical models; the throughput is lower because the kernels are less mature. Llama 3.1 8B FP8 on the 9070 XT through ROCm vLLM hits ~135 t/s batch-1 decode, versus the 4090's 198 t/s, a 32% deficit despite RDNA 4's modern matrix instructions. AWQ INT4 is worse: an equivalent of Marlin is in development, but the production path on AMD is still through GGUF or naïve INT4 kernels, costing another 20-30%.
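For concreteness, this is roughly how the FP8 path above is exercised through vLLM's offline API. The `quantization="fp8"` flag is vLLM's standard option for on-the-fly FP8 weight quantisation; the checkpoint name and sampling settings are illustrative:

```python
# Sketch: serving Llama 3.1 8B in FP8 via vLLM's offline API.
# The same script runs on CUDA and ROCm builds of vLLM; the FP8 GEMM
# kernels underneath differ, which is where the ~32% decode gap comes from.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    quantization="fp8",             # on-the-fly FP8 weight quantisation
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in two sentences."], params)
print(outputs[0].outputs[0].text)
```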
Throughput across eight workloads
| Workload | RTX 4090 | RX 9070 XT | 4090 / 9070 XT |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | 135 t/s | 1.47x |
| Llama 3.1 8B FP8 batch 16 agg | 880 t/s | 520 t/s | 1.69x |
| Mistral 7B FP8 decode b1 | 215 t/s | 150 t/s | 1.43x |
| Qwen 2.5 14B AWQ decode b1 | 135 t/s | ~80 t/s (GGUF Q4) | 1.69x |
| Qwen 2.5 32B AWQ | 65 t/s | OOM | 4090 only |
| SDXL 1024×1024 30-step | 2.0s | 3.2s | 1.60x |
| FLUX.1-dev FP8 30-step | 4.1s | ~7.5s (FP16, OOM risk) | 1.83x |
| Whisper large-v3-turbo INT8 | 80x RT | ~50x RT | 1.60x |
The 4090 is consistently 1.4-1.8x faster, yet its paper advantage is only ~1.6x on bandwidth and ~1.7x on FP8 compute. Batch-1 decode roughly tracks the bandwidth ratio; the batched, quantised and diffusion workloads overshoot it, and that overshoot is the software gap.
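One way to read the table: divide each measured speedup by the ~1.58x paper bandwidth ratio that bounds bandwidth-limited decode, and whatever exceeds 1.0 is kernel maturity rather than silicon. A throwaway calculation using the numbers above:

```python
# Attribute each measured 4090 / 9070 XT speedup to hardware vs software,
# using the paper bandwidth ratio as the decode-bound hardware ceiling.
PAPER_BW_RATIO = 1008 / 640  # ~1.58x, from the spec table

measured = {  # speedups taken from the throughput table
    "Llama 3.1 8B FP8 decode b1": 1.47,
    "Llama 3.1 8B FP8 batch 16":  1.69,
    "Mistral 7B FP8 decode b1":   1.43,
    "SDXL 1024x1024 30-step":     1.60,
}

for workload, ratio in measured.items():
    residual = ratio / PAPER_BW_RATIO
    note = "software gap" if residual > 1 else "within hardware ratio"
    print(f"{workload:30s} {ratio:.2f}x measured -> "
          f"{residual:.2f}x residual ({note})")
```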
Power, price and value
| Metric | RTX 4090 | RX 9070 XT |
|---|---|---|
| TDP | 450W | 304W |
| Sustained draw (LLM, b16) | 340W | 240W |
| Tokens/Joule | ~2.59 | ~2.17 |
| UK price (typical 2026) | £1,300 | £550 |
| £/decode t/s (b1) | £6.57 | £4.07 |
| £/aggregate t/s (b16) | £1.48 | £1.06 |
| £/GB VRAM | £54 | £34 |
| Annual electricity @ 24/7 £0.18/kWh | £537 | £378 |
On £/token, the 9070 XT wins by 28-38%. On capability and software maturity, the 4090 wins decisively. The choice depends on which axis you optimise.
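The value rows fall straight out of the other tables. A quick script to reproduce them from the quoted prices, throughputs and sustained power draws (rounding may differ from the table by £1):

```python
# Reproduce the value metrics from the quoted prices, throughput and power.
cards = {
    "RTX 4090":   dict(price=1300, b1=198, b16=880, watts=340, vram=24),
    "RX 9070 XT": dict(price=550,  b1=135, b16=520, watts=240, vram=16),
}
GBP_PER_KWH = 0.18        # UK electricity assumption from the table
HOURS_PER_YEAR = 24 * 365

for name, c in cards.items():
    print(f"{name}:")
    print(f"  tokens/joule       {c['b16'] / c['watts']:.2f}")
    print(f"  £ per decode t/s   {c['price'] / c['b1']:.2f}")
    print(f"  £ per agg t/s      {c['price'] / c['b16']:.2f}")
    print(f"  £ per GB VRAM      {c['price'] / c['vram']:.0f}")
    print(f"  £/yr electricity   "
          f"{c['watts'] / 1000 * HOURS_PER_YEAR * GBP_PER_KWH:.0f}")
```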
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | Mature toolchain, higher batch throughput |
| Bootstrap MVP, AMD-friendly team | 9070 XT | £750 saved vs 4090 |
| 12-engineer Qwen 32B AWQ | 4090 | 9070 XT 16GB OOM |
| FLUX.1-dev studio | 4090 | FP16 path, mature kernels |
| SDXL hobby | 9070 XT | 3.2s/image, much cheaper |
| Voice agent (Whisper + 8B) | 4090 | 50x RT works but 80x is comfortable |
| Llama 70B INT4 | 4090 | 9070 XT cannot fit |
| Cutting-edge research (latest kernels) | 4090 | CUDA-first ecosystem |
| Datacentre at scale | 4090 (or H100) | Mature monitoring, drivers |
| AMD-shop bias / political constraint | 9070 XT | If you must run AMD, the 9070 XT is fine |
Production gotchas with AMD
- ROCm release cadence lags upstream. Plan for 1-3 month lag on vLLM features. Bleeding-edge LLM kernels arrive on AMD second.
- FlashAttention path is slower. The current ROCm Flash kernels are competitive but not yet at FA3 parity.
- AWQ on AMD goes through GGUF or naïve INT4. Marlin-class kernels do not have an AMD twin yet — you give up 20-30% throughput on quantised models.
- Triton-on-ROCm is improving but not perfect. Some custom Triton kernels (notably some FlashInfer paged-attention variants) run slower or fall back to PyTorch eager.
- Driver / firmware release surprises. AMD's enterprise driver branch is on a separate cadence from gaming; production teams should pin a known-good combo (a minimal pinning guard is sketched after this list).
- 16GB on the 9070 XT is a hard ceiling. No 24GB SKU exists; you cannot grow into Qwen 32B without changing cards.
- Documentation surface is thinner. The vast majority of community deployment guides assume CUDA. Expect more troubleshooting on AMD.
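On the pinning point, the cheapest insurance is a startup assertion that refuses to serve on an untested stack. A minimal sketch, assuming you record whatever torch/vLLM/ROCm combo your own burn-in qualified; the version strings below are placeholders, not recommendations:

```python
# Startup guard: refuse to boot the inference service on an unqualified
# torch / vLLM / ROCm combination. Versions are illustrative placeholders.
import sys
import torch
import vllm

KNOWN_GOOD = {
    "torch": "2.5.1",   # illustrative
    "vllm":  "0.6.6",   # illustrative
    "hip":   "6.3",     # illustrative ROCm/HIP major.minor
}

def check() -> list[str]:
    problems = []
    if not torch.__version__.startswith(KNOWN_GOOD["torch"]):
        problems.append(f"torch {torch.__version__} != {KNOWN_GOOD['torch']}")
    if not vllm.__version__.startswith(KNOWN_GOOD["vllm"]):
        problems.append(f"vllm {vllm.__version__} != {KNOWN_GOOD['vllm']}")
    hip = getattr(torch.version, "hip", None) or ""
    if not hip.startswith(KNOWN_GOOD["hip"]):
        problems.append(f"ROCm/HIP '{hip}' != {KNOWN_GOOD['hip']}.x")
    return problems

if problems := check():
    sys.exit("refusing to start on unqualified stack:\n  " + "\n  ".join(problems))
print("stack matches known-good pin; starting service")
```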
Verdict
- Pick the RTX 4090 24GB if you need production inference today; you want vLLM, AWQ Marlin, FlashInfer and FlashAttention-3 on the latest tag; you serve more than a hobbyist load; you need 24GB; or you simply value not debugging ROCm.
- Pick the RX 9070 XT if you are bootstrapped, you have a team comfortable with ROCm and willing to lag the CUDA toolchain by months, you serve light loads, and you want the lowest hardware capex.
- Pick neither if you need datacentre-class capacity — go to H100 80GB or MI300X 192GB.
For a 200-MAU SaaS RAG, the 4090 is the right answer. For a hobby project that doesn’t mind ROCm friction, the 9070 XT saves £750 and runs Llama 8B at usable speeds.
Skip the ROCm friction
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB pre-flighted for vLLM, FlashAttention and the full CUDA stack — production inference without the toolchain bring-up.
Order the RTX 4090 24GB
See also: vs MI300X 192GB, vs RTX 5080 16GB, RTX 4090 spec breakdown, vLLM setup, FP8 Llama deployment, 2026 tier positioning, tokens-per-watt.