The AMD Instinct MI300X is one of the most interesting datacentre accelerators of the decade: 192 GB of HBM3 in a single OAM module, 5.3 TB/s of memory bandwidth, 304 CDNA 3 compute units, and a 750W power envelope. The RTX 4090 is a £1,300 consumer card with 24 GB of GDDR6X and 1 TB/s of bandwidth. Comparing them head to head is unfair on both ends, but understanding where each fits clarifies what you are actually buying when you provision UK GPU hosting. The MI300X is a Llama 405B / DeepSeek V3 / Mixtral 8x22B card; the 4090 is a Llama 8B / Qwen 32B / FLUX card. Same general purpose, completely different tier.
Contents
- Spec sheet side by side
- 192GB and what it unlocks
- ROCm on MI300X — production reality
- Per-workload throughput comparison
- Power, price and economics
- Per-workload winner table
- Production gotchas with MI300X
- Verdict
Spec sheet side by side
| Spec | RTX 4090 (Ada AD102) | MI300X (CDNA 3) | Delta |
|---|---|---|---|
| Process | TSMC 4N | TSMC N5 + N6 (chiplet) | Different package |
| Compute units | 128 SMs | 304 CUs | 2.4x |
| Matrix throughput (FP16) | 165 TFLOPS | 1300 TFLOPS | 7.9x |
| FP8 throughput | 660 TFLOPS (dense, FP16 accumulate) | 2600 TFLOPS dense | 3.9x |
| VRAM | 24 GB GDDR6X | 192 GB HBM3 | 8x capacity |
| Memory bandwidth | 1008 GB/s | 5.3 TB/s | 5.26x |
| L2 / Infinity cache | 72 MB L2 | 256 MB Infinity Cache | 3.5x |
| Interconnect | PCIe Gen 4 x16 | Infinity Fabric 896 GB/s + PCIe Gen 5 | Datacentre class |
| FP8 native | E4M3 + E5M2 | E4M3 + E5M2 | Same |
| TDP | 450W | 750W | +67% |
| Form factor | 3.5-slot consumer | OAM module | Server only |
| Approx UK price (2026) | £1,300 | £15,000+ | 11x |
The MI300X is in a different league: 8x the VRAM, 5.3x the bandwidth, roughly 4x the FP8 throughput. It also costs 11x more and requires an OAM-compatible chassis (typically an 8-GPU Supermicro or Dell UBB server) that costs another £30k+ fully kitted out. You do not buy an MI300X to serve Llama 8B.
192GB and what it unlocks
| Model / configuration | RTX 4090 24GB | MI300X 192GB |
|---|---|---|
| Llama 3.1 8B FP8 | Comfortable | Trivial |
| Llama 3.1 70B FP8 (~70 GB) | OOM | Trivial |
| Llama 3.1 70B BF16 (140 GB) | OOM | Comfortable |
| Llama 3.1 405B 4-bit (AWQ INT4 / FP4 microscaling, ~205 GB) | OOM | OOM (single) |
| Mixtral 8x22B BF16 (~280 GB) | OOM | OOM (single) |
| DeepSeek V3 671B FP8 (~671 GB) | OOM | OOM (need 4x) |
| Qwen 2.5 72B FP8 | OOM | Comfortable |
| 200 concurrent Llama 8B sessions | OOM | Comfortable |
| Heavy MoE serving (Mixtral 8x7B + KV) | OOM at scale | Comfortable |
192GB unlocks: every dense model up to 70B BF16 on a single card, every MoE up to Mixtral 8x22B FP8, and very high-concurrency serving. The frontier 400B+ models still need multi-card.
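The arithmetic behind that table is simple enough to script. Below is a back-of-envelope fit check in Python; the flat 6 GB overhead allowance is an assumption (real engines add allocator slack, activation buffers and KV cache on top of weights), so treat anything within ~10% of capacity as an OOM risk:

```python
# Rough VRAM fit check: raw weight bytes plus a flat runtime allowance.
# The overhead figure is an assumption, not a measured constant.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

def fits(params_billion: float, dtype: str, vram_gb: float,
         overhead_gb: float = 6.0) -> bool:
    """True if weights plus the overhead allowance fit in VRAM."""
    return weight_gb(params_billion, dtype) + overhead_gb <= vram_gb

for name, params, dtype in [
    ("Llama 3.1 8B FP8",      8, "fp8"),
    ("Llama 3.1 70B FP8",    70, "fp8"),
    ("Llama 3.1 70B BF16",   70, "bf16"),
    ("Llama 3.1 405B 4-bit", 405, "int4"),
]:
    print(f"{name}: {weight_gb(params, dtype):.0f} GB weights | "
          f"4090 fits: {fits(params, dtype, 24)} | "
          f"MI300X fits: {fits(params, dtype, 192)}")
```

405B at 4 bits is ~203 GB of weights before any KV cache, which is why even the quantised frontier models stay multi-card.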
ROCm on MI300X — production reality
ROCm 6.3+ is a credible production stack for the MI300X. vLLM, SGLang and Hugging Face TGI all ship AMD-supported builds. Performance is competitive on well-supported models (Llama, Mistral, Qwen), typically within 10-20% of comparable NVIDIA hardware once you normalise for memory bandwidth. The lag persists on the newest kernels: FlashAttention ports take months to land, FlashInfer paged-attention variants trail their CUDA counterparts, and some Mamba and state-space kernels are simply absent. For Llama-class transformer inference at scale, the MI300X delivers; for cutting-edge research, the 4090 (or H100) sees new kernels first.
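As a concreteness check, this is roughly what single-card 70B FP8 serving looks like through vLLM's offline API. It is a minimal sketch, assuming a ROCm build of vLLM (e.g. the rocm/vllm container images); the model choice and flag values are illustrative, so validate everything against your pinned stack:

```python
# Minimal vLLM sketch for one MI300X; model and flags are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # ~70 GB of weights at FP8
    quantization="fp8",           # on-the-fly FP8 weight quantisation
    tensor_parallel_size=1,       # single card: no tensor parallelism needed
    gpu_memory_utilization=0.92,  # leave headroom for KV cache and activations
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarise the MI300X memory hierarchy."], params)
print(out[0].outputs[0].text)
```

The same script runs unmodified on CUDA builds of vLLM, which is most of the ROCm value proposition: the portability cost sits in the kernels underneath, not in your serving code.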
For UK hosting, the MI300X is rare on consumer-facing platforms: typically you rent capacity from a hyperscaler (Azure's ND MI300X v5 instances, Oracle Cloud) or a specialised AMD-focused integrator. A 4090 you can rack in any UK datacentre. See vs RX 9070 XT for the consumer AMD comparison.
Per-workload throughput comparison
| Workload | RTX 4090 | MI300X | MI300X / 4090 |
|---|---|---|---|
| Llama 3.1 8B FP8 decode b1 | 198 t/s | ~280 t/s | 1.41x |
| Llama 3.1 8B FP8 batch 64 agg | 1140 t/s | ~3500 t/s | 3.07x |
| Llama 3.1 70B AWQ b1 | 22-24 t/s | ~75 t/s | 3.13x |
| Llama 3.1 70B FP8 b1 | OOM | ~95 t/s | MI300X only |
| Llama 3.1 70B BF16 b1 | OOM | ~52 t/s | MI300X only |
| Mixtral 8x22B FP8 | OOM | ~62 t/s | MI300X only |
| Qwen 2.5 72B FP8 | OOM | ~46 t/s | MI300X only |
| 200 concurrent Llama 8B | OOM at KV | ~12,000 agg t/s | MI300X only |
| SDXL 1024×1024 | 2.0s | ~1.4s | 1.43x |
| QLoRA Llama 8B (steps/s) | 2.6 | ~7.5 | 2.88x |
For workloads both cards run, the MI300X is 1.4-3.1x faster. For workloads only the MI300X runs, the comparison is moot. The killer feature is concurrency: a single MI300X can serve dozens of large-model sessions (or hundreds of 8B sessions), where a 4090 caps out at single digits.
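The concurrency gap is KV-cache arithmetic, not magic. Here is a sketch of the numbers behind the 200-session row, using Llama 3.1 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128); the 4k context and FP16 KV dtype are illustrative assumptions:

```python
# KV-cache footprint: 2 tensors (K and V) per layer, per token.
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()    # 131,072 bytes = 128 KiB per token
sessions, ctx = 200, 4096         # assumed workload shape
total_gb = sessions * ctx * per_tok / 1e9
print(f"KV per token: {per_tok / 1024:.0f} KiB")
print(f"{sessions} sessions x {ctx} tokens: {total_gb:.0f} GB of KV cache")
```

That is roughly 107 GB of KV cache: comfortable next to 192 GB alongside ~8 GB of FP8 weights, and flatly impossible on a 24 GB card.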
Power, price and economics
| Metric | RTX 4090 | MI300X |
|---|---|---|
| TDP | 450W | 750W |
| Sustained LLM b32 | 360W | ~620W |
| UK price (2026) | £1,300 | £15,000+ |
| Server / chassis required | 4U with 12V-2×6 | Specialist OAM/UBB chassis (~£30k) |
| £/aggregate t/s b32 (Llama 8B) | £1.18 | £12.86 |
| £/aggregate t/s b64 (Llama 8B) | £1.14 | £4.29 |
| £/year electricity @ 24/7 | £568 | £978 |
| £/GB VRAM | £54 | £78 |
For Llama 8B, the 4090 wins on £/token by a factor of roughly 4-11x depending on batch size. For Llama 70B FP8, the comparison is meaningless because the 4090 cannot run it. The MI300X earns its premium only when you genuinely need the VRAM and bandwidth for very large or very high-concurrency workloads.
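The table's economics reduce to two one-line formulas, reproduced below so you can substitute your own street prices and tariff. The £0.18/kWh rate is inferred from the table's electricity rows and is an assumption; UK business tariffs vary:

```python
# Price-performance and running-cost arithmetic behind the table above.
def pounds_per_agg_tps(card_price_gbp: float, agg_tps: float) -> float:
    """Capex per unit of aggregate decode throughput."""
    return card_price_gbp / agg_tps

def annual_electricity_gbp(watts: float, gbp_per_kwh: float = 0.18) -> float:
    """Cost of a year of 24/7 draw at an assumed tariff."""
    return watts / 1000 * 24 * 365 * gbp_per_kwh

print(f"4090 b64:    £{pounds_per_agg_tps(1300, 1140):.2f} per agg t/s")
print(f"MI300X b64:  £{pounds_per_agg_tps(15000, 3500):.2f} per agg t/s")
print(f"4090 24/7:   £{annual_electricity_gbp(360):.0f}/year")
print(f"MI300X 24/7: £{annual_electricity_gbp(620):.0f}/year")
```

Note what the capex figure ignores: chassis, hosting, and the fact that one MI300X serves workloads a 4090 cannot attempt at any price.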
Per-workload winner table
| Workload | Winner | Why |
|---|---|---|
| 200-MAU SaaS RAG on Llama 8B | 4090 | 10x cheaper, more than enough |
| 12-engineer Qwen 32B AWQ | 4090 | Fits, MI300X overkill |
| Llama 70B FP8 production at scale | MI300X | 4090 cannot fit |
| Llama 405B 4-bit | MI300X (2x) | ~205 GB of weights; 4090 not in contention |
| Mixtral 8x22B endpoint | MI300X | 4090 OOM |
| 500+ concurrent 8B sessions | MI300X | 4090 KV cache exhausted |
| FLUX.1-dev hobby | 4090 | MI300X overkill |
| LLM training (full pretrain) | MI300X (cluster) | 4090 capacity insufficient |
| Cutting-edge research | 4090 | CUDA-first kernels |
| UK-located hosting under £2k/mo | 4090 | MI300X capacity scarce in UK |
Production gotchas with MI300X
- OAM-only form factor. Cannot drop into a standard PCIe slot. Requires an OAM/UBB chassis costing £30k+.
- UK availability is thin. Most UK MI300X capacity is in Azure (UK South region). On-premises hosting is rare.
- ROCm version sensitivity. Pin a specific ROCm version (6.3.x in 2026) and validate every model on it; cross-version regressions are real. A startup guard like the sketch after this list catches accidental drift.
- Cooling: liquid or aggressive air. 750W in a single OAM module wants serious airflow; many older chassis cannot handle it.
- Driver update windows are long. Production AMD driver upgrades require fleet-wide validation. Not a “yum update” affair.
- NCCL equivalent (RCCL) maturity. Multi-MI300X all-reduce is competitive, but the documentation is thinner than NCCL's.
- Capex commitment. A 4090 is a £1,300 risk. An MI300X is a £15k commitment per card, plus chassis, plus support contract.
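For the version-pinning bullet, a minimal startup guard looks like this. It is a sketch under assumptions: the pinned version strings are examples, and it relies on torch.version.hip, which is populated on ROCm builds of PyTorch and None on CUDA builds:

```python
# Refuse to serve if the runtime ROCm/PyTorch pair drifts from the
# fleet-validated pin. Pinned strings below are examples only.
import sys
import torch

PINNED_HIP = "6.3"     # validated ROCm/HIP major.minor (example)
PINNED_TORCH = "2.5"   # validated PyTorch major.minor (example)

hip = getattr(torch.version, "hip", None)  # None on CUDA builds of torch
if hip is None or not hip.startswith(PINNED_HIP):
    sys.exit(f"refusing to start: HIP {hip!r} != pinned {PINNED_HIP}.x")
if not torch.__version__.startswith(PINNED_TORCH):
    sys.exit(f"refusing to start: torch {torch.__version__} != pinned {PINNED_TORCH}.x")

print("ROCm stack matches the validated pin; loading models.")
```

Run it as the first step of your serving entrypoint; it turns a silent cross-version regression into a loud deployment failure.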
Verdict
- Pick the RTX 4090 24GB if your model fits in 24GB; you serve fewer than ~50 concurrent users; you are price-sensitive; or you need UK-located hosting with predictable lead times.
- Pick the MI300X 192GB if you need to serve Llama 70B FP8 / Mixtral 8x22B FP8 / Qwen 72B FP8 on a single card (or Llama 405B / DeepSeek V3 across two or more), you serve hundreds of concurrent users on smaller models, or you have an internal AMD/ROCm competency.
- Pick neither if you specifically need NVIDIA datacentre features (MIG, NCCL, CUDA Graphs); in that case go to the H100 80GB.
For a 200-MAU SaaS, the 4090 is the right answer. For a regional bank running a Llama 70B FP8 audit-grade endpoint at 100+ concurrent users, the MI300X (or H100) is the only credible choice.
Start where the workload actually lives
GigaGPU’s UK dedicated hosting offers the RTX 4090 24GB — the right size, in the right country, with the right software stack — for the workloads that don’t need a 192GB datacentre accelerator.
Order the RTX 4090 24GB. See also: vs H100 80GB, vs A100 80GB, vs AMD RX 9070 XT, vs RTX 6000 Pro 96GB, RTX 4090 spec breakdown, 2026 tier positioning, multi-card pairing.