AMD Ryzen AI MAX+ 395 Hosting — The 70B Single-Host King
128 GB of unified LPDDR5X memory shared between a 16-core Zen 5 CPU, a 40-CU RDNA3.5 iGPU, and a 50 TOPS XDNA2 NPU — on a single 120 W die. The cheapest way to load a 70B FP8 model on a single host. Trades raw throughput for unprecedented model-fit headroom.
Ryzen AI MAX+ 395 Server Specs
The hardware you actually rent.
| Spec | Detail |
|---|---|
| Processor | AMD Ryzen AI MAX+ 395 (Strix Halo APU) |
| Architecture | Zen 5 CPU + RDNA3.5 iGPU + XDNA2 NPU on a single die |
| Unified memory | 128 GB LPDDR5X-8000 (shared across CPU + iGPU + NPU) |
| CPU | 16 cores / 32 threads Zen 5, up to 5.1 GHz |
| iGPU | Radeon 8060S — 40 RDNA3.5 CUs, ~59 TFLOPS FP16 |
| NPU | XDNA2 — 50 TOPS INT8 (sustained low-power inference) |
| TDP | 120 W (entire SoC) |
| Software stack | ROCm for the iGPU, Ryzen AI SDK / Optimum-AMD for the NPU |
| Storage | 1 TB NVMe + 4 TB SATA SSD |
| Network | 1 Gbps unmetered |
| Location | London, United Kingdom |
What Fits on a Single MAX+ 395
128 GB of unified memory is a category-killer at this price. The MAX+ 395 fits 70B-class models on a single host — no tensor parallel, no NCCL, no model splitting. This is the unique selling point.
| Model | Params | Footprint | Fit | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | ~70 GB FP8 | Comfortable | The headline workload — KV cache room for 32K+ context |
| Qwen 2.5 72B | 72B | ~70 GB FP8 | Comfortable | Multilingual + coding flagship at FP8 |
| DeepSeek 67B | 67B | ~67 GB INT8 | Comfortable | Reasoning-tier model with full context budget |
| Mixtral 8x22B | 141B (MoE) | ~80 GB INT4 | Fits | Largest MoE you’ll fit on any single-host system at this price |
| Multi-model serving | 13B + 8B + 7B + embeddings | ~50 GB combined | Comfortable | Run an LLM stack, RAG, and embeddings together |
| FLUX.1 dev + SDXL + Whisper large-v3 | 12B + 3.5B + 1.6B | ~38 GB FP16 combined | Comfortable | Co-hosted image + audio pipeline, with room for an LLM alongside |
| Llama 3.1 8B (long context) | 8B | ~16 GB + 30 GB KV | Headroom | 128K+ token context window with room to spare |
| Qwen 2.5 14B | 14B | ~28 GB FP16 | Headroom | Full FP16 with multi-tenant context budgets |
| Embeddings + reranker (BGE-large) | 0.5B | ~2 GB | Headroom | Stack alongside any of the above |
When the MAX+ 395 Is the Right Box
Real customer workloads we run on this hardware every day.
Run a 70B model for £299/mo
The cheapest single-host 70B option in the catalogue. Llama 3.3 70B FP8 or Qwen 2.5 72B FP8 fit comfortably with 32K+ context. No tensor parallel, no NCCL, no model splitting — just load and serve.
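What "just load and serve" looks like in practice — a minimal sketch, assuming a ROCm build of vLLM, with the model ID and flags illustrative rather than a tested recipe:

```python
# Single-host 70B serve on the MAX+ 395 -- no tensor parallel, no NCCL.
# Assumes a ROCm build of vLLM; quantization path is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    quantization="fp8",        # weight-only FP8: ~70 GB of weights
    max_model_len=32768,       # 32K context fits the remaining unified memory
    tensor_parallel_size=1,    # one device, no model splitting
)

outputs = llm.generate(
    ["Summarise the key risks in this contract:"],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```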
Multi-model serving stack
Run an LLM, an embedding model, a reranker, a TTS engine, and an ASR model on one host. The 128 GB envelope lets you keep everything resident instead of swapping models in and out.
Long-context document analysis
128K+ token contexts on 8B-class models with massive KV cache headroom. Think contract review, codebase analysis, long-form summarisation — workloads where context length matters more than tokens-per-second.
Power-constrained inference
120 W for the whole SoC — CPU, iGPU, and NPU combined. Roughly a third of a 5080’s 360 W. The XDNA2 NPU handles sustained INT8 inference at very low watts when latency matters less than energy budget.
Local-dev mirror
If your team prototypes on Strix Halo laptops or mini-PCs, this is the closest server-side analogue. Same APU, same memory architecture, same software stack — but in a 24/7 hosted box you can point your CI at.
Batch inference jobs
Where throughput-per-dollar matters less than model-fit. Overnight summarisation runs, large-scale embeddings backfills, evaluation passes against 70B judges — anywhere the queue can absorb the latency hit but the model has to be big.
MAX+ 395 vs Other Large-Memory Options
How the 128 GB unified envelope stacks up against the alternatives in the GigaGPU catalogue.
| System | Usable memory | 70B FP8 single-host? | Throughput / Notes | Price |
|---|---|---|---|---|
| Ryzen AI MAX+ 395 | 128 GB unified LPDDR5X | Yes — comfortable | Lower single-stream tok/s, big batch and big-context wins | from £299 |
| RTX 6000 PRO | 96 GB GDDR7 ECC | Yes — with FP4 hardware | Highest throughput in catalogue, ECC, FP4 native | from £899 |
| 2× RTX 4090 | 48 GB combined | No — 70B INT4 only, split across both cards | Fast per card, model-splitting overhead | from £578 |
| Radeon AI Pro R9700 | 32 GB GDDR6 | No — 32B-class only | Faster discrete GPU, far smaller memory | from £199 |
| RTX 5090 | 32 GB GDDR7 | No — 70B INT4 only with tight KV cache | Best single-stream tok/s under £500 | from £399 |
Deep Dive
Why unified memory changes the maths
Discrete GPUs have a hard wall: VRAM. A 5090 stops at 32 GB. A 4090 stops at 24 GB. Once your model + KV cache exceeds that, you either quantise harder, shrink your context, or split across cards with tensor parallel — which adds NCCL, latency, and operational complexity.
The MAX+ 395 throws that wall away. 128 GB of LPDDR5X is shared between the Zen 5 CPU, the Radeon 8060S iGPU, and the XDNA2 NPU — there’s no PCIe transfer cost between them, and any of the three can address the full pool. For workloads where model-fit is the bottleneck (and that’s most of the 70B-class market), this is a genuinely different category of machine.
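The arithmetic is worth doing once. A back-of-envelope budget for the headline workload, using the published Llama 3 70B shapes (80 layers, 8 KV heads under GQA, head dim 128) — the FP8 KV assumption is ours:

```python
# Does Llama 3.3 70B FP8 + 32K context fit in 128 GB? Quick budget check.
params = 70e9
weights_gb = params * 1 / 1e9            # FP8 = 1 byte/param -> ~70 GB

layers, kv_heads, head_dim = 80, 8, 128  # Llama 3 70B, GQA
kv_bytes = 1                             # FP8 KV cache; double for FP16
kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes   # K and V
kv_gb = kv_per_token * 32_768 / 1e9      # ~5.4 GB at 32K tokens

print(f"weights ~{weights_gb:.0f} GB, KV ~{kv_gb:.1f} GB, "
      f"headroom ~{128 - weights_gb - kv_gb:.0f} GB")
# weights ~70 GB, KV ~5.4 GB, headroom ~53 GB -- the wall never arrives
```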
Where the MAX+ 395 isn’t the right pick
The honest answer: single-stream throughput. The Radeon 8060S iGPU is roughly comparable to a desktop RX 7600 — it’s a real GPU, but it’s not a 5090. If your bottleneck is “tokens per second for one user typing into a chat box,” a 4090 at £289 will outpace it.
The MAX+ 395's iGPU also has no native FP8 hardware yet — quantisation is software-emulated through ROCm or routed through the XDNA2 NPU's INT8 path. That's fine for memory savings, but it means the 2× FP8 speedup Blackwell tensor cores deliver doesn't apply here. Plan around INT8 / INT4 on the NPU and FP16 / BF16 on the iGPU.
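In code terms, the iGPU path is plain BF16/FP16 through ROCm PyTorch. A sketch, with the model ID pulled from the fit table above:

```python
# BF16 load on the Radeon 8060S via ROCm PyTorch (which reuses the
# torch.cuda namespace on AMD). Model ID is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"   # ~28 GB at BF16, per the fit table
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # iGPU path: BF16/FP16, no native FP8
    device_map="cuda:0",         # the 8060S, as ROCm exposes it
)
```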
The XDNA2 NPU is for sustained low-power inference
50 TOPS of INT8 is not 5090-class compute, but it has a different shape: very low watts per token under sustained load. The NPU is accessed via the Ryzen AI SDK and Hugging Face Optimum-AMD — most production teams will route batch inference, embeddings, and ASR/TTS through the NPU and keep the iGPU for the bigger LLM that’s holding the bulk of the unified memory.
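Under the hood, that NPU path typically runs through ONNX Runtime's Vitis AI execution provider, which the Ryzen AI SDK builds on. A sketch — the model file is a placeholder, and the exact setup depends on your SDK version:

```python
# Route an INT8 ONNX model through the XDNA2 NPU, with CPU fallback.
# "model-int8.onnx" is a placeholder; quantise with the Ryzen AI tooling first.
import onnxruntime as ort

sess = ort.InferenceSession(
    "model-int8.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())   # confirm the NPU provider actually attached
```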
The combination — iGPU for the 70B model, NPU for the supporting cast, all sharing the same 128 GB pool — is what makes this box interesting. It’s not a faster GPU. It’s a single-host architecture that lets you stop juggling.
Pricing context: £299 vs the alternatives
The closest single-host 70B alternative in the catalogue is the RTX 6000 PRO at £899. The 6000 PRO is meaningfully faster, has ECC, and ships with FP4 hardware — but it's 3× the price. For teams who need 70B-class fit on a budget and are willing to trade single-stream throughput, the MAX+ 395 is the cheapest way in.
- £299 vs £899 6000 PRO — same model fits, ~3× the throughput on the 6000 PRO.
- £299 vs £578 for 2× 4090 — 2.7× the memory, no tensor-parallel hassle on the 395.
- £299 vs £199 R9700 — 4× the memory, ~50% premium.
Pick by bottleneck: if it’s throughput, pay for the 6000 PRO. If it’s model-fit on a budget, the MAX+ 395 is the right call.
Frequently Asked Questions
The questions buyers actually ask before committing to a Strix Halo box.
Can I really run Llama 3.3 70B on this?
Yes. At FP8 the weights take ~70 GB, leaving roughly 50 GB for KV cache, OS, and headroom. We have customers running 70B models with 32K context windows on this box without issue. Single-stream tok/s is lower than a 5090 — expect roughly 8–15 tok/s for a 70B FP8 — but it fits on one host with no model splitting.
Is this faster than a 4090 or 5090?
For models that fit on a 4090/5090, no — those discrete GPUs have meaningfully more raw compute and bandwidth. The MAX+ 395 wins where the model doesn’t fit elsewhere. Pick by bottleneck: throughput vs model-fit.
Does ROCm actually work on the iGPU?
Yes. The Radeon 8060S is RDNA3.5 and supported by ROCm. Most teams use ROCm + PyTorch directly. The XDNA2 NPU is accessed separately via the Ryzen AI SDK and Hugging Face Optimum-AMD — it’s a different code path but mature for INT8 inference.
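A thirty-second sanity check once you're on the box — ROCm builds of PyTorch reuse the `torch.cuda` API, so nothing needs rewriting:

```python
import torch

assert torch.cuda.is_available(), "ROCm device not visible"
print(torch.cuda.get_device_name(0))   # expect the Radeon 8060S
print(torch.version.hip)               # non-None confirms a ROCm build
```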
What about FP8?
The iGPU doesn’t have native FP8 hardware yet — quantisation is software-emulated through ROCm. You still get the memory savings (which is the point on a 70B model), but you don’t get the 2× FP8 speedup that Blackwell tensor cores give you. INT8 on the NPU is hardware-accelerated.
Can I fine-tune on this?
QLoRA on 13B–34B models works fine in the unified memory pool. Full SFT on a 70B isn’t realistic — you’d want a 6000 PRO or a multi-GPU box for that. The MAX+ 395 is an inference-first machine.
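For the QLoRA case, a hedged sketch — it assumes a ROCm-compatible bitsandbytes build, and the model ID and hyperparameters are illustrative:

```python
# QLoRA on a 34B-class model inside the 128 GB unified pool.
# Assumes bitsandbytes works on your ROCm stack; verify before committing.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-hf",       # illustrative 34B base
    quantization_config=bnb,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()      # adapters only; base stays frozen
```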
How does it compare to the R9700?
The Radeon AI Pro R9700 is a discrete GPU with 32 GB GDDR6 — faster on workloads that fit in 32 GB, but it can't load a 70B model. The MAX+ 395 has 4× the memory at a 50% premium. Pick by model size.
Power draw at 100% load?
120 W for the whole SoC. That's roughly a third of a 5080's 360 W and a fifth of a 6000 PRO's 600 W. The TDP advantage is real if your hosting bill includes power.
Same-day deployment?
Yes for in-stock SKUs. Strix Halo supply is tighter than mainstream GPUs — out-of-stock lead time is 3–5 working days.
Need 70B-class fit on a budget? This is the box.
128 GB of unified LPDDR5X, Zen 5 + RDNA3.5 + XDNA2 on one die, 120 W total. The cheapest single-host 70B FP8 option in the catalogue. From £299/mo with 3–5 day deployment.