Ryzen AI MAX+ 395 - Order Now
Strix Halo APU · 128 GB Unified · 70B-Class Fit

AMD Ryzen AI MAX+ 395 Hosting — The 70B Single-Host King

128 GB of unified LPDDR5X memory shared between a 16-core Zen 5 CPU, a 40-CU RDNA3.5 iGPU, and a 50 TOPS XDNA2 NPU — on a single 120 W die. The cheapest way to load a 70B FP8 model on a single host. Trades raw throughput for unprecedented model-fit headroom.

128 GB unified memory · 70B FP8 fits on one host · 120 W whole-SoC TDP · From £299/mo

128 GB · Unified LPDDR5X
16C / 32T · Zen 5 CPU @ 5.1 GHz
50 TOPS · XDNA2 NPU (INT8)
From £299/mo

Ryzen AI MAX+ 395 Server Specs

The hardware you actually rent.

Processor: AMD Ryzen AI MAX+ 395 (Strix Halo APU)
Architecture: Zen 5 CPU + RDNA3.5 iGPU + XDNA2 NPU on a single die
Unified memory: 128 GB LPDDR5X-8000 (shared across CPU + iGPU + NPU)
CPU: 16 cores / 32 threads Zen 5, up to 5.1 GHz
iGPU: Radeon 8060S — 40 RDNA3.5 CUs, ~59 TFLOPS FP16
NPU: XDNA2 — 50 TOPS INT8 (sustained low-power inference)
TDP: 120 W (entire SoC)
Software stack: ROCm for the iGPU, Ryzen AI SDK / Optimum-AMD for the NPU
Storage: 1 TB NVMe + 4 TB SATA SSD
Network: 1 Gbps unmetered
Location: London, United Kingdom

What Fits on a Single MAX+ 395

128 GB of unified memory is a category-killer at this price. The MAX+ 395 fits 70B-class models on a single host — no tensor parallel, no NCCL, no model splitting. This is the unique selling point.

Model | Params | Footprint | Fit | Notes
Llama 3.3 70B | 70B | ~70 GB FP8 | Comfortable | The headline workload — KV cache room for 32K+ context
Qwen 2.5 72B | 72B | ~70 GB FP8 | Comfortable | Multilingual + coding flagship at FP8
DeepSeek 67B | 67B | ~67 GB INT8 | Comfortable | Reasoning-tier model with full context budget
Mixtral 8x22B | 141B (MoE) | ~80 GB INT4 | Fits | Largest MoE you’ll fit on any single-host system at this price
Multi-model serving | 13B + 8B + 7B + embeddings | ~50 GB combined | Comfortable | Run an LLM stack, RAG, and embeddings together
FLUX.1 dev + SDXL + Whisper L-v3 | — | ~38 GB combined (FP16) | Comfortable | Co-hosted image + audio + LLM pipeline
Llama 3.1 8B (long context) | 8B | ~16 GB + 30 GB KV | Headroom | 128K+ token context window with room to spare
Qwen 2.5 14B | 14B | ~28 GB FP16 | Headroom | Full FP16 with multi-tenant context budgets
Embeddings + reranker (BGE-large) | 0.5B | ~2 GB | Headroom | Stack alongside any of the above
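
The fit column is just arithmetic: weight bytes plus KV-cache bytes against the 128 GB pool. Here is a back-of-envelope sketch using the published Llama 3.x shapes (80 layers, 8 KV heads, head dim 128 for the 70B; 32 layers, 8 KV heads, head dim 128 for the 8B); real deployments lose a few more GB to the OS, runtime, and activation buffers.

```python
# Back-of-envelope model-fit check against the 128 GB unified pool.
# Shapes below are the published Llama 3.x configs; adjust for other models.

def weight_gb(params_b: float, bits: int) -> float:
    """Weight footprint in GB for a dense model at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_gb(tokens: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> float:
    """KV-cache footprint in GB: keys + values, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * (bits / 8) * tokens / 1e9

POOL_GB = 128

# Llama 3.3 70B: FP8 weights, FP16 KV cache, 32K context
w70 = weight_gb(70, 8)                  # ~70 GB
kv70 = kv_gb(32_768, 80, 8, 128, 16)    # ~10.7 GB
print(f"70B FP8 @ 32K ctx: {w70:.0f} + {kv70:.1f} GB -> {POOL_GB - w70 - kv70:.0f} GB headroom")

# Llama 3.1 8B: FP16 weights, FP16 KV cache, 128K context
w8 = weight_gb(8, 16)                   # ~16 GB
kv8 = kv_gb(131_072, 32, 8, 128, 16)    # ~17 GB
print(f"8B FP16 @ 128K ctx: {w8:.0f} + {kv8:.1f} GB -> {POOL_GB - w8 - kv8:.0f} GB headroom")
```

The printed headroom figures line up with the Comfortable and Headroom labels above; treat them as upper bounds rather than guarantees.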

When the MAX+ 395 Is the Right Box

Real customer workloads we run on this hardware every day.

Run a 70B model on £299/mo

The cheapest single-host 70B option in the catalogue. Llama 3.3 70B FP8 or Qwen 2.5 72B FP8 fit comfortably with 32K+ context. No tensor parallel, no NCCL, no model splitting — just load and serve.

Llama 3.3 70B · Qwen 2.5 72B · DeepSeek 67B
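
One low-friction way to do the "just load and serve" step is a Q8_0 GGUF of the model through llama.cpp's Python bindings, built against the ROCm/HIP backend so every layer offloads to the Radeon 8060S. A minimal sketch with a hypothetical model path; this is one serving option among several, not a prescribed stack.

```python
# Sketch: serve Llama 3.3 70B from a Q8_0 GGUF via llama-cpp-python.
# Assumes the package was built against the ROCm/HIP backend so that
# n_gpu_layers=-1 offloads every layer to the Radeon 8060S iGPU.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.3-70b-instruct.Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload all layers to the iGPU
    n_ctx=32_768,      # 32K context alongside ~70 GB of weights
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this clause in one sentence: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])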

Multi-model serving stack

Run an LLM, an embedding model, a reranker, a TTS engine, and an ASR model on one host. The 128 GB envelope lets you keep everything resident instead of swapping models in and out.

LLM + embeddings · TTS + ASR · RAG stack
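
Keeping the supporting cast resident is mostly a matter of loading each model once at process start and never unloading it. A minimal sketch of the embeddings + reranker half of such a stack with sentence-transformers; the checkpoints named are common public ones, not a required configuration.

```python
# Sketch: keep the RAG supporting cast resident next to the main LLM.
# With 128 GB of unified memory, nothing has to be swapped out between requests.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # loaded once, stays resident
reranker = CrossEncoder("BAAI/bge-reranker-large")        # loaded once, stays resident

def retrieve(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Coarse cosine retrieval, then cross-encoder rerank; the 70B generator runs separately."""
    q = embedder.encode(query, normalize_embeddings=True)
    p = embedder.encode(passages, normalize_embeddings=True)
    coarse = np.argsort(p @ q)[::-1][:20]                    # top 20 by cosine similarity
    candidates = [passages[i] for i in coarse]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```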

Long-context document analysis

128K+ token contexts on 8B-class models with massive KV cache headroom. Think contract review, codebase analysis, long-form summarisation — workloads where context length matters more than tokens-per-second.

128K context · Codebase RAG · Document QA

Power-constrained inference

120 W for the whole SoC — CPU, iGPU, and NPU combined. Roughly a third of a 5080’s 360 W. The XDNA2 NPU handles sustained INT8 inference at very low watts when latency matters less than energy budget.

XDNA2 NPU · 120 W envelope · Edge analogue

Local-dev mirror

If your team prototypes on Strix Halo laptops or mini-PCs, this is the closest server-side analogue. Same APU, same memory architecture, same software stack — but in a 24/7 hosted box you can point your CI at.

Strix Halo dev · CI inference · Pre-prod mirror

Batch inference jobs

Where throughput-per-dollar matters less than model-fit. Overnight summarisation runs, large-scale embeddings backfills, evaluation passes against 70B judges — anywhere the queue can absorb the latency hit but the model has to be big.

Batch summarisation · Embeddings backfill · LLM-as-judge
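
An embeddings backfill on this box is a plain chunked loop where batch size, not latency, is the tunable. A minimal sketch, assuming a hypothetical corpus.txt with one document per line.

```python
# Sketch: overnight embeddings backfill; fit and batch size matter more than latency here.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

with open("corpus.txt") as f:               # hypothetical input: one document per line
    docs = [line.strip() for line in f if line.strip()]

vectors = embedder.encode(docs, batch_size=256, normalize_embeddings=True,
                          show_progress_bar=True)
np.save("corpus_embeddings.npy", vectors)   # hypothetical output path
```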

MAX+ 395 vs Other Large-Memory Options

How the 128 GB unified envelope stacks up against the alternatives in the GigaGPU catalogue.

System | Usable memory | 70B FP8 single-host? | Throughput / notes | Price
Ryzen AI MAX+ 395 | 128 GB unified LPDDR5X | Yes — comfortable | Lower single-stream tok/s, big batch and big-context wins | from £299
RTX 6000 PRO | 96 GB GDDR7 ECC | Yes — with FP4 hardware | Highest throughput in the catalogue, ECC, FP4 native | from £899
2× RTX 4090 | 48 GB combined | Only with tensor parallel | Fast individually, model-splitting overhead | from £578
Radeon AI Pro R9700 | 32 GB GDDR6 | No — 32B-class only | Faster discrete GPU, smaller memory | from £199
RTX 5090 | 32 GB GDDR7 | No — 70B INT4 only with tight KV cache | Best single-stream tok/s under £500 | from £399

Deep Dive

Why unified memory changes the maths

Discrete GPUs have a hard wall: VRAM. A 5090 stops at 32 GB. A 4090 stops at 24 GB. Once your model + KV cache exceeds that, you either quantise harder, shrink your context, or split across cards with tensor parallel — which adds NCCL, latency, and operational complexity.

The MAX+ 395 throws that wall away. 128 GB of LPDDR5X is shared between the Zen 5 CPU, the Radeon 8060S iGPU, and the XDNA2 NPU — there’s no PCIe transfer cost between them, and any of the three can address the full pool. For workloads where model-fit is the bottleneck (and that’s most of the 70B-class market), this is a genuinely different category of machine.
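
On ROCm builds of PyTorch the Radeon 8060S shows up through the usual torch.cuda API, so it is easy to check what the runtime actually sees. A minimal sketch; how much of the 128 GB pool is visible to the iGPU depends on the host's carve-out/GTT configuration, so verify rather than assume.

```python
# Sketch: confirm what the ROCm runtime reports for the iGPU and its memory pool.
# On ROCm, PyTorch exposes the GPU through the torch.cuda namespace.
import torch

assert torch.cuda.is_available(), "ROCm build of PyTorch not found"
print(torch.cuda.get_device_name(0))          # expected: the Radeon 8060S iGPU

free_b, total_b = torch.cuda.mem_get_info(0)  # bytes visible to the runtime
print(f"visible: {total_b / 1e9:.0f} GB, free: {free_b / 1e9:.0f} GB")
```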

Where the MAX+ 395 isn’t the right pick

The honest answer: single-stream throughput. The Radeon 8060S iGPU is roughly comparable to a desktop RX 7600 — it’s a real GPU, but it’s not a 5090. If your bottleneck is “tokens per second for one user typing into a chat box,” a 4090 at £289 will outpace it.

The MAX+ 395's iGPU also has no native FP8 hardware yet — FP8 quantisation is software-emulated through ROCm or routed through the XDNA2 NPU's INT8 path. That's fine for memory savings, but it means the 2× FP8 speedup you get on Blackwell tensor cores doesn't apply here. Plan around INT8 / INT4 on the NPU and FP16 / BF16 on the iGPU.

The XDNA2 NPU is for sustained low-power inference

50 TOPS of INT8 is not 5090-class compute, but it has a different shape: very low watts per token under sustained load. The NPU is accessed via the Ryzen AI SDK and Hugging Face Optimum-AMD — most production teams will route batch inference, embeddings, and ASR/TTS through the NPU and keep the iGPU for the bigger LLM that’s holding the bulk of the unified memory.
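
In practice that NPU path means exporting the supporting models to ONNX and running them through ONNX Runtime with the Vitis AI execution provider that the Ryzen AI SDK installs. A minimal sketch; the model file is a placeholder and provider options vary by SDK release.

```python
# Sketch: route an INT8 ONNX model (an embedder, ASR encoder, etc.) to the XDNA2 NPU.
# The Ryzen AI SDK registers itself with ONNX Runtime as the Vitis AI execution provider;
# some SDK releases also expect a provider_options config file, so check the SDK docs.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int8.onnx",                                      # hypothetical quantised export
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],  # CPU as fallback
)

inputs = {session.get_inputs()[0].name: np.zeros((1, 128), dtype=np.int64)}  # dummy token IDs
outputs = session.run(None, inputs)
print([o.shape for o in outputs])
```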

The combination — iGPU for the 70B model, NPU for the supporting cast, all sharing the same 128 GB pool — is what makes this box interesting. It’s not a faster GPU. It’s a single-host architecture that lets you stop juggling.

Pricing context: £299 vs the alternatives

The closest single-host 70B option in the catalogue is the RTX 6000 PRO at £899. The 6000 PRO is meaningfully faster, has ECC, and ships with FP4 hardware — but it’s 3× the price. For teams who need 70B-class fit on a budget and are willing to trade single-stream throughput, the MAX+ 395 is the cheapest way in.

  • £299 vs £899 6000 PRO — same model fits, ~3× the throughput on the 6000 PRO.
  • £299 vs £578 for 2× 4090 — same memory, no tensor parallel hassle on the 395.
  • £299 vs £199 R9700 — 4× the memory, ~50% premium.

Pick by bottleneck: if it’s throughput, pay for the 6000 PRO. If it’s model-fit on a budget, the MAX+ 395 is the right call.

Frequently Asked Questions

The questions buyers actually ask before committing to a Strix Halo box.

Can I really run Llama 3.3 70B on this?

Yes. At FP8 the weights take ~70 GB, leaving roughly 50 GB for KV cache, OS, and headroom. We have customers running 70B models with 32K context windows on this box without issue. Single-stream tok/s is lower than a 5090 — expect roughly 8–15 tok/s for a 70B FP8 — but it fits on one host with no model splitting.

Is this faster than a 4090 or 5090?

For models that fit on a 4090/5090, no — those discrete GPUs have meaningfully more raw compute and bandwidth. The MAX+ 395 wins where the model doesn’t fit elsewhere. Pick by bottleneck: throughput vs model-fit.

Does ROCm actually work on the iGPU?

Yes. The Radeon 8060S is RDNA3.5 and supported by ROCm. Most teams use ROCm + PyTorch directly. The XDNA2 NPU is accessed separately via the Ryzen AI SDK and Hugging Face Optimum-AMD — it’s a different code path but mature for INT8 inference.
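
Because ROCm presents the iGPU as the familiar cuda device, the standard Hugging Face loading code runs unchanged. A minimal sketch with a mid-size model in BF16 (the iGPU-friendly precision discussed above); the checkpoint name is illustrative.

```python
# Sketch: load a mid-size model in BF16 on the iGPU via plain transformers + ROCm PyTorch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"    # ~28 GB in BF16, per the fit table
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"  # ROCm maps the iGPU to "cuda"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain unified memory in one paragraph."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```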

What about FP8?

The iGPU doesn’t have native FP8 hardware yet — quantisation is software-emulated through ROCm. You still get the memory savings (which is the point on a 70B model), but you don’t get the 2× FP8 speedup that Blackwell tensor cores give you. INT8 on the NPU is hardware-accelerated.

Can I fine-tune on this?

QLoRA on 13B–34B models works fine in the unified memory pool. Full SFT on a 70B isn’t realistic — you’d want a 6000 PRO or a multi-GPU box for that. The MAX+ 395 is an inference-first machine.
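
A QLoRA run here looks like the usual PEFT recipe; the 4-bit load leans on a bitsandbytes build with ROCm support, which is worth verifying on your stack before a long training job. A minimal sketch with an illustrative base model and hyperparameters.

```python
# Sketch: QLoRA on a 13B-34B class model inside the unified memory pool.
# Assumes a bitsandbytes build with ROCm support; verify before a long run.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "codellama/CodeLlama-34b-Instruct-hf"   # illustrative 34B base

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a fraction of a percent of the base weights
```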

How does it compare to the R9700?

The Radeon AI Pro R9700 is a discrete GPU with 32 GB GDDR6 — faster on workloads that fit in 32 GB, but it can’t load a 70B model. The MAX+ 395 has 4× the memory at a ~50% premium. Pick by model size.

Power draw at 100% load?

120 W for the whole SoC. That’s roughly a third of a 5080’s 360 W and a fifth of a 6000 PRO’s 600 W. The TDP advantage is real if your hosting bill includes power.

Same-day deployment?

Yes for in-stock SKUs. Strix Halo supply is tighter than mainstream GPUs — out-of-stock lead time is 3–5 working days.

Need 70B-class fit on a budget? This is the box.

128 GB of unified LPDDR5X, Zen 5 + RDNA3.5 + XDNA2 on one die, 120 W total. The cheapest single-host 70B FP8 option in the catalogue. From £299/mo with 3–5 day deployment.

Have a question? Need help?