If a 70B model is just beyond your VRAM budget on dedicated GPU hosting, CPU offload is tempting. You put some layers on the GPU and the rest on system RAM, swapping in and out as inference progresses. It works. It is also dramatically slower than keeping everything on the GPU. Here is when it is worth it.
How It Works
llama.cpp and a handful of other frameworks support placing some transformer layers on CPU memory. For each forward pass, weights for CPU-side layers are swapped to the GPU, math runs, then the next set comes in. The bottleneck is PCIe bandwidth for the weight transfer plus the much slower CPU RAM if any math runs on CPU.
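The PCIe cost is easy to bound. A rough sketch, with every figure an illustrative assumption: if the CPU-resident half of a 70B Q4 model (~20 GB of a ~40 GB file) is streamed across a PCIe 4.0 x16 link (~25 GB/s usable) on every token, the transfer alone sets a hard ceiling on throughput.

```python
# Back-of-envelope: cost of re-streaming offloaded weights over PCIe each token.
# All numbers are illustrative assumptions, not measurements.
model_bytes = 40e9        # ~40 GB: 70B at ~4.5 bits/weight
offload_fraction = 0.5    # half the layers live in system RAM
pcie_bw = 25e9            # usable PCIe 4.0 x16 bandwidth, ~25 GB/s

transfer_s = model_bytes * offload_fraction / pcie_bw
print(f"{transfer_s:.2f} s per token just for weight transfer")  # 0.80 s
print(f"hard cap: {1 / transfer_s:.2f} tokens/sec")              # 1.25 tokens/sec
```

Real backends do better than this worst case by keeping CPU-resident layers computing on the CPU rather than re-streaming everything, but the arithmetic shows why the bus, not the GPU, sets the pace.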
The Speed Cost
On a 24 GB RTX 3090 trying to run Llama 3 70B Q4:
- Full GPU fit (not possible here): would be ~40 tokens/sec
- 50% GPU / 50% CPU offload: ~4-6 tokens/sec
- 20% GPU / 80% CPU: ~2-3 tokens/sec
Offload is roughly 10x slower than a full GPU fit. That is not a marginal penalty but a qualitative one: output speed drops below conversational fluency.
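The 10x gap falls out of memory bandwidth alone. A rough bandwidth-bound model, assuming each generated token reads every weight once (all figures are assumptions for a 3090-class machine; absolute numbers vary with quant, DDR generation, and backend, but the ratio is the point):

```python
# Memory-bandwidth-bound estimate of decode speed with partial CPU offload.
# Assumed figures: ~40 GB model, 936 GB/s VRAM (RTX 3090), ~50 GB/s system RAM.
model_gb = 40.0
vram_bw = 936.0     # GB/s
sysram_bw = 50.0    # GB/s

def tokens_per_sec(gpu_fraction):
    # Per-token time: GPU-side weights read at VRAM speed,
    # CPU-side weights read at (much slower) system RAM speed.
    t = (model_gb * gpu_fraction) / vram_bw \
        + (model_gb * (1 - gpu_fraction)) / sysram_bw
    return 1.0 / t

print(f"full GPU fit : {tokens_per_sec(1.0):.0f} tok/s")  # 23 tok/s
print(f"50/50 offload: {tokens_per_sec(0.5):.1f} tok/s")  # 2.4 tok/s
print(f"20/80 offload: {tokens_per_sec(0.2):.1f} tok/s")  # 1.5 tok/s
```

The model underestimates the optimistic full-GPU figure quoted above, but it reproduces the order-of-magnitude collapse: system RAM is roughly 20x slower than VRAM, and the slowest layers dominate.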
When to Use It
Offload is fine for these cases:
- Experimentation. You want to try a 70B model for an afternoon without buying a bigger server.
- Batch jobs with no latency target. Overnight document summarisation where 3 tokens/sec is acceptable.
- Very rare queries. A tool that fires once a week and tolerates a slow response.
Offload is wrong for:
- Any interactive end-user workload.
- Production APIs with SLAs.
- High-concurrency serving – offload kills throughput further because multiple requests compete for the PCIe bus.
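The concurrency point is worth quantifying. A sketch under stated assumptions (~25 GB/s usable PCIe 4.0 x16, ~20 GB of CPU-resident weights per request, each request streaming its weights every token): unlike full-GPU serving, where batching amortizes weight reads across requests, offloaded requests simply divide the bus.

```python
# Sketch: N concurrent offloaded requests sharing one PCIe link.
# Assumption: each request streams its CPU-resident weights per token.
pcie_bw_gbps = 25.0   # usable PCIe 4.0 x16 bandwidth, GB/s
offloaded_gb = 20.0   # CPU-resident weights per request

for n in (1, 2, 4, 8):
    # Each request gets an equal share of the bus.
    per_request = (pcie_bw_gbps / n) / offloaded_gb  # tok/s per request
    print(f"{n} concurrent: {per_request:.2f} tok/s each")
```

Aggregate throughput stays flat while per-request latency multiplies, which is the opposite of what batched full-GPU inference gives you.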
Full GPU Fit For Real Workloads
We size servers so your model lives entirely on the GPU – no offload tax.
Better Alternatives
First: pick a more aggressive quantisation. A 70B at IQ2_XS or IQ2_M (roughly 20-23 GB) squeezes onto a 24 GB card; Q3_K_M (~34 GB) and even IQ3_XS (~29 GB) are still too large to fit fully. Quality degrades, noticeably at 2-bit, but speed stays native.
Second: step up to a single RTX 5090 (32 GB), which holds a 70B at IQ3-class quants (~29 GB) entirely in VRAM; Q4 (~42 GB) still does not fit on one card.
Third: step up to the RTX 6000 Pro (96 GB) and run 70B at INT8 or even FP16 for a distilled variant.
Fourth: run two 24 GB cards in tensor parallel. See tensor vs pipeline parallelism.
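The sizing behind all four options is simple arithmetic. A sketch with approximate bits-per-weight figures (assumptions; real GGUF files add metadata, and you still need headroom for KV cache and activations):

```python
# Rough weight-file sizes for a 70B model at common quant levels.
# Bits-per-weight values are approximate assumptions.
PARAMS = 70e9

quants = {
    "IQ2_XS": 2.4,   # ~21 GB: fits a 24 GB card (barely)
    "IQ3_XS": 3.3,   # ~29 GB: fits 32 GB (RTX 5090)
    "Q3_K_M": 3.9,   # ~34 GB
    "Q4_K_M": 4.8,   # ~42 GB: fits 2x24 GB in tensor parallel
    "INT8":   8.0,   # ~70 GB: fits 96 GB (RTX 6000 Pro)
}

for name, bpw in quants.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:7s} ~{gb:5.1f} GB")
```

Divide bits per weight by 8, multiply by parameter count, and compare against total VRAM: that one line decides which of the four alternatives applies.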
For the specific Ryzen AI Max+ 395 case where “unified memory” blurs this line, see unified memory vs dedicated VRAM.