
CPU-GPU Offload Strategy for 70B Models

When VRAM is tight, CPU offload lets you run models that would not otherwise fit. The cost is speed - here is how much.

If a 70B model is just beyond your VRAM budget on dedicated GPU hosting, CPU offload is tempting. You put some layers on the GPU and the rest on system RAM, swapping in and out as inference progresses. It works. It is also dramatically slower than keeping everything on the GPU. Here is when it is worth it.

How It Works

llama.cpp and a handful of other frameworks let you keep some transformer layers in system RAM. What happens next depends on the framework: llama.cpp runs the CPU-resident layers on the CPU itself, while other stacks stream those weights over PCIe to the GPU on each forward pass. Either way the bottleneck is the same class of problem – system RAM is an order of magnitude slower than VRAM, and PCIe transfers add latency on top.
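In llama.cpp the split is a single flag. A minimal sketch – the model filename and prompt are illustrative; `-ngl` (`--n-gpu-layers`) sets how many layers go to VRAM, and everything above that count stays in system RAM:

```shell
# Put 30 of the model's 80 transformer layers in VRAM;
# the other 50 stay in system RAM and run on the CPU.
./llama-cli -m llama-3-70b-instruct.Q4_K_M.gguf \
    -ngl 30 \
    -p "Summarise the attached report." \
    -n 256
```

In practice you raise `-ngl` until the model stops fitting, then back off a layer or two to leave room for the KV cache.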

The Speed Cost

On a 24 GB RTX 3090 trying to run Llama 3 70B Q4:

  • Full GPU fit (not possible here): would be ~40 tokens/sec
  • 50% GPU / 50% CPU offload: ~4-6 tokens/sec
  • 20% GPU / 80% CPU: ~2-3 tokens/sec

Offload is roughly 10x slower than full GPU residency. Not a marginal penalty – a qualitative one: generation falls below comfortable reading speed, and interactive use stops feeling interactive.
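The gap is mostly a memory-bandwidth story, and a back-of-envelope model reproduces its shape. The figures below are ballpark assumptions, not measurements – ~40 GB of Q4 weights, ~900 GB/s of VRAM bandwidth for a 3090, ~55 GB/s for dual-channel system RAM – and compute plus PCIe overheads are ignored:

```python
# Back-of-envelope: per decoded token, every weight is read once
# from whichever memory holds it, so bandwidth sets the floor.

WEIGHTS_GB = 40.0   # Llama 3 70B at Q4 -- approximate
VRAM_BW = 900.0     # GB/s, RTX 3090 ballpark
RAM_BW = 55.0       # GB/s, dual-channel system RAM ballpark

def tokens_per_sec(gpu_fraction: float) -> float:
    """Estimated decode speed with gpu_fraction of weights in VRAM."""
    gpu_time = WEIGHTS_GB * gpu_fraction / VRAM_BW        # seconds/token
    cpu_time = WEIGHTS_GB * (1 - gpu_fraction) / RAM_BW   # seconds/token
    return 1.0 / (gpu_time + cpu_time)

for frac in (1.0, 0.5, 0.2):
    print(f"{frac:.0%} on GPU: ~{tokens_per_sec(frac):.1f} tokens/sec")
```

The absolute numbers from a model this crude are loose, but the order-of-magnitude cliff falls straight out of the ratio between the two bandwidths.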

When to Use It

Offload is fine for these cases:

  • Experimentation. You want to try a 70B model for an afternoon without buying a bigger server.
  • Batch jobs with no latency target. Overnight document summarisation where 3 tokens/sec is acceptable.
  • Very rare queries – a tool that fires once a week and tolerates slow response.

Offload is wrong for:

  • Any interactive end-user workload.
  • Production APIs with SLAs.
  • High-concurrency serving – offload kills throughput further because multiple requests compete for the PCIe bus.

Full GPU Fit For Real Workloads

We size servers so your model lives entirely on the GPU – no offload tax.

Browse GPU Servers

Better Alternatives

First: pick a more aggressive quantisation. A 70B at IQ2_XS (~21 GB) fits a 24 GB card outright, and Q3-class quants (Q3_K_M, IQ3_XS) need only a fraction of their layers offloaded. Quality degrades, but less than you might expect, and speed stays native or close to it.

Second: step up to a single RTX 5090 (32 GB), which holds a 70B at an IQ3-class quant entirely in VRAM, with modest context.

Third: step up to the RTX 6000 Pro (96 GB) and run 70B at INT8 or even FP16 for a distilled variant.

Fourth: run two 24 GB cards in tensor parallel. See tensor vs pipeline parallelism.
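A quick way to weigh these options is the bits-per-weight rule of thumb: file size ≈ parameters × bits-per-weight ÷ 8. The bpw values below are approximate effective figures for GGUF quant types, not exact file sizes:

```python
# Approximate on-disk / in-memory size of a 70B model per quant level.
PARAMS = 70e9
BPW = {            # approximate effective bits per weight (GGUF)
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.8,
    "Q3_K_M":  3.9,
    "IQ3_XS":  3.3,
    "IQ2_XS":  2.4,
}

for name, bpw in BPW.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:>7}: ~{gb:5.1f} GB")
```

Add several GB on top for KV cache and activations before deciding whether a given quant fits a given card.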

For the specific Ryzen AI Max+ 395 case where “unified memory” blurs this line, see unified memory vs dedicated VRAM.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
