When a model will not fit on your dedicated GPU, you have two cheap alternatives before buying more VRAM: CPU RAM offload and disk (NVMe) offload. They have similar-sounding descriptions and dramatically different performance profiles.
CPU RAM Offload
Layers live in system RAM and are streamed to GPU VRAM as inference runs. Frameworks like llama.cpp, Hugging Face Accelerate, and DeepSpeed support this. The bottleneck is PCIe bandwidth – roughly 30-50 GB/s on a modern dedicated server.
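A minimal sketch of the placement decision these frameworks automate: fill VRAM first, spill the remainder to the next tier. The function and its greedy policy are illustrative only, not any framework's actual API.

```python
# Greedy layer placement: keep layers on the GPU until the VRAM budget
# is exhausted, then assign the rest to CPU RAM. (Illustrative sketch,
# not llama.cpp's or Accelerate's real algorithm.)

def place_layers(layer_bytes, vram_budget):
    placement, used = [], 0
    for size in layer_bytes:
        if used + size <= vram_budget:
            used += size
            placement.append("gpu")
        else:
            placement.append("cpu")
    return placement

# Eight equal 4 GB layers into a 24 GB card: six fit on the GPU,
# the last two spill to CPU RAM.
print(place_layers([4e9] * 8, 24e9))
```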
Throughput for a 70B Q4 model with half its layers offloaded to CPU RAM: typically 3-6 tokens/sec.
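A rough way to see why offload is bandwidth-bound: every generated token must touch every weight, so the offloaded bytes cross the bus once per token. The model size and link speed below are illustrative assumptions; real frameworks overlap compute with transfers, and some run offloaded layers on the CPU instead of streaming them, which shifts the bound.

```python
# Back-of-envelope ceiling on offloaded inference speed.
# Illustrative numbers, not a benchmark.

def transfer_bound_tps(model_bytes: float, offload_fraction: float,
                       link_gbps: float) -> float:
    """Tokens/sec ceiling imposed by moving the offloaded weights
    across a link of the given bandwidth (GB/s) once per token."""
    moved_per_token = model_bytes * offload_fraction
    return (link_gbps * 1e9) / moved_per_token

q4_70b = 35e9   # ~70B params at ~4 bits/weight (assumption)

# Half the model streamed over an assumed 40 GB/s PCIe link:
print(round(transfer_bound_tps(q4_70b, 0.5, 40), 1))  # -> 2.3
```

At these assumed numbers the transfer ceiling alone is a couple of tokens per second; overlap and CPU-side compute are what push real systems into the 3-6 t/s range.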
NVMe Disk Offload
Layers live on an NVMe drive. Each time an offloaded layer is needed, it is read from disk, staged through CPU RAM, copied to GPU VRAM, and executed. Enterprise NVMe delivers roughly 7 GB/s sequential reads – an order of magnitude slower than RAM. DeepSpeed ZeRO-Infinity supports this pattern for training; it is rare in serving.
Throughput for the same 70B Q4 model disk-offloaded: typically 0.5-1.5 tokens/sec. Below useful conversational speed.
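The same per-token transfer arithmetic, with the disk link as the bottleneck, shows why disk offload falls below conversational speed. The offloaded size and bandwidth are illustrative assumptions.

```python
# Tokens/sec ceiling when the offloaded weights must be re-read from
# disk for every generated token. Illustrative numbers, not a benchmark.

def disk_bound_tps(offloaded_bytes: float, disk_gbps: float) -> float:
    """Ceiling imposed by re-reading offloaded weights each token."""
    return (disk_gbps * 1e9) / offloaded_bytes

offloaded = 17.5e9  # half of a ~35 GB Q4 70B model (assumption)

# Assumed 7 GB/s sequential NVMe reads:
print(round(disk_bound_tps(offloaded, 7.0), 2))  # -> 0.4
```

Swapping a ~40 GB/s RAM path for a ~7 GB/s disk path cuts the ceiling by roughly 6x, which is why disk offload sits under 1 t/s while CPU offload stays in the low single digits.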
Head to Head
| Metric | CPU RAM Offload | NVMe Disk Offload |
|---|---|---|
| Typical speed | 3-6 t/s | 0.5-1.5 t/s |
| Setup complexity | Low (one flag in most frameworks) | Higher (needs DeepSpeed or similar) |
| RAM required | Model size + OS | Minimal RAM |
| Good for | Experimentation, batch jobs | Training only, rarely serving |
| Common frameworks | llama.cpp, Accelerate | DeepSpeed ZeRO-Infinity |
Skip Offload – Fit on the GPU
Our team sizes servers to your model so offload is never the right answer.
When Either Is Acceptable
CPU offload: fine for one-off experimentation, occasional batch summaries, research exploration. Do not productionise it.
Disk offload: essentially only useful during training of models too large for CPU RAM. For serving, disk offload is too slow to be a solution – it is an anti-solution.
A practical rule: if the monthly cost of a bigger GPU is less than the value of the user-hours lost to offload latency, upgrade. See CPU-GPU offload strategy for 70B for the deeper speed comparison and break-even vs OpenAI on the 5090 for the budget math.
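The rule above can be sketched as a one-line comparison. Every dollar figure and rate here is a placeholder assumption; substitute your own numbers.

```python
# Hedged sketch of the upgrade break-even rule. All inputs are
# placeholder assumptions, not recommended values.

def upgrade_pays_off(gpu_upgrade_monthly_usd: float,
                     user_hour_value_usd: float,
                     user_hours_lost_monthly: float) -> bool:
    """True when the bigger GPU costs less per month than the value
    of the user time lost to offload latency."""
    return gpu_upgrade_monthly_usd < user_hour_value_usd * user_hours_lost_monthly

# e.g. a $400/month GPU step-up vs 20 user-hours/month valued at $50/hour:
print(upgrade_pays_off(400, 50, 20))  # -> True
```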