## The Role of System RAM in AI Inference
System RAM (DDR4/DDR5) plays a different role from GPU VRAM in AI inference. While VRAM holds active model weights and computation buffers, system RAM handles model loading, preprocessing, API request queues, and CPU offloading. On a dedicated GPU server, insufficient RAM can bottleneck even a powerful GPU. Understanding how much system RAM you need prevents out-of-memory errors and ensures smooth multi-model deployments.
## RAM Requirements by Workload
| Workload | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|
| Single 7B LLM (vLLM) | 16 GB | 32 GB | Model loaded to GPU, RAM for OS + serving |
| Single 7B LLM (llama.cpp CPU offload) | 32 GB | 64 GB | Partial model weights in RAM |
| Image generation (SDXL/Flux) | 16 GB | 32 GB | Model loading + image buffers |
| RAG pipeline + ChromaDB | 32 GB | 64 GB | Vector index resides in RAM |
| Multi-model serving (2-3 models) | 32 GB | 64 GB | Model offloading between GPU and RAM |
| Large model offloading (70B+) | 64 GB | 128 GB | Most weights in RAM, layers streamed to GPU |
A general rule is to provision at least 2x your model's FP16 weight size in system RAM. This covers the loading process, which temporarily holds the weights read from disk alongside the copy being transferred to VRAM, plus OS and serving framework overhead.
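The rule of thumb above can be turned into a quick estimate. This is a minimal sketch; the 4 GB OS/serving overhead and the 2-bytes-per-parameter FP16 assumption are our own defaults, not figures from a specific framework:

```python
def min_system_ram_gb(params_billion: float, bytes_per_param: int = 2,
                      overhead_gb: float = 4.0) -> float:
    """Rule-of-thumb system RAM: 2x the FP16 weight size plus an
    assumed OS/serving overhead. 1B params at FP16 is ~2 GB."""
    weights_gb = params_billion * bytes_per_param
    return 2 * weights_gb + overhead_gb

# A 7B model in FP16 weighs ~14 GB, so ~32 GB by this rule,
# matching the "Recommended RAM" column in the table above.
print(min_system_ram_gb(7))  # 32.0
```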
## Model Loading and CPU Offloading
When a model loads, the serving framework reads weights from disk into system RAM, then transfers them to VRAM. This means system RAM must be large enough to hold the full model temporarily. With CPU offloading (common in llama.cpp and Hugging Face Accelerate), some model layers remain permanently in system RAM and are streamed to GPU layer by layer during inference.
For example, running a 70B model with llama.cpp on a 24 GB GPU using CPU offloading requires approximately 100 GB of system RAM to hold the offloaded layers. See our LLaMA 3 VRAM requirements for detailed sizing and the vLLM vs Ollama guide for framework-specific RAM usage.
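You can estimate the offloaded portion by assuming weights are spread roughly evenly across transformer layers. A sketch, where the 8-bit quantisation (1 byte per parameter) and the 20-layers-on-GPU split are illustrative assumptions rather than measured values:

```python
def offload_ram_gb(total_params_b: float, n_layers: int,
                   gpu_layers: int, bytes_per_param: int = 1) -> float:
    """Estimate system RAM (GB) for the layers kept on the CPU,
    assuming weights are distributed evenly across layers."""
    cpu_fraction = (n_layers - gpu_layers) / n_layers
    return total_params_b * bytes_per_param * cpu_fraction

# 70B model (80 layers) at 8-bit with 20 layers fitting on a 24 GB GPU:
# roughly 70 GB x 0.75 = ~52.5 GB of weights resident in system RAM,
# before loading overhead and KV-cache buffers.
print(round(offload_ram_gb(70, 80, 20), 1))  # 52.5
```

At FP16 (2 bytes per parameter) the same split roughly doubles, which is why the ~100 GB figure above is a sensible planning number.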
## RAM for RAG and Vector Databases
Vector databases like ChromaDB and FAISS store their indices in system RAM. The RAM required scales with the number of vectors and their dimensionality:
| Document Count | Embedding Dimensions | Index RAM (approximate) |
|---|---|---|
| 100K documents | 768 | ~1 GB |
| 1M documents | 768 | ~8 GB |
| 10M documents | 768 | ~80 GB |
| 1M documents | 1536 | ~15 GB |
For most RAG deployments with under 1M documents, 32 GB of system RAM is sufficient. Large-scale deployments with millions of documents may require 64-128 GB. See our ChromaDB + LLM VRAM for RAG guide for the full pipeline analysis.
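The figures in the table above follow from vectors x dimensions x 4 bytes (float32), multiplied by an index overhead factor for graph links and metadata. The 2.8x factor below is our own fit to the table's numbers, not a published ChromaDB or FAISS constant:

```python
def index_ram_gb(n_vectors: int, dims: int, overhead: float = 2.8) -> float:
    """Approximate RAM for a float32 vector index. `overhead` is an
    assumed multiplier for HNSW-style graph links and metadata."""
    raw_bytes = n_vectors * dims * 4  # float32 = 4 bytes per dimension
    return raw_bytes * overhead / 1024**3

# 1M documents at 768 dimensions: ~2.9 GB raw vectors, ~8 GB with index overhead
print(round(index_ram_gb(1_000_000, 768), 1))  # 8.0
```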
## Sizing Recommendations
| Use Case | GPU | Recommended RAM |
|---|---|---|
| Budget single model | RTX 4060 | 16-32 GB |
| Production LLM serving | RTX 3090 | 32-64 GB |
| Multi-model pipeline | RTX 3090 | 64 GB |
| RAG with large corpus | RTX 3090 | 64-128 GB |
| CPU offloading (70B+) | Any | 128 GB+ |
The safest general recommendation is 32 GB for single-model deployments and 64 GB for production multi-model serving. RAM is relatively inexpensive compared to GPU upgrades, so over-provisioning is cost-effective.
## Next Steps
System RAM is just one piece of the infrastructure puzzle. For GPU memory sizing, see our GPU memory vs system RAM comparison. For storage planning, read how much storage for AI models. Compare GPU options with the GPU comparisons tool. Browse all infrastructure guides in the AI hosting and infrastructure section.
## Dedicated GPU Servers with Flexible RAM
Configure your dedicated GPU server with 16 GB to 128 GB+ system RAM. Optimised for AI inference workloads with UK data centre hosting.
Browse GPU Servers