
How Much RAM Do You Need for AI Inference?

Guide to system RAM requirements for AI inference workloads. Covers RAM needs for LLMs, image generation, RAG pipelines, and multi-model serving with sizing recommendations.

The Role of System RAM in AI Inference

System RAM (DDR4/DDR5) plays a different role from GPU VRAM in AI inference. While VRAM holds active model weights and computation buffers, system RAM handles model loading, preprocessing, API request queues, and CPU offloading. On a dedicated GPU server, insufficient RAM can bottleneck even a powerful GPU. Understanding how much system RAM you need prevents out-of-memory errors and ensures smooth multi-model deployments.

RAM Requirements by Workload

| Workload | Minimum RAM | Recommended RAM | Notes |
| --- | --- | --- | --- |
| Single 7B LLM (vLLM) | 16 GB | 32 GB | Model loaded to GPU; RAM for OS + serving |
| Single 7B LLM (llama.cpp CPU offload) | 32 GB | 64 GB | Partial model weights in RAM |
| Image generation (SDXL/Flux) | 16 GB | 32 GB | Model loading + image buffers |
| RAG pipeline + ChromaDB | 32 GB | 64 GB | Vector index resides in RAM |
| Multi-model serving (2-3 models) | 32 GB | 64 GB | Model offloading between GPU and RAM |
| Large model offloading (70B+) | 64 GB | 128 GB | Most weights in RAM, layers streamed to GPU |

A general rule is to have at least 2x your model’s FP16 weight size in system RAM. This covers the loading process (which temporarily holds the full weights in RAM while copying them to the GPU) plus OS and serving-framework overhead.
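As a quick sketch, the 2x rule can be expressed as a small helper. The 8 GB OS/serving overhead default here is our illustrative assumption, not a benchmarked figure:

```python
def min_system_ram_gb(params_billion: float, overhead_gb: float = 8.0) -> float:
    """Estimate minimum system RAM for loading a model onto the GPU.

    Applies the rule of thumb above: at least 2x the FP16 weight size,
    plus OS and serving-framework overhead (the 8 GB default is an
    assumption for illustration).
    """
    fp16_weights_gb = params_billion * 2  # FP16 = 2 bytes per parameter
    return 2 * fp16_weights_gb + overhead_gb

# A 7B model has ~14 GB of FP16 weights, so the rule suggests ~36 GB:
print(min_system_ram_gb(7))  # 36.0
```

This lines up with the table above: a 7B deployment sits comfortably in the 32-64 GB range once overhead is counted.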

Model Loading and CPU Offloading

When a model loads, the serving framework reads weights from disk into system RAM, then transfers them to VRAM. This means system RAM must be large enough to hold the full model temporarily. With CPU offloading (common in llama.cpp and Hugging Face Accelerate), some model layers remain permanently in system RAM and are streamed to GPU layer by layer during inference.

For example, running a 70B model with llama.cpp on a 24 GB GPU using CPU offloading requires approximately 100 GB of system RAM to hold the offloaded layers. See our LLaMA 3 VRAM requirements guide for detailed sizing and the vLLM vs Ollama guide for framework-specific RAM usage.
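A back-of-envelope way to reason about the split is to divide the weights evenly across layers and see how many fit in VRAM. This is a sketch, not a measurement: the 4 GB VRAM reserve for KV cache and activations is our assumption, and actual requirements vary with quantization and framework:

```python
def offload_split(total_weights_gb: float, n_layers: int,
                  vram_gb: float, vram_reserve_gb: float = 4.0):
    """Estimate how many layers fit in VRAM and how much system RAM
    the remaining (offloaded) layers need.

    Assumes uniform layer size; vram_reserve_gb leaves headroom for
    the KV cache and activation buffers (an illustrative figure).
    """
    per_layer_gb = total_weights_gb / n_layers
    gpu_layers = int((vram_gb - vram_reserve_gb) / per_layer_gb)
    gpu_layers = max(0, min(gpu_layers, n_layers))
    ram_gb = (n_layers - gpu_layers) * per_layer_gb
    return gpu_layers, ram_gb

# 70B model in FP16 (~140 GB of weights across 80 layers) on a 24 GB GPU:
layers_on_gpu, ram_needed = offload_split(140, 80, 24)
print(layers_on_gpu, round(ram_needed))  # 11 layers on GPU, ~121 GB in RAM
```

In practice quantization (e.g. 4-bit) shrinks both sides of the split substantially, which is why real-world figures like the ~100 GB above land below the raw FP16 estimate.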

RAM for RAG and Vector Databases

Vector databases like ChromaDB and FAISS store their indices in system RAM. The RAM required scales with the number of vectors and their dimensionality:

| Document Count | Embedding Dimensions | Index RAM (approximate) |
| --- | --- | --- |
| 100K documents | 768 | ~1 GB |
| 1M documents | 768 | ~8 GB |
| 10M documents | 768 | ~80 GB |
| 1M documents | 1536 | ~15 GB |

For most RAG deployments with under 1M documents, 32 GB of system RAM is sufficient. Large-scale deployments with millions of documents may require 64-128 GB. See our ChromaDB + LLM VRAM for RAG guide for the full pipeline analysis.
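The scaling in the table above can be estimated from raw float32 storage (4 bytes per dimension per vector) times an overhead factor for index structures and metadata. The 2.5x overhead used here is our rough assumption to match the table's ballpark figures; real overhead varies by database and index type:

```python
def index_ram_gb(n_vectors: int, dim: int,
                 bytes_per_float: int = 4, overhead: float = 2.5) -> float:
    """Estimate vector index RAM: raw float32 storage times an assumed
    overhead factor for index structures and metadata.

    The 2.5x overhead is an illustrative assumption; actual overhead
    depends on the database and index type.
    """
    raw_gb = n_vectors * dim * bytes_per_float / 1e9
    return raw_gb * overhead

# 1M documents at 768 dimensions: ~3 GB raw, roughly 8 GB with overhead
print(round(index_ram_gb(1_000_000, 768), 1))  # 7.7
```

Doubling the embedding dimension (768 to 1536) doubles the raw storage, which matches the jump from ~8 GB to ~15 GB in the table.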

Sizing Recommendations

| Use Case | GPU | Recommended RAM |
| --- | --- | --- |
| Budget single model | RTX 4060 | 16-32 GB |
| Production LLM serving | RTX 3090 | 32-64 GB |
| Multi-model pipeline | RTX 3090 | 64 GB |
| RAG with large corpus | RTX 3090 | 64-128 GB |
| CPU offloading (70B+) | Any | 128 GB+ |

The safest general recommendation is 32 GB for single-model deployments and 64 GB for production multi-model serving. RAM is relatively inexpensive compared to GPU upgrades, so over-provisioning is cost-effective.

Next Steps

System RAM is just one piece of the infrastructure puzzle. For GPU memory sizing, see our GPU memory vs system RAM comparison. For storage planning, read how much storage for AI models. Compare GPU options with the GPU comparisons tool. Browse all infrastructure guides in the AI hosting and infrastructure section.

Dedicated GPU Servers with Flexible RAM

Configure your dedicated GPU server with 16 GB to 128 GB+ system RAM. Optimised for AI inference workloads with UK data centre hosting.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
