VRAM vs RAM: Key Differences
GPU VRAM (Video RAM) and system RAM (DDR4/DDR5) serve fundamentally different purposes in AI inference. VRAM is high-bandwidth memory directly attached to the GPU, running at 256-1,792 GB/s depending on the card. System RAM connects through the CPU at 50-100 GB/s (DDR5) or 25-50 GB/s (DDR4). On a dedicated GPU server, both are essential, but they constrain different aspects of your AI workload.
| Property | GPU VRAM | System RAM |
|---|---|---|
| Bandwidth | 256-1,792 GB/s | 25-100 GB/s |
| Typical capacity | 6-32 GB | 16-128 GB |
| Cost per GB | High (tied to GPU price) | Low (~$3-5/GB) |
| Primary role in AI | Model weights + computation | Model loading, offloading, services |
| Upgradeable | No (fixed per GPU) | Yes (add DIMMs) |
When VRAM Matters Most
VRAM is the single most important factor for AI inference performance. It determines which models you can run, at what precision, and with what context length. Every active model weight, KV cache entry, and intermediate computation must fit in VRAM for GPU-accelerated inference.
- Model loading: A 7B FP16 model needs ~14 GB VRAM. If you only have 8 GB, you must quantise.
- Context length: Longer context consumes more VRAM for KV cache. See our LLaMA 3 VRAM requirements guide.
- Batch inference: Serving multiple concurrent users requires VRAM for each user’s KV cache.
- Image/video generation: VRAM usage scales roughly quadratically with resolution and linearly with frame count.
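The LLM-related points above can be turned into a rough arithmetic sketch: weights cost parameters times bytes per parameter, and the KV cache grows with context length and batch size. The layer count, head configuration, and cache precision below are illustrative assumptions (loosely a 7B model with grouped-query attention), not measurements, and real usage is higher once activations and framework overhead are included:

```python
def estimate_vram_gb(params_b, bytes_per_param=2,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     context_len=8192, batch_size=1, kv_bytes=2):
    """Rough VRAM estimate for transformer inference (illustrative only).

    Weights:  parameter count x bytes per parameter.
    KV cache: 2 (K and V) x layers x kv_heads x head_dim
              x context length x batch size x bytes per element.
    Excludes activations, CUDA context, and framework overhead.
    """
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * kv_bytes)
    return (weights + kv_cache) / 1e9  # decimal GB

# A 7B model in FP16: ~14 GB for weights alone, plus a KV cache
# that grows linearly with both context length and batch size.
print(round(estimate_vram_gb(7, context_len=0), 1))     # weights only: 14.0
print(round(estimate_vram_gb(7, context_len=8192), 1))  # with 8K context
```

This is why an 8 GB card forces quantisation for a 7B model, and why serving many concurrent users multiplies the cache term by the batch size.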
For GPU VRAM sizing, see our best GPU for LLM inference guide. Compare VRAM tiers in the GPU comparisons tool.
When System RAM Matters Most
System RAM becomes critical in specific scenarios:
- CPU offloading: Running models larger than VRAM (e.g., a 70B model on a 24 GB card) by keeping some layers in system RAM. Offloaded layers run at system RAM bandwidth (~50 GB/s) rather than VRAM bandwidth (~936 GB/s on an RTX 3090), so throughput drops sharply.
- RAG and vector databases: ChromaDB and FAISS indices live in system RAM, scaling with document count.
- Data preprocessing: Image resizing, audio conversion, and text tokenisation happen on CPU using system RAM.
- Model loading: Weights are staged through system RAM before transferring to GPU. Insufficient RAM causes swapping to disk.
- Multi-model serving: When swapping models between GPU and CPU, inactive models reside in system RAM.
See our RAM requirements for AI inference guide for detailed sizing by workload.
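The scenarios above can be combined into a back-of-envelope RAM budget. The per-component formulas and the 8 GB OS/services headroom are assumptions for illustration, not measured figures, and the staging term is conservative (the loading copy is usually freed after transfer):

```python
def estimate_system_ram_gb(model_size_gb, offloaded_fraction=0.0,
                           n_vectors=0, vector_dim=768,
                           os_and_services_gb=8):
    """Back-of-envelope system RAM budget (illustrative sketch).

    - Staging: full model weights pass through RAM during loading.
    - Offload: the fraction of layers kept in RAM stays resident.
    - Vector index: a flat FP32 index costs vectors x dim x 4 bytes.
    """
    staging = model_size_gb
    offload = model_size_gb * offloaded_fraction
    index = n_vectors * vector_dim * 4 / 1e9
    return staging + offload + index + os_and_services_gb

# 14 GB model plus a 1M-vector RAG index, no offloading:
print(round(estimate_system_ram_gb(14, n_vectors=1_000_000), 1))
```

Even this rough sum shows why 16 GB of RAM gets tight as soon as a vector database or a second model enters the picture.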
How VRAM and RAM Work Together
During a typical inference request, the flow is: client request arrives in system RAM, the serving framework tokenises input (CPU/RAM), tensors are transferred to VRAM for GPU computation, output tokens are generated on GPU, then results are transferred back through system RAM to the client. If VRAM is insufficient, some frameworks can offload layers to system RAM, but this trades 10-20x bandwidth for more capacity.
The key insight is that VRAM bandwidth determines inference speed (tokens per second), while system RAM capacity determines what you can load and serve. Insufficient VRAM forces quantisation or offloading. Insufficient RAM causes system-level out-of-memory errors or disk swapping.
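That bandwidth trade-off can be quantified with a simple memory-bound model: during decoding, each generated token streams every active weight once, so time per token is bytes read divided by bandwidth for each memory tier. A minimal sketch, using the approximate bandwidth figures quoted above and ignoring compute, PCIe transfers, and cache effects:

```python
def tokens_per_second(model_gb, vram_fraction,
                      vram_bw_gbs=936, ram_bw_gbs=50):
    """Memory-bandwidth-bound decode estimate (simplified model).

    Each token reads all weights once: layers in VRAM stream at
    VRAM bandwidth, offloaded layers at system RAM bandwidth.
    """
    t_vram = model_gb * vram_fraction / vram_bw_gbs
    t_ram = model_gb * (1 - vram_fraction) / ram_bw_gbs
    return 1 / (t_vram + t_ram)

full = tokens_per_second(14, 1.0)  # 14 GB model, all in VRAM
half = tokens_per_second(14, 0.5)  # half the layers offloaded
print(round(full), round(half))
```

Offloading even half the layers cuts estimated throughput by roughly an order of magnitude, because the slow tier dominates the per-token time.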
How to Balance Your Budget
| Budget Priority | Reasoning |
|---|---|
| 1. GPU VRAM (choose the right GPU) | Cannot be upgraded. Determines model capability. |
| 2. System RAM (32-64 GB) | Cheap to add. Prevents system-level failures. |
| 3. NVMe storage | Affects model loading speed. See storage guide. |
| 4. CPU | Least critical for inference. See CPU guide. |
Spend the majority of your budget on the right GPU since VRAM cannot be upgraded. Then ensure you have at least 32 GB of system RAM. Going from 32 GB to 64 GB RAM costs far less than upgrading from an RTX 4060 to an RTX 3090.
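A quick sanity check of that comparison, using the ~$3-5/GB RAM figure from the table above (GPU street prices vary too much to hardcode, so only the RAM side is computed):

```python
RAM_COST_PER_GB = 4  # midpoint of the ~$3-5/GB range cited above

# Doubling system RAM from 32 GB to 64 GB:
ram_upgrade = (64 - 32) * RAM_COST_PER_GB
print(ram_upgrade)  # 128 (dollars)
```

Roughly $128 for a RAM upgrade versus several hundred dollars of GPU price difference is why system RAM is the cheap lever to pull after the GPU choice is locked in.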
Sizing Recommendations
| Workload | GPU | VRAM | Recommended RAM |
|---|---|---|---|
| 7B LLM, single user | RTX 4060 | 8 GB | 16-32 GB |
| 7-8B LLM, production | RTX 4060 Ti | 16 GB | 32 GB |
| 13B+ LLM or image gen | RTX 3090 | 24 GB | 32-64 GB |
| RAG pipeline | RTX 3090 | 24 GB | 64 GB |
| Multi-model pipeline | RTX 3090 | 24 GB | 64-128 GB |
Use the LLM cost calculator to model your specific workload and the cheapest GPU for inference guide to minimise costs.
Find the Right GPU + RAM Configuration
GigaGPU offers dedicated GPU servers with flexible RAM configurations. Match your VRAM tier with the right amount of system memory for optimal AI performance.
Browse GPU Servers