The OOM Killer Just Terminated Your Inference Server
Your model loads a 14 GB checkpoint into system RAM before transferring to GPU. During loading, PyTorch needs the model in CPU memory plus working buffers, and your 32 GB server runs out. The Linux OOM killer steps in and terminates the process with the highest OOM score, which is usually the biggest memory consumer: your inference server. Proper swap configuration on your GPU server prevents this by providing overflow memory for transient spikes, without degrading steady-state GPU performance.
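Before tuning anything, it helps to confirm the OOM killer (and not a plain crash) took your process down; the kernel logs every kill. A minimal sketch, run here against a saved excerpt; on a live box pipe `dmesg` or `journalctl -k` in instead. The function name and sample log path are mine:

```shell
# Extract the names of processes the OOM killer terminated from a kernel log.
# The log file and sample line below are illustrative; real entries come
# from `dmesg` / `journalctl -k`.
oom_victims() {
  grep 'Out of memory: Killed process' "$1" | sed -E 's/.*\(([^)]+)\).*/\1/'
}

# Sample excerpt shaped like a recent kernel's OOM message
cat > /tmp/kern-sample.log <<'EOF'
[12345.678901] Out of memory: Killed process 4321 (python3) total-vm:14680064kB, anon-rss:13631488kB
EOF

oom_victims /tmp/kern-sample.log   # prints: python3
```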
Sizing Swap for AI Workloads
AI workloads have different swap requirements than traditional servers:
# Check current memory and swap
free -h
#                total   used    free  shared  buff/cache  available
# Mem:            62Gi   45Gi   2.1Gi   128Mi        15Gi       16Gi
# Swap:             0B     0B      0B
# Rule of thumb for AI inference servers:
# - System RAM < 64 GB: swap = RAM (e.g., 64 GB swap for 64 GB RAM)
# - System RAM 64-256 GB: swap = 0.5x RAM
# - System RAM > 256 GB: swap = 32-64 GB (overflow only)
# Model loading is the peak — a 70B model in FP16 needs ~140 GB
# If you have 128 GB RAM, you need swap to handle the loading spike
# Once on GPU, CPU memory is released
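The tiers above can be folded into a quick helper; the thresholds are this article's rule of thumb, not a kernel requirement, and the function name is mine:

```shell
# Recommend a swap size (GB) from installed RAM (GB), per the tiers above.
recommend_swap_gb() {
  local ram_gb=$1
  if [ "$ram_gb" -lt 64 ]; then
    echo "$ram_gb"            # small hosts: swap = RAM
  elif [ "$ram_gb" -le 256 ]; then
    echo $(( ram_gb / 2 ))    # mid-size hosts: swap = 0.5x RAM
  else
    echo 64                   # large hosts: overflow headroom only
  fi
}

recommend_swap_gb 32    # 32
recommend_swap_gb 128   # 64
recommend_swap_gb 512   # 64
```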
# Create swap file (NVMe-backed for speed)
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Verify
swapon --show
# NAME      TYPE  SIZE  USED  PRIO
# /swapfile file   64G    0B    -2
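One caveat worth knowing: on some filesystems swapon rejects a fallocate'd file (btrfs in particular needs extra steps), in which case writing the file out with dd is the safe fallback. A sketch at the same 64 GB size as above:

```shell
# Fallback when a fallocate'd swap file is rejected: write zeros explicitly.
# Slower to create, but works on any filesystem that supports swap files.
sudo dd if=/dev/zero of=/swapfile bs=1M count=65536 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```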
Tuning vm.swappiness for GPU Workloads
The swappiness parameter controls how aggressively Linux moves pages to swap:
# Check current swappiness (default is 60)
cat /proc/sys/vm/swappiness
# For AI inference: set low swappiness
# Swap should only be used as emergency overflow, not proactively
# swappiness=10: reclaim page cache first; swap only under real memory pressure
# swappiness=1: swap the bare minimum short of disabling it
# swappiness=0: avoid swap until absolutely forced; note this does NOT
#               disable swap (use swapoff for that) and is NOT recommended
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
# Why low swappiness matters for AI:
# - Model weights in RAM should NEVER be swapped (kills inference speed)
# - Data loader prefetch buffers should stay in RAM
# - Only inactive pages (old logs, unused libraries) should swap
# - High swappiness causes random latency spikes during inference
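One way to verify the policy is working is to check how much of the inference process itself has been swapped out; the kernel reports this per process as VmSwap in /proc/<pid>/status. A small sketch (the function name is mine):

```shell
# Report how many KB of a process's memory currently sit in swap.
proc_swap_kb() {
  awk '/^VmSwap:/ {print $2}' "/proc/$1/status"
}

# Demo on the current shell; point it at your inference server's PID in
# practice. A healthy inference process should report 0.
proc_swap_kb $$
```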
# Additional memory tuning
sudo sysctl vm.vfs_cache_pressure=50 # Keep directory cache longer
sudo sysctl vm.dirty_ratio=10 # Flush dirty pages earlier
sudo sysctl vm.dirty_background_ratio=5 # Background writeback threshold
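The three sysctl calls above only last until reboot. A drop-in under /etc/sysctl.d persists them; the filename here is my choice:

```shell
# Persist the additional memory tuning across reboots
sudo tee /etc/sysctl.d/90-ai-memory.conf <<'EOF'
vm.vfs_cache_pressure = 50
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
EOF
sudo sysctl --system   # re-apply all sysctl config files now
```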
SSD-Backed vs HDD-Backed Swap
Swap on NVMe is orders of magnitude faster than on spinning disks:
# Benchmark swap device speed
sudo hdparm -t /dev/nvme0n1 # NVMe: ~3500 MB/s
sudo hdparm -t /dev/sda # HDD: ~150 MB/s
# If you must use swap, put it on the fastest available drive
# Sequential throughput is ~20x higher on NVMe, and swap traffic is mostly
# random 4K I/O, where NVMe's lack of seek penalty widens the gap further
# For multiple swap devices, set priorities
# Higher priority = used first
sudo swapon --priority=100 /dev/nvme0n1p2 # NVMe partition
sudo swapon --priority=10 /swapfile # Fallback on slower drive
# In /etc/fstab:
# /dev/nvme0n1p2 none swap sw,pri=100 0 0
# /swapfile none swap sw,pri=10 0 0
# Monitor swap I/O to detect performance issues
vmstat 1 5
# procs -----------memory---------- ---swap--
#  r  b    swpd    free              si    so
#  1  0       0   2048M               0     0   # Good: no swap activity
#  3  2   8192M    128M             450   200   # Bad: heavy swapping
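Eyeballing si/so works, but a small filter can turn a vmstat run into a pass/fail health check. A sketch; the awk field numbers assume vmstat's default column order, where si and so are fields 7 and 8 (check your header row):

```shell
# Classify vmstat output as "swapping" or "idle" from its si/so columns.
swap_state() {
  awk 'NR > 2 { s += $7 + $8 } END { print (s > 0 ? "swapping" : "idle") }'
}

# Live usage: vmstat 1 5 | swap_state
# Demo on canned output shaped like vmstat's default format:
swap_state <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 838860 131072  10240 153600  450  200   900   400 1200 2400 40 10 30 20  0
EOF
```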
Zram: Compressed In-Memory Swap
Zram creates a compressed swap device in RAM — faster than any SSD:
# Install and enable zram
sudo apt install -y zram-tools
# Configure zram (compressed RAM swap)
sudo tee /etc/default/zramswap <<'EOF'
ALGO=zstd
PERCENT=25
PRIORITY=100
EOF
sudo systemctl enable --now zramswap
# Verify zram is active
zramctl
# NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS
# /dev/zram0 zstd         15.5G  1.2G  320M  380M      16
# Zram typically compresses at ~3:1 for mixed data, so ~16 GB of RAM spent
# on zram can hold roughly 48 GB of swapped-out pages
# Latency is in the microseconds, versus tens-to-hundreds of microseconds
# for NVMe and milliseconds for HDD
# For AI servers, combine zram + NVMe swap:
# Priority 100: zram (fastest, compressed RAM)
# Priority 50: NVMe swap file (fast, for overflow)
# This gives you maximum headroom for model loading spikes
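Wiring that layering up, assuming the zram-tools config above (PRIORITY=100) and the swap file created earlier; a sketch of the commands:

```shell
# zram already claims priority 100 via /etc/default/zramswap.
# Re-add the NVMe swap file below it so it only fills once zram is full:
sudo swapoff /swapfile
sudo swapon --priority=50 /swapfile

# Persist the priority in /etc/fstab:
# /swapfile none swap sw,pri=50 0 0

# The kernel fills the highest-priority device first; confirm the order:
swapon --show --output NAME,PRIO
```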
Monitoring Swap Under AI Workloads
# Real-time swap monitoring during model load
watch -n 1 'free -h; echo "---"; swapon --show; echo "---"; vmstat 1 1 | tail -1'
# Alert if swap usage exceeds threshold
#!/bin/bash
THRESHOLD=80
USAGE=$(free | awk '/^Swap:/ {if ($2 > 0) print int($3/$2*100); else print 0}')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: Swap usage at ${USAGE}%; server may need more RAM"
fi
# Track swap usage over time
sar -S 1 60 > /var/log/swap-usage.log &
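To run the threshold check continuously, save the alert script above somewhere on PATH and schedule it; the install path and the `logger` tag here are my choices:

```shell
# Assumes the alert script above was saved as /usr/local/bin/swap-alert.sh
sudo chmod +x /usr/local/bin/swap-alert.sh

# Run every 5 minutes; WARNING lines land in syslog/journal via logger
( crontab -l 2>/dev/null; echo '*/5 * * * * /usr/local/bin/swap-alert.sh | logger -t swap-alert' ) | crontab -
```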
Proper swap configuration prevents OOM kills on your GPU server without sacrificing inference performance. For PyTorch memory management, see the PyTorch setup guide. Monitor overall GPU health with our GPU monitoring guide. Check the infrastructure section for related server configuration, tutorials for hands-on guides, and benchmarks for memory throughput data.
GPU Servers with Ample RAM
GigaGPU dedicated servers with up to 512 GB DDR5 RAM. Load the largest models without worrying about OOM.
Browse GPU Servers