
Swap Space for AI Inference

Configure swap space correctly for AI inference workloads. Covers sizing for model loading, swappiness tuning, SSD-backed swap, zram, and preventing OOM kills on GPU servers.

The OOM Killer Just Terminated Your Inference Server

Your model loads a 14 GB checkpoint into system RAM before transferring it to the GPU. During loading, PyTorch needs the full model in CPU memory plus working buffers, and your 32 GB server runs out. The Linux OOM killer steps in and terminates the process with the highest OOM score, which is usually the largest memory consumer: your inference server. Proper swap configuration on your GPU server prevents this by providing overflow capacity for transient spikes without degrading GPU performance.
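Alongside swap, you can also tell the kernel which process to sacrifice last. A minimal sketch; `$$` (the current shell) is a stand-in PID, so substitute your inference server's actual PID, and note that lowering the score below 0 requires root:

```shell
# Inspect a process's OOM score adjustment. Range: -1000 (never kill)
# to 1000 (kill first); the default is 0.
pid=$$   # stand-in PID for demonstration; use your inference server's PID
cat /proc/$pid/oom_score_adj

# Lower it so the OOM killer prefers other victims (util-linux helper):
# sudo choom -p "$pid" -n -500

# Check the kernel log for recent OOM kills:
# sudo dmesg | grep -i "out of memory"
```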

Sizing Swap for AI Workloads

AI workloads have different swap requirements than traditional servers:

# Check current memory and swap
free -h
#               total   used   free   shared  buff/cache  available
# Mem:           62Gi   45Gi   2.1Gi   128Mi      15Gi       16Gi
# Swap:            0B     0B     0B

# Rule of thumb for AI inference servers:
# - System RAM < 64 GB:  swap = RAM (e.g., 64 GB swap for 64 GB RAM)
# - System RAM 64-256 GB: swap = 0.5x RAM
# - System RAM > 256 GB:  swap = 32-64 GB (overflow only)

# Model loading is the peak — a 70B model in FP16 needs ~140 GB
# If you have 128 GB RAM, you need swap to handle the loading spike
# Once on GPU, CPU memory is released

# Create swap file (NVMe-backed for speed)
# Note: fallocate is not supported for swap files on some filesystems
# (e.g. btrfs on older kernels); if mkswap or swapon complains, fall
# back to dd: sudo dd if=/dev/zero of=/swapfile bs=1M count=65536
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify
swapon --show
# NAME      TYPE  SIZE  USED  PRIO
# /swapfile file   64G    0B    -2
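The sizing rule of thumb above can be wrapped in a small helper for provisioning scripts. A sketch; `suggest_swap_gb` is a hypothetical name, and the >256 GB tier uses the upper end of the 32-64 GB range:

```shell
# Suggest a swap size (GB) from total system RAM (GB), following the
# sizing rule of thumb above. Integer arithmetic only.
suggest_swap_gb() {
    ram_gb=$1
    if [ "$ram_gb" -lt 64 ]; then
        echo "$ram_gb"            # < 64 GB RAM: swap = RAM
    elif [ "$ram_gb" -le 256 ]; then
        echo $(( ram_gb / 2 ))    # 64-256 GB RAM: swap = 0.5x RAM
    else
        echo 64                   # > 256 GB RAM: overflow only
    fi
}

suggest_swap_gb 32    # -> 32
suggest_swap_gb 128   # -> 64
suggest_swap_gb 512   # -> 64
```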

Tuning vm.swappiness for GPU Workloads

The swappiness parameter controls how aggressively Linux moves pages to swap:

# Check current swappiness (default is 60)
cat /proc/sys/vm/swappiness

# For AI inference: set low swappiness
# Swap should only be used as emergency overflow, not proactively
# swappiness=10: only swap when RAM is nearly full
# swappiness=1:  swap only to avoid OOM (most aggressive avoidance)
# swappiness=0:  kernel avoids swapping until absolutely forced — note
#                this does NOT disable swap (use swapoff for that)

sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf

# Why low swappiness matters for AI:
# - Model weights in RAM should NEVER be swapped (kills inference speed)
# - Data loader prefetch buffers should stay in RAM
# - Only inactive pages (old logs, unused libraries) should swap
# - High swappiness causes random latency spikes during inference

# Additional memory tuning
sudo sysctl vm.vfs_cache_pressure=50     # Keep directory cache longer
sudo sysctl vm.dirty_ratio=10            # Flush dirty pages earlier
sudo sysctl vm.dirty_background_ratio=5  # Background writeback threshold
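Rather than appending to /etc/sysctl.conf piecemeal, all of the settings above can live in a single drop-in file, which is easier to version and audit. A sketch; the filename `90-ai-memory.conf` is an arbitrary choice:

```shell
# Persist all memory tuning in one sysctl drop-in (survives reboots)
sudo tee /etc/sysctl.d/90-ai-memory.conf <<'EOF'
vm.swappiness=10
vm.vfs_cache_pressure=50
vm.dirty_ratio=10
vm.dirty_background_ratio=5
EOF

# Reload every sysctl configuration file, including the new drop-in
sudo sysctl --system
```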

SSD-Backed vs HDD-Backed Swap

Swap on NVMe is orders of magnitude faster than on spinning disks:

# Benchmark sequential read speed of candidate swap devices
sudo hdparm -t /dev/nvme0n1  # NVMe: ~3500 MB/s
sudo hdparm -t /dev/sda      # HDD: ~150 MB/s

# Note: hdparm measures sequential reads, but swap traffic is mostly
# random 4K I/O, where HDDs fare far worse — the real-world gap is
# well beyond the ~20x sequential difference
# If you must use swap, put it on the fastest available drive

# For multiple swap devices, set priorities
# Higher priority = used first
sudo swapon --priority=100 /dev/nvme0n1p2  # NVMe partition
sudo swapon --priority=10  /swapfile        # Fallback on slower drive

# In /etc/fstab:
# /dev/nvme0n1p2 none swap sw,pri=100 0 0
# /swapfile      none swap sw,pri=10  0 0

# Monitor swap I/O to detect performance issues
vmstat 1 5
# procs -----memory------ --swap--
#  r  b    swpd     free   si   so
#  1  0       0  2097152    0    0   # Good: no swap activity
#  3  2 8388608   131072  450  200   # Bad: heavy swapping
# (values in KB; si/so are swap-in/swap-out rates in KB/s)
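That vmstat check can be scripted. A sketch with a hypothetical `swap_pressure` helper that classifies one sample's si/so columns; the column positions assume procps vmstat's default layout (si is the 7th field, so the 8th):

```shell
# Classify swap activity from vmstat's si (swap-in) and so (swap-out)
# columns, both in KB/s. Any sustained nonzero value during inference
# means hot pages are being evicted.
swap_pressure() {
    si=$1; so=$2
    if [ $(( si + so )) -eq 0 ]; then
        echo "ok"
    else
        echo "swapping"
    fi
}

# Feed it the second vmstat sample (the first reports since-boot averages):
# read -r _ _ _ _ _ _ si so _ <<< "$(vmstat 1 2 | tail -1)"
# swap_pressure "$si" "$so"
swap_pressure 0 0       # -> ok
swap_pressure 450 200   # -> swapping
```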

Zram: Compressed In-Memory Swap

Zram creates a compressed swap device in RAM — faster than any SSD:

# Install and enable zram
sudo apt install -y zram-tools

# Configure zram (compressed RAM swap)
sudo tee /etc/default/zramswap <<'EOF'
ALGO=zstd
PERCENT=25
PRIORITY=100
EOF

sudo systemctl enable --now zramswap

# Verify zram is active
zramctl
# NAME       ALGORITHM  DISKSIZE  DATA  COMPR  TOTAL  STREAMS
# /dev/zram0 zstd         15.5G  1.2G  320M   380M       16

# Zram typically compresses mixed server data at ~2-3:1
# A 16 GB zram device holds up to 16 GB of swapped pages (DISKSIZE is
# the uncompressed capacity) while consuming only ~5-8 GB of physical
# RAM at those ratios
# Latency: single-digit microseconds (vs ~100 µs for NVMe,
# milliseconds for HDD)

# For AI servers, combine zram + NVMe swap:
# Priority 100: zram (fastest, compressed RAM)
# Priority 50:  NVMe swap file (fast, for overflow)
# This gives you maximum headroom for model loading spikes
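When sizing the tiers, remember that a zram device's DISKSIZE is its uncompressed capacity; its physical RAM cost is roughly DISKSIZE divided by the compression ratio. A back-of-envelope sketch; `zram_ram_cost_gb` is a made-up helper name:

```shell
# Approximate physical RAM consumed by a zram swap device when full.
# disksize_gb: uncompressed capacity; ratio: compression ratio (e.g. 3)
zram_ram_cost_gb() {
    disksize_gb=$1; ratio=$2
    echo $(( disksize_gb / ratio ))
}

zram_ram_cost_gb 16 3   # -> 5 (a full 16 GB device costs ~5 GB RAM at 3:1)
```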

Monitoring Swap Under AI Workloads

# Real-time swap monitoring during model load
watch -n 1 'free -h; echo "---"; swapon --show; echo "---"; vmstat 1 1 | tail -1'

# Alert if swap usage exceeds a threshold — save the script below as a
# standalone file (e.g. check-swap.sh) and run it from cron
#!/bin/bash
THRESHOLD=80
USAGE=$(free | grep Swap | awk '{if($2>0) print int($3/$2*100); else print 0}')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: Swap usage at ${USAGE}% — server may need more RAM"
fi

# Track swap usage over time
sar -S 1 60 > /var/log/swap-usage.log &   # requires the sysstat package

Proper swap configuration prevents OOM kills on your GPU server without sacrificing inference performance. For PyTorch memory management, see the PyTorch setup guide. Monitor overall GPU health with our GPU monitoring guide. Check the infrastructure section for related server configuration, tutorials for hands-on guides, and benchmarks for memory throughput data.

GPU Servers with Ample RAM

GigaGPU dedicated servers with up to 512 GB DDR5 RAM. Load the largest models without worrying about OOM.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
