Cross-NUMA Memory Access Adds 40% Latency to Every Inference Request
Your dual-socket server has GPUs physically wired to CPU socket 0, but the inference process runs on cores attached to socket 1. Every memory access then crosses the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD), adding roughly 40-80ns versus local DRAM. Over the millions of host-side accesses per request (tokenization, scheduling, KV-cache bookkeeping, staging buffers for PCIe transfers), this compounds into measurable latency. On a multi-socket GPU server, NUMA-unaware scheduling silently degrades AI inference performance by 15-30%.
Understand Your NUMA Topology
Before optimizing, map the relationship between CPUs, memory, and GPUs:
# Show NUMA topology
numactl --hardware
# Example output (dual-socket):
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# node 0 size: 128000 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# node 1 size: 128000 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
# Distance 10 = local, 21 = remote (~2.1x slower)
# GPU-to-NUMA mapping
nvidia-smi topo -m
# Shows which NUMA node each GPU is closest to
# GPU0 GPU1 CPU Affinity NUMA Affinity
# GPU0 X NV12 0-15 0
# GPU1 NV12 X 0-15 0
# Both GPUs on NUMA node 0
# Alternative check
for gpu in /sys/bus/pci/devices/*/numa_node; do
echo "$(basename "$(dirname "$gpu")"): NUMA $(cat "$gpu")"
done
# A value of -1 means the kernel reports no NUMA affinity for that device
# lscpu for CPU layout
lscpu | grep -E "NUMA|Socket|Core|Thread"
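The topology commands above produce text you will often want to consume programmatically, for example to drive a launch script. Here is a minimal sketch of parsing `numactl --hardware` output into a Python dict; the sample string mirrors the example output above, and the exact line format can vary slightly between numactl versions.

```python
import re

def parse_numactl_hardware(text: str) -> dict:
    """Parse `numactl --hardware` output into {node: {"cpus": [...], "size_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus: (.*)", line.strip())
        if m:
            nodes.setdefault(int(m.group(1)), {})["cpus"] = [int(c) for c in m.group(2).split()]
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line.strip())
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes

sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128000 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 128000 MB
"""
topo = parse_numactl_hardware(sample)
print(topo[0]["cpus"][:4])  # [0, 1, 2, 3]
```

In a real script you would feed this from `subprocess.run(["numactl", "--hardware"], ...)` instead of a hardcoded sample.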
Bind Inference to the Correct NUMA Node
Pin inference processes to the NUMA node closest to their GPUs:
# If GPUs are on NUMA node 0, bind inference to node 0
numactl --cpunodebind=0 --membind=0 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--port 8000
# --cpunodebind=0: Only run on CPUs from NUMA node 0
# --membind=0: Only allocate memory from NUMA node 0
# For systemd service files
# /etc/systemd/system/vllm-inference.service
[Service]
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 \
/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b \
--tensor-parallel-size 2 \
--port 8000
Environment=CUDA_VISIBLE_DEVICES=0,1
# If GPUs span NUMA nodes (e.g., GPU0 on node 0, GPU2 on node 1)
# Use interleaved memory policy instead
numactl --interleave=0,1 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4
# Verify binding after launch
taskset -cp "$(pgrep -of vllm)"
# -o picks the oldest matching process (the launcher); plain pgrep -f may match multiple workers
# Should show CPUs only from the target NUMA node
numastat -p "$(pgrep -of vllm)"
# Should show memory allocated primarily on target node
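To check the binding continuously rather than by eyeballing `numastat -p`, you can compute what fraction of a process's memory is resident on the target node from the Total row of that output. This is a sketch; the sample text imitates `numastat -p` output, whose exact column layout can differ between numastat versions.

```python
def local_fraction(numastat_output: str, target_node: int) -> float:
    """Fraction of a process's memory on `target_node`, from the Total row
    of `numastat -p <pid>` output (per-node columns, grand total last)."""
    for line in numastat_output.splitlines():
        parts = line.split()
        if parts and parts[0] == "Total" and len(parts) >= 3:
            per_node = [float(x) for x in parts[1:-1]]  # last column is the grand total
            total = float(parts[-1])
            return per_node[target_node] / total if total else 0.0
    raise ValueError("no Total row found")

sample = """\
Per-node process memory usage (in MBs) for PID 12345 (python)
                           Node 0          Node 1           Total
Heap                      1024.00            8.00         1032.00
Private                   2048.00           16.00         2064.00
Total                     3072.00           24.00         3096.00
"""
frac = local_fraction(sample, target_node=0)
print(f"{frac:.1%} of memory on node 0")  # 99.2% of memory on node 0
```

A fraction well below 1.0 after startup is the signal that `--membind` was not applied or that allocations happened before the policy took effect.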
Measure NUMA Impact on Inference
# Benchmark: correct NUMA binding vs wrong NUMA binding
# Test 1: Bound to correct NUMA node (GPU's node)
echo "=== Correct NUMA binding ==="
numactl --cpunodebind=0 --membind=0 \
python3 -c "
import torch, time
model = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(32, 4096, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10000):
    y = model(x)
torch.cuda.synchronize()
print(f'Time: {(time.perf_counter()-start)*1000:.1f}ms')
"
# Test 2: Bound to wrong NUMA node (remote)
echo "=== Wrong NUMA binding ==="
numactl --cpunodebind=1 --membind=1 \
python3 -c "
import torch, time
model = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(32, 4096, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10000):
    y = model(x)
torch.cuda.synchronize()
print(f'Time: {(time.perf_counter()-start)*1000:.1f}ms')
"
# Expected: 10-30% performance difference
# The gap widens with:
# - Larger data transfers between CPU and GPU
# - More CPU-side processing (tokenization)
# - Smaller batch sizes (more frequent kernel launches)
# Monitor NUMA memory access patterns
numastat -m
# Watch for "Other Node" allocations — these are cross-NUMA
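As a back-of-envelope check on whether the benchmark gap is plausible, multiply the remote-access penalty by the number of accesses that actually reach DRAM. Both numbers below are assumptions for illustration, not measurements: the penalty sits in the 40-80ns range quoted above, and the access count depends heavily on cache behavior and workload.

```python
# Illustrative arithmetic with assumed numbers, not measurements
extra_ns_per_access = 60            # assumed remote-vs-local DRAM penalty (40-80 ns range)
dram_accesses_per_request = 500_000 # assumed cache-missing host-side accesses per request
extra_ms = extra_ns_per_access * dram_accesses_per_request / 1e6
print(f"~{extra_ms:.0f} ms extra per request under these assumptions")
```

On a request that otherwise takes 100-200ms, an overhead of this magnitude lands in the same 15-30% range observed in practice; accesses served from CPU cache pay no penalty, which is why the real impact varies by workload.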
Multi-GPU NUMA Configuration
# Common server topologies:
#
# 2-socket, 4 GPUs: GPU 0,1 on NUMA 0 / GPU 2,3 on NUMA 1
# 2-socket, 8 GPUs: GPU 0-3 on NUMA 0 / GPU 4-7 on NUMA 1
# 1-socket, 8 GPUs: All GPUs on NUMA 0 (no NUMA concern)
# For tensor parallelism across NUMA nodes:
# Prefer GPUs within the same NUMA node for TP groups
# Cross-NUMA TP adds latency to all-reduce operations
# Good: TP across GPUs on same NUMA node
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 --membind=0 \
vllm serve model --tensor-parallel-size 2
# Less optimal: TP across NUMA nodes
CUDA_VISIBLE_DEVICES=0,2 numactl --interleave=0,1 \
vllm serve model --tensor-parallel-size 2
# For pipeline parallelism: NUMA crossing is less impactful
# PP only sends activations between stages (smaller transfers)
# TP sends all-reduce traffic every layer (frequent, latency-sensitive)
# Multi-instance serving: one instance per NUMA node
# Instance 1 (NUMA 0, GPU 0,1)
CUDA_VISIBLE_DEVICES=0,1 \
numactl --cpunodebind=0 --membind=0 \
vllm serve model --port 8000 --tensor-parallel-size 2 &
# Instance 2 (NUMA 1, GPU 2,3)
CUDA_VISIBLE_DEVICES=2,3 \
numactl --cpunodebind=1 --membind=1 \
vllm serve model --port 8001 --tensor-parallel-size 2 &
# Load balance between instances with Nginx
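The one-instance-per-node pattern above generalizes: given a GPU-to-NUMA mapping (read from `nvidia-smi topo -m` or sysfs), group GPUs by node and emit one launch command per group. This is a sketch; the mapping dict, `model` placeholder, and port scheme are illustrative assumptions.

```python
from collections import defaultdict

def plan_instances(gpu_numa, base_port=8000):
    """Group GPUs by NUMA node into one serving instance per node.
    gpu_numa maps GPU index -> NUMA node (hypothetical example mapping below)."""
    by_node = defaultdict(list)
    for gpu, node in sorted(gpu_numa.items()):
        by_node[node].append(gpu)
    plans = []
    for i, (node, gpus) in enumerate(sorted(by_node.items())):
        gpu_list = ",".join(map(str, gpus))
        plans.append(
            f"numactl --cpunodebind={node} --membind={node} "
            f"env CUDA_VISIBLE_DEVICES={gpu_list} "
            f"vllm serve model --port {base_port + i} --tensor-parallel-size {len(gpus)}"
        )
    return plans

# Assumed topology: GPUs 0,1 on node 0 and GPUs 2,3 on node 1
plans = plan_instances({0: 0, 1: 0, 2: 1, 3: 1})
for cmd in plans:
    print(cmd)
```

Using `env` after `numactl` keeps the variable assignment as an argument to a real program, which sidesteps the shell quoting pitfalls of mixing env-var prefixes with wrapper commands.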
Advanced NUMA Tuning
# Tune page migration policy
# Prevent kernel from auto-migrating pages across NUMA nodes
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# Auto-balancing sounds good but causes unpredictable latency spikes
# during page migration — disable for latency-sensitive inference
# Hugepages for reduced TLB misses
# Large models with mmap benefit from 2MB hugepages
echo 65536 | sudo tee /proc/sys/vm/nr_hugepages # 128GB of hugepages
# Mount hugetlbfs
sudo mount -t hugetlbfs nodev /mnt/hugepages
# Verify hugepage usage
grep -i huge /proc/meminfo
# CPU pinning with specific cores (finer than NUMA binding)
taskset -c 0-7 python3 -m vllm.entrypoints.openai.api_server \
--model model --port 8000
# Pins to cores 0-7 specifically, avoiding scheduler migration
# IRQ affinity: steer network interrupts to the right NUMA node
# Find the IRQ for your network interface
grep eth0 /proc/interrupts  # substitute your interface name (e.g. ens3f0)
# Set affinity to NUMA node 0 CPUs
echo 0000ffff | sudo tee /proc/irq/IRQ_NUMBER/smp_affinity
# Plain > redirection fails under sudo (the shell, not sudo, opens the file); tee writes as root
# Ensures network I/O processing stays on the same NUMA node as inference
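The `smp_affinity` value is a hex bitmask with bit i set for each allowed CPU i; here is a small helper that builds it from a CPU list. Note that on systems with more than 32 CPUs the kernel expects the mask as comma-separated 32-bit groups, which this minimal sketch does not handle.

```python
def smp_affinity_mask(cpus) -> str:
    """Hex bitmask for /proc/irq/<n>/smp_affinity: bit i set = CPU i allowed.
    Caveat: >32 CPUs require comma-separated 32-bit groups (not handled here)."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "08x")

print(smp_affinity_mask(range(16)))  # 0000ffff  (CPUs 0-15, i.e. NUMA node 0 above)
```

The `0000ffff` written in the example above is exactly this mask for the node-0 CPUs from the earlier `numactl --hardware` output.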
# Complete NUMA-optimized launch script
#!/bin/bash
GPU_NUMA_NODE=0
CPUS="0-15"
GPUS="0,1"
echo 0 > /proc/sys/kernel/numa_balancing
numactl --cpunodebind=$GPU_NUMA_NODE --membind=$GPU_NUMA_NODE \
taskset -c $CPUS \
env CUDA_VISIBLE_DEVICES=$GPUS \
/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b \
--tensor-parallel-size 2 \
--port 8000
NUMA-aware configuration extracts the full performance potential from multi-socket GPU servers. Deploy vLLM with proper CPU affinity using the production guide. Measure the impact against our token benchmarks. Monitor NUMA behavior with our GPU monitoring setup. Install PyTorch with our setup guide. Browse more benchmarks and infrastructure guides.