Cross-NUMA Memory Access Adds 40% Latency to Every Inference Request
Your dual-socket server has GPUs physically wired to CPU socket 0, but the inference process runs on cores attached to socket 1. Every memory access then crosses the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD), adding roughly 40-80ns versus local DRAM. Over the millions of host-side accesses per request (tokenization, scheduling, KV-cache bookkeeping, staging buffers for PCIe transfers), this compounds into measurable latency. On a multi-socket GPU server, NUMA-unaware scheduling silently degrades AI inference performance by 15-30%.
Understand Your NUMA Topology
Before optimizing, map the relationship between CPUs, memory, and GPUs:
# Show NUMA topology
numactl --hardware
# Example output (dual-socket):
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# node 0 size: 128000 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# node 1 size: 128000 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
# Distance 10 = local, 21 = remote (~2.1x slower)
# GPU-to-NUMA mapping
nvidia-smi topo -m
# Shows which NUMA node each GPU is closest to
# GPU0 GPU1 CPU Affinity NUMA Affinity
# GPU0 X NV12 0-15 0
# GPU1 NV12 X 0-15 0
# Both GPUs on NUMA node 0
# Alternative check
for gpu in /sys/bus/pci/devices/*/numa_node; do
echo "$(basename "$(dirname "$gpu")"): NUMA $(cat "$gpu")"
done
# A value of -1 means the kernel reports no NUMA affinity for that device
# lscpu for CPU layout
lscpu | grep -E "NUMA|Socket|Core|Thread"
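The topology commands above produce text you will often want to consume programmatically, for example to drive a launch script. Here is a minimal sketch of parsing `numactl --hardware` output into a Python dict; the sample string mirrors the example output above, and the exact line format can vary slightly between numactl versions.

```python
import re

def parse_numactl_hardware(text: str) -> dict:
    """Parse `numactl --hardware` output into {node: {"cpus": [...], "size_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus: (.*)", line.strip())
        if m:
            nodes.setdefault(int(m.group(1)), {})["cpus"] = [int(c) for c in m.group(2).split()]
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line.strip())
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes

sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 128000 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 128000 MB
"""
topo = parse_numactl_hardware(sample)
print(topo[0]["cpus"][:4])  # [0, 1, 2, 3]
```

In a real script you would feed this from `subprocess.run(["numactl", "--hardware"], ...)` instead of a hardcoded sample.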
Bind Inference to the Correct NUMA Node
Pin inference processes to the NUMA node closest to their GPUs:
# If GPUs are on NUMA node 0, bind inference to node 0
numactl --cpunodebind=0 --membind=0 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--port 8000
# --cpunodebind=0: Only run on CPUs from NUMA node 0
# --membind=0: Only allocate memory from NUMA node 0
# For systemd service files
# /etc/systemd/system/vllm-inference.service
[Service]
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 \
/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b \
--tensor-parallel-size 2 \
--port 8000
Environment=CUDA_VISIBLE_DEVICES=0,1
# If GPUs span NUMA nodes (e.g., GPU0 on node 0, GPU2 on node 1)
# Use interleaved memory policy instead
numactl --interleave=0,1 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4
# Verify binding after launch
taskset -cp "$(pgrep -of vllm)"
# -o picks the oldest matching process (the launcher); plain pgrep -f may match multiple workers
# Should show CPUs only from the target NUMA node
numastat -p "$(pgrep -of vllm)"
# Should show memory allocated primarily on target node
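To check the binding continuously rather than by eyeballing `numastat -p`, you can compute what fraction of a process's memory is resident on the target node from the Total row of that output. This is a sketch; the sample text imitates `numastat -p` output, whose exact column layout can differ between numastat versions.

```python
def local_fraction(numastat_output: str, target_node: int) -> float:
    """Fraction of a process's memory on `target_node`, from the Total row
    of `numastat -p <pid>` output (per-node columns, grand total last)."""
    for line in numastat_output.splitlines():
        parts = line.split()
        if parts and parts[0] == "Total" and len(parts) >= 3:
            per_node = [float(x) for x in parts[1:-1]]  # last column is the grand total
            total = float(parts[-1])
            return per_node[target_node] / total if total else 0.0
    raise ValueError("no Total row found")

sample = """\
Per-node process memory usage (in MBs) for PID 12345 (python)
                           Node 0          Node 1           Total
Heap                      1024.00            8.00         1032.00
Private                   2048.00           16.00         2064.00
Total                     3072.00           24.00         3096.00
"""
frac = local_fraction(sample, target_node=0)
print(f"{frac:.1%} of memory on node 0")  # 99.2% of memory on node 0
```

A fraction well below 1.0 after startup is the signal that `--membind` was not applied or that allocations happened before the policy took effect.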
Measure NUMA Impact on Inference
# Benchmark: correct NUMA binding vs wrong NUMA binding
# Test 1: Bound to correct NUMA node (GPU's node)
echo "=== Correct NUMA binding ==="
numactl --cpunodebind=0 --membind=0 \
python3 -c "
import torch, time
model = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(32, 4096, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10000):
    y = model(x)
torch.cuda.synchronize()
print(f'Time: {(time.perf_counter()-start)*1000:.1f}ms')
"
# Test 2: Bound to wrong NUMA node (remote)
echo "=== Wrong NUMA binding ==="
numactl --cpunodebind=1 --membind=1 \
python3 -c "
import torch, time
model = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(32, 4096, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10000):
    y = model(x)
torch.cuda.synchronize()
print(f'Time: {(time.perf_counter()-start)*1000:.1f}ms')
"
# Expected: 10-30% performance difference
# The gap widens with:
# - Larger data transfers between CPU and GPU
# - More CPU-side processing (tokenization)
# - Smaller batch sizes (more frequent kernel launches)
# Monitor NUMA memory access patterns
numastat -m
# Watch for "Other Node" allocations — these are cross-NUMA
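As a back-of-envelope check on whether the benchmark gap is plausible, multiply the remote-access penalty by the number of accesses that actually reach DRAM. Both numbers below are assumptions for illustration, not measurements: the penalty sits in the 40-80ns range quoted above, and the access count depends heavily on cache behavior and workload.

```python
# Illustrative arithmetic with assumed numbers, not measurements
extra_ns_per_access = 60            # assumed remote-vs-local DRAM penalty (40-80 ns range)
dram_accesses_per_request = 500_000 # assumed cache-missing host-side accesses per request
extra_ms = extra_ns_per_access * dram_accesses_per_request / 1e6
print(f"~{extra_ms:.0f} ms extra per request under these assumptions")
```

On a request that otherwise takes 100-200ms, an overhead of this magnitude lands in the same 15-30% range observed in practice; accesses served from CPU cache pay no penalty, which is why the real impact varies by workload.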
Multi-GPU NUMA Configuration
# Common server topologies:
#
# 2-socket, 4 GPUs: GPU 0,1 on NUMA 0 / GPU 2,3 on NUMA 1
# 2-socket, 8 GPUs: GPU 0-3 on NUMA 0 / GPU 4-7 on NUMA 1
# 1-socket, 8 GPUs: All GPUs on NUMA 0 (no NUMA concern)
# For tensor parallelism across NUMA nodes:
# Prefer GPUs within the same NUMA node for TP groups
# Cross-NUMA TP adds latency to all-reduce operations
# Good: TP across GPUs on same NUMA node
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 --membind=0 \
vllm serve model --tensor-parallel-size 2
# Less optimal: TP across NUMA nodes
CUDA_VISIBLE_DEVICES=0,2 numactl --interleave=0,1 \
vllm serve model --tensor-parallel-size 2
# For pipeline parallelism: NUMA crossing is less impactful
# PP only sends activations between stages (smaller transfers)
# TP sends all-reduce traffic every layer (frequent, latency-sensitive)
# Multi-instance serving: one instance per NUMA node
# Instance 1 (NUMA 0, GPU 0,1)
CUDA_VISIBLE_DEVICES=0,1 \
numactl --cpunodebind=0 --membind=0 \
vllm serve model --port 8000 --tensor-parallel-size 2 &
# Instance 2 (NUMA 1, GPU 2,3)
CUDA_VISIBLE_DEVICES=2,3 \
numactl --cpunodebind=1 --membind=1 \
vllm serve model --port 8001 --tensor-parallel-size 2 &
# Load balance between instances with Nginx
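The one-instance-per-node pattern above generalizes: given a GPU-to-NUMA mapping (read from `nvidia-smi topo -m` or sysfs), group GPUs by node and emit one launch command per group. This is a sketch; the mapping dict, `model` placeholder, and port scheme are illustrative assumptions.

```python
from collections import defaultdict

def plan_instances(gpu_numa, base_port=8000):
    """Group GPUs by NUMA node into one serving instance per node.
    gpu_numa maps GPU index -> NUMA node (hypothetical example mapping below)."""
    by_node = defaultdict(list)
    for gpu, node in sorted(gpu_numa.items()):
        by_node[node].append(gpu)
    plans = []
    for i, (node, gpus) in enumerate(sorted(by_node.items())):
        gpu_list = ",".join(map(str, gpus))
        plans.append(
            f"numactl --cpunodebind={node} --membind={node} "
            f"env CUDA_VISIBLE_DEVICES={gpu_list} "
            f"vllm serve model --port {base_port + i} --tensor-parallel-size {len(gpus)}"
        )
    return plans

# Assumed topology: GPUs 0,1 on node 0 and GPUs 2,3 on node 1
plans = plan_instances({0: 0, 1: 0, 2: 1, 3: 1})
for cmd in plans:
    print(cmd)
```

Using `env` after `numactl` keeps the variable assignment as an argument to a real program, which sidesteps the shell quoting pitfalls of mixing env-var prefixes with wrapper commands.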
Advanced NUMA Tuning
# Tune page migration policy
# Prevent kernel from auto-migrating pages across NUMA nodes
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# Auto-balancing sounds good but causes unpredictable latency spikes
# during page migration — disable for latency-sensitive inference
# Hugepages for reduced TLB misses
# Large models with mmap benefit from 2MB hugepages
echo 65536 | sudo tee /proc/sys/vm/nr_hugepages # 128GB of hugepages
# Mount hugetlbfs
sudo mount -t hugetlbfs nodev /mnt/hugepages
# Verify hugepage usage
grep -i huge /proc/meminfo
# CPU pinning with specific cores (finer than NUMA binding)
taskset -c 0-7 python3 -m vllm.entrypoints.openai.api_server \
--model model --port 8000
# Pins to cores 0-7 specifically, avoiding scheduler migration
# IRQ affinity: steer network interrupts to the right NUMA node
# Find the IRQ for your network interface
grep eth0 /proc/interrupts  # substitute your interface name (e.g. ens3f0)
# Set affinity to NUMA node 0 CPUs
echo 0000ffff | sudo tee /proc/irq/IRQ_NUMBER/smp_affinity
# Plain > redirection fails under sudo (the shell, not sudo, opens the file); tee writes as root
# Ensures network I/O processing stays on the same NUMA node as inference
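The `smp_affinity` value is a hex bitmask with bit i set for each allowed CPU i; here is a small helper that builds it from a CPU list. Note that on systems with more than 32 CPUs the kernel expects the mask as comma-separated 32-bit groups, which this minimal sketch does not handle.

```python
def smp_affinity_mask(cpus) -> str:
    """Hex bitmask for /proc/irq/<n>/smp_affinity: bit i set = CPU i allowed.
    Caveat: >32 CPUs require comma-separated 32-bit groups (not handled here)."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "08x")

print(smp_affinity_mask(range(16)))  # 0000ffff  (CPUs 0-15, i.e. NUMA node 0 above)
```

The `0000ffff` written in the example above is exactly this mask for the node-0 CPUs from the earlier `numactl --hardware` output.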
# Complete NUMA-optimized launch script
#!/bin/bash
GPU_NUMA_NODE=0
CPUS="0-15"
GPUS="0,1"
echo 0 > /proc/sys/kernel/numa_balancing
numactl --cpunodebind=$GPU_NUMA_NODE --membind=$GPU_NUMA_NODE \
taskset -c $CPUS \
env CUDA_VISIBLE_DEVICES=$GPUS \
/opt/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
--model /opt/models/llama-3-70b \
--tensor-parallel-size 2 \
--port 8000
NUMA-aware configuration extracts the full performance potential from multi-socket GPU servers. Deploy vLLM with proper CPU affinity using the production guide. Measure the impact against our token benchmarks. Monitor NUMA behavior with our GPU monitoring setup. Install PyTorch with our setup guide. Browse more benchmarks and infrastructure guides.