Model Loading Takes 5 Minutes Because Your Disk Cannot Keep Up
Loading a 140GB model from a SATA SSD takes over four minutes. Swapping models for different requests means minutes of downtime. Training data pipelines stall because the disk read throughput cannot match what the GPU processes. Storage is the forgotten bottleneck in AI infrastructure — when a GPU server can compute at teraflops but reads data at megabytes per second, the entire pipeline grinds to the speed of the slowest component.
Diagnose Disk I/O Bottlenecks
# Step 1: Check disk I/O during model loading or training
iostat -x 1 10
# Key columns:
# %util — 100% means disk is saturated
# r/s — reads per second
# rMB/s — read throughput in MB/s
# await — average I/O wait time in ms (sustained values above a few ms on an SSD indicate saturation)
# Step 2: Identify which process is doing I/O
sudo iotop -ao
# Shows cumulative I/O per process
# Step 3: Check what the GPU is waiting for
nvidia-smi dmon -s u -d 1
# If GPU util is near 0% during model load → disk bottleneck
# Step 4: Benchmark your actual disk speed
# Sequential read (model loading pattern)
fio --name=seq-read --ioengine=libaio --direct=1 --bs=1M \
--size=10G --numjobs=1 --rw=read --filename=/opt/models/fio-test
# Random read (training data pattern)
fio --name=rand-read --ioengine=libaio --direct=1 --bs=4k \
--size=1G --numjobs=4 --rw=randread --filename=/opt/models/fio-test
# Expected throughput:
# SATA SSD: ~500 MB/s sequential
# NVMe Gen3: ~3,500 MB/s sequential
# NVMe Gen4: ~7,000 MB/s sequential
# NVMe Gen4 RAID0: ~14,000 MB/s sequential
# RAM (tmpfs): ~20,000+ MB/s
# A 140GB model loads in:
# SATA SSD: ~280 seconds (4.7 minutes)
# NVMe Gen4: ~20 seconds
# RAID0 NVMe: ~10 seconds
# RAM: ~7 seconds
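The load-time figures above are just model size divided by sustained sequential throughput; a quick sanity-check helper (illustrative arithmetic, not part of any tool):

```python
def load_seconds(model_gb: float, throughput_mb_s: float) -> float:
    """Estimate sequential load time: model size / sustained read throughput."""
    return model_gb * 1000 / throughput_mb_s

# 140 GB model across the storage tiers listed above
for name, mbps in [("SATA SSD", 500), ("NVMe Gen4", 7000),
                   ("RAID0 NVMe", 14000), ("tmpfs (RAM)", 20000)]:
    print(f"{name}: {load_seconds(140, mbps):.0f}s")
```

Real-world numbers land somewhat higher once filesystem overhead and deserialization are included, but the ratio between tiers holds.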
Optimize NVMe Storage
# Check current NVMe configuration
nvme list
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,ROTA
# ROTA=0 means SSD, ROTA=1 means spinning disk
# Set I/O scheduler to none (best for NVMe)
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler
# Increase readahead for large sequential reads (model loading)
sudo blockdev --setra 8192 /dev/nvme0n1
# 8192 sectors x 512 bytes = 4MB readahead (default is typically 256 sectors = 128KB)
# Check drive health, then verify the NVMe link runs at full PCIe speed
sudo nvme smart-log /dev/nvme0n1
sudo lspci -vv | grep -A 10 "Non-Volatile"
# Check LnkSta for Speed and Width (e.g. "Speed 16GT/s, Width x4" = Gen4 x4)
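A downgraded link (e.g. Gen4 drive negotiating Gen3, or x4 dropping to x2) silently halves throughput. A small parser for the LnkSta line can turn the lspci output into a theoretical ceiling; the per-lane figures below are nominal values after encoding overhead (8b/10b for Gen1/2, 128b/130b for Gen3+), and the line format is assumed from typical lspci output:

```python
import re

# Approximate usable GB/s per lane after encoding overhead (nominal values)
GBPS_PER_LANE = {"2.5GT/s": 0.25, "5GT/s": 0.5, "8GT/s": 0.985,
                 "16GT/s": 1.969, "32GT/s": 3.938}

def link_bandwidth(lnksta: str) -> float:
    """Parse an lspci 'LnkSta' line and return theoretical GB/s."""
    m = re.search(r"Speed (\S+GT/s).*Width x(\d+)", lnksta)
    return GBPS_PER_LANE[m.group(1)] * int(m.group(2))

print(link_bandwidth("LnkSta: Speed 16GT/s, Width x4"))  # Gen4 x4: ~7.9 GB/s
```

If your fio numbers come in far below this ceiling, suspect the drive or filesystem rather than the link.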
# Check the volatile write cache (hdparm targets ATA drives, not NVMe;
# NVMe exposes this as feature 0x06, and most drives ship with it enabled)
sudo nvme get-feature /dev/nvme0 -f 0x06 -H
# Mount with optimal flags for AI workloads
# /etc/fstab entry for model storage
/dev/nvme0n1p1 /opt/models ext4 noatime,nodiratime,discard 0 2
Use RAM to Eliminate Disk Reads
# If you have enough RAM, cache models in memory
# Option 1: tmpfs — RAM-backed filesystem (contents are lost on reboot)
sudo mount -t tmpfs -o size=160g tmpfs /opt/models-fast/
mkdir -p /opt/models-fast/llama-3-70b
cp -r /opt/models/llama-3-70b/* /opt/models-fast/llama-3-70b/
# The model now loads at RAM speed
# Option 2: Let Linux page cache handle it
# First load reads from disk; subsequent loads come from RAM
# Ensure enough free RAM:
free -h
# If "available" > model size, second load will be near-instant
# Pre-warm page cache for critical models
cat /opt/models/llama-3-70b/*.safetensors > /dev/null
vmtouch -t /opt/models/llama-3-70b/ # Pin in cache
# Requires: apt install vmtouch
# Option 3: Memory-mapped file loading
# Most frameworks support mmap for model loading
# Python: model loads lazily, pages brought in on access
# See memory-mapped model loading guide for details
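The mmap pattern itself is simple enough to show with the standard library; this is a toy illustration of why mapping is instant and why only touched pages hit disk (real frameworks such as safetensors handle this for you, and MADV_SEQUENTIAL is a Linux-specific hint):

```python
import mmap
import os
import tempfile

# Stand-in "checkpoint" file; in practice this is your weights shard
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (16 * 1024 * 1024))  # 16 MB placeholder

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    mm.madvise(mmap.MADV_SEQUENTIAL)  # hint the kernel to read ahead aggressively
    # Mapping completed instantly regardless of file size;
    # pages are faulted in from disk only when a slice is accessed
    first_chunk = mm[:4096]
    mm.close()
```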
# Monitor page cache effectiveness
free -m
grep -E "Cached|Buffers|MemAvailable" /proc/meminfo
Fix Training Data Pipeline Stalls
# Training repeatedly reads dataset files from disk
# If disk can't keep up, GPU starves between batches
# Measure data loading time vs training time
import time
import torch
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, num_workers=0)
fetch_start = time.time()
for batch in dataloader:
    # Time spent waiting on the DataLoader (disk + decode), not just the copy
    load_time = time.time() - fetch_start
    batch = {k: v.to("cuda") for k, v in batch.items()}
    train_start = time.time()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()
    train_time = time.time() - train_start
    print(f"Load: {load_time*1000:.0f}ms Train: {train_time*1000:.0f}ms")
    fetch_start = time.time()
# If load_time > train_time, disk is the bottleneck
# Fix: Increase DataLoader workers and prefetch
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # parallel workers hide disk latency
    pin_memory=True,          # page-locked buffers speed host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps queued ahead
    persistent_workers=True,  # avoid respawning workers every epoch
)
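A back-of-envelope check helps decide whether workers alone can fix the stall, or whether the disk is simply too slow: multiply batch rate by batch size to get the sustained read throughput training demands (the figures in the example are illustrative, assuming uncompressed reads):

```python
def required_read_mb_s(batches_per_sec: float, batch_bytes: int) -> float:
    """Sustained disk throughput needed so the DataLoader never starves the GPU."""
    return batches_per_sec * batch_bytes / 1e6

# e.g. 10 batches/s of 32 samples at ~1 MB each = 320 MB/s sustained,
# already uncomfortably close to a SATA SSD's ~500 MB/s ceiling
print(f"{required_read_mb_s(10, 32 * 1_000_000):.0f} MB/s")
```

If the required figure exceeds your fio benchmark from the diagnosis section, no amount of worker tuning will close the gap.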
# Fix: Convert dataset to streaming format
# Columnar Arrow/Parquet reads are typically an order of magnitude faster than row-by-row JSON/CSV parsing
# Use datasets library with memory-mapping
from datasets import load_dataset
ds = load_dataset("parquet", data_files="train.parquet")
# Memory-mapped: only reads pages as needed
RAID Configuration for Maximum Throughput
# RAID 0 across multiple NVMe drives multiplies bandwidth
# WARNING: RAID 0 has no redundancy — pair with backups
# Create RAID 0 with mdadm
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 \
/dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /opt/models
# Benchmark RAID vs single drive
fio --name=raid-seq --ioengine=libaio --direct=1 --bs=1M \
--size=10G --numjobs=1 --rw=read --filename=/opt/models/fio-test
# Expect ~2x single-drive throughput with 2-drive RAID 0
# Save RAID configuration
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u
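Sequential scaling in RAID 0 is close to linear but not perfectly so; a rough planning helper, where the 0.9 efficiency factor is an assumption for striping and filesystem overhead rather than a measured constant:

```python
def raid0_seq_mb_s(per_drive_mb_s: float, n_drives: int,
                   efficiency: float = 0.9) -> float:
    """Rough RAID 0 sequential throughput estimate; real scaling is sub-linear."""
    return per_drive_mb_s * n_drives * efficiency

# Two Gen4 drives: ~14,000 MB/s ideal, ~12,600 MB/s as a realistic ceiling
print(raid0_seq_mb_s(7000, 2))
```

Always confirm with the fio benchmark above rather than trusting the estimate.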
Storage throughput sets the floor for model loading and data pipeline speed on your GPU server. Pair fast NVMe storage with vLLM deployed via the production guide. Benchmark end-to-end throughput against our token benchmarks. Monitor disk alongside GPU with our monitoring setup. Browse more benchmarks, infrastructure guides, and tutorials.
NVMe-Equipped GPU Servers
GigaGPU dedicated servers with high-speed NVMe storage matched to NVIDIA GPUs. Load models in seconds, not minutes.
Browse GPU Servers