Your Expensive GPU Sits Half-Idle During Inference
nvidia-smi shows GPU utilization hovering at 30-45% while your inference API serves requests. You expected near 100%. Half your compute investment is wasted every second the GPU idles between operations. Low utilization means something upstream is starving the GPU — the CPU, storage, network, data pipeline, or batch configuration is not feeding work fast enough. Diagnosing the specific stall is the first step to extracting full value from a dedicated GPU server.
Identify the Bottleneck
GPU utilization drops when the GPU waits for data or instructions from another component:
# Step 1: Capture baseline metrics
nvidia-smi dmon -s pucm -d 1 -f /tmp/gpu-metrics.csv &
# Columns: pwr, gtemp, sm%, mem%, enc, dec, mclk, pclk
# sm% = streaming multiprocessor utilization (compute)
# mem% = memory controller utilization (bandwidth)
# Step 2: Check CPU utilization during inference
top -b -n 1 | head -30
# If total CPU is at 100% → CPU bottleneck (tokenization, preprocessing)
# Also watch per-core load: tokenization is often single-threaded, so one
# pegged core can starve the GPU while the total looks low
# If CPU is low → GPU is not being fed enough work
# Step 3: Check disk I/O
iostat -x 1 5
# If %util is high → disk I/O bottleneck (model loading, data read)
# Step 4: Check memory bandwidth vs compute
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 1
# High memory%, low gpu% → bandwidth-bound (normal for LLM decode)
# Low memory%, low gpu% → GPU is simply idle (pipeline stall)
# High gpu%, high memory% → GPU is working hard (good)
# Step 5: Monitor PCIe traffic
nvidia-smi dmon -s t -d 1
# High rx/tx → data transfer bottleneck
# Near zero → no data pipeline issue
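The sm%/mem% decision table in Step 4 can be captured as a small helper that classifies the two utilization numbers. A minimal sketch; the 30/60 thresholds are illustrative assumptions, not NVIDIA guidance, and the function name is hypothetical:

```python
def classify_bottleneck(gpu_util: float, mem_util: float) -> str:
    """Map nvidia-smi utilization.gpu / utilization.memory readings
    to the likely stall, per the decision table above."""
    if gpu_util < 30 and mem_util < 30:
        return "idle: pipeline stall upstream (CPU, I/O, or batching)"
    if gpu_util < 30 and mem_util >= 60:
        return "bandwidth-bound: normal for LLM decode at small batch"
    if gpu_util >= 60 and mem_util >= 60:
        return "busy: GPU is well fed"
    return "mixed: profile further (PCIe, kernel launch overhead)"

print(classify_bottleneck(25, 80))
```

Feed it the two percentages from the `--query-gpu=utilization.gpu,utilization.memory` loop above to get a first-pass diagnosis.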
Increase Effective Batch Size
Single-request inference vastly underutilizes GPU compute:
# Single request: GPU does one forward pass, then waits
# Batch of 8: GPU processes 8 requests simultaneously
# vLLM continuous batching (automatically batches concurrent requests)
vllm serve meta-llama/Llama-3-8B-Instruct \
--max-num-seqs 64 \
--max-num-batched-tokens 8192
# If you only have 1 request at a time, utilization WILL be low
# This is expected — LLM decode is memory-bound at batch=1
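The memory-bound claim can be sanity-checked with a roofline-style back-of-envelope: each decoded token must stream the full weight set from VRAM once, so per-sequence throughput is capped by bandwidth divided by model size. A sketch, using a hypothetical round bandwidth figure:

```python
def decode_tokens_per_sec_ceiling(params_b: float, bytes_per_param: int,
                                  mem_bw_gbs: float) -> float:
    """Rough ceiling: decoding one token reads every weight from VRAM,
    so tokens/s <= memory bandwidth / model size in bytes."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return mem_bw_gbs * 1e9 / model_bytes

# Assumed: 8B params, FP16 (2 bytes/param), ~1000 GB/s memory bandwidth
print(f"{decode_tokens_per_sec_ceiling(8, 2, 1000):.1f} tok/s per sequence")
```

Batching N sequences reuses the same weight read for N tokens, which is why concurrent requests raise utilization: compute per byte moved goes up roughly N-fold.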
# Generate concurrent load to test max utilization
python3 -c "
import aiohttp, asyncio, time

async def send_request(session, prompt):
    payload = {
        'model': 'meta-llama/Llama-3-8B-Instruct',
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': 128
    }
    async with session.post('http://localhost:8000/v1/chat/completions',
                            json=payload) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        # Send 32 concurrent requests
        tasks = [send_request(session, f'Count to {i}') for i in range(32)]
        start = time.time()
        results = await asyncio.gather(*tasks)
        elapsed = time.time() - start
        print(f'{len(results)} requests in {elapsed:.1f}s')

asyncio.run(main())
"
# GPU utilization should jump to 80-95% under concurrent load
Fix Data Pipeline Stalls
# For training and batch processing: data loading is often the bottleneck
# Bad: synchronous data loading on CPU
for batch in dataloader:  # CPU loads each batch while the GPU waits
    output = model(batch.to("cuda"))
# Good: prefetch data with multiple workers
from torch.utils.data import DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Parallel data loading
    pin_memory=True,          # Faster CPU→GPU transfer
    prefetch_factor=4,        # Preload 4 batches per worker
    persistent_workers=True,  # Don't restart workers each epoch
)
# For inference APIs: tokenization can stall the GPU
# Profile tokenization time vs inference time
import time
import torch

# Tokenization (CPU) plus the host→device copy
start = time.time()
tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
tokenize_time = time.time() - start

# Inference (GPU): synchronize so the timer captures all queued GPU work
start = time.time()
output = model.generate(**tokens, max_new_tokens=128)
torch.cuda.synchronize()
inference_time = time.time() - start

print(f"Tokenize: {tokenize_time*1000:.0f}ms")
print(f"Inference: {inference_time*1000:.0f}ms")
# If tokenize_time > 20% of inference_time, CPU is a bottleneck
Common Causes of Low Utilization
# Cause 1: Model too small for the GPU
# A 1B model on an RTX 6000 Pro will never hit 90% utilization at batch=1
# Fix: Use a larger model, or batch more requests
# Cause 2: Synchronous Python operations between GPU calls
# Fix: Use async frameworks, avoid torch.cuda.synchronize() in hot paths
# Cause 3: CPU-bound preprocessing
# Fix: Move preprocessing to GPU, use faster tokenizers
# pip install tokenizers # Rust-based, 10x faster than Python
# Cause 4: Inefficient attention implementation
# Fix: Ensure Flash Attention is enabled
python3 -c "
# vLLM uses Flash Attention by default when it is installed
# For manual PyTorch: pip install flash-attn
import flash_attn
print(f'Flash Attention version: {flash_attn.__version__}')
"
# Cause 5: Memory fragmentation limiting batch size
# vLLM: check GPU memory allocation
curl -s http://localhost:8000/metrics | grep gpu_cache
# If gpu_cache_usage_perc is low, increase --gpu-memory-utilization
vllm serve model --gpu-memory-utilization 0.95
# Cause 6: Power throttling reducing clocks
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
# Fix: See GPU power management guide
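The throttle query returns a hex bitmask. A decoder sketch follows; the bit assignments mirror NVML's `nvmlClocksThrottleReason*` constants as commonly documented, but verify them against your driver's `nvml.h` before relying on the output:

```python
# Bit values assumed from NVML's clocks-throttle-reason constants;
# confirm against nvml.h for your driver version.
REASONS = {
    0x01: "GpuIdle",
    0x02: "ApplicationsClocksSetting",
    0x04: "SwPowerCap",
    0x08: "HwSlowdown",
    0x10: "SyncBoost",
    0x20: "SwThermalSlowdown",
    0x40: "HwThermalSlowdown",
    0x80: "HwPowerBrakeSlowdown",
}

def decode_throttle(mask_hex: str) -> list[str]:
    """Turn the hex bitmask from nvidia-smi into reason names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in REASONS.items() if mask & bit]

print(decode_throttle("0x0000000000000004"))
```

A `SwPowerCap` or thermal-slowdown bit here means the GPU is running below its rated clocks, which depresses utilization independently of your data pipeline.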
Set Up Utilization Monitoring
# Continuous monitoring to catch utilization drops
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,\
power.draw,temperature.gpu --format=csv -l 10 \
| tee /var/log/gpu-utilization.csv
# Quick utilization summary
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory \
--format=csv -l 1
# Alert on sustained low utilization
cat <<'EOF' > /opt/scripts/check-gpu-util.sh
#!/bin/bash
THRESHOLD=30
GPU_UTIL=$(nvidia-smi --query-gpu=utilization.gpu \
    --format=csv,noheader,nounits -i 0)
if [ "$GPU_UTIL" -lt "$THRESHOLD" ]; then
    logger -p user.warning -t "gpu-util" \
        "GPU 0 utilization at ${GPU_UTIL}% (threshold: ${THRESHOLD}%)"
fi
EOF
chmod +x /opt/scripts/check-gpu-util.sh
# Run the check every minute via cron
( crontab -l 2>/dev/null; echo '* * * * * /opt/scripts/check-gpu-util.sh' ) | crontab -
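To catch *sustained* drops rather than a single low sample, the CSV log written by the continuous monitoring command above can be summarized offline. A minimal parser sketch, assuming the exact column order produced by the `--query-gpu` flags shown (timestamp, index, utilization.gpu, utilization.memory, power.draw, temperature.gpu):

```python
import csv
import io

def mean_gpu_util(csv_text: str) -> float:
    """Average the utilization.gpu column of an nvidia-smi CSV log.
    Values arrive as strings like ' 35 %', so strip units first."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    vals = [float(r[2].strip().rstrip(" %")) for r in rows[1:] if len(r) >= 3]
    return sum(vals) / len(vals)

# Synthetic two-sample log in the same format for illustration
sample = """timestamp, index, utilization.gpu [%], utilization.memory [%], power.draw [W], temperature.gpu
2025/01/01 12:00:00.000, 0, 35 %, 60 %, 210.00 W, 61
2025/01/01 12:00:10.000, 0, 45 %, 70 %, 230.00 W, 63"""
print(mean_gpu_util(sample))  # 40.0
```

Run it over a window of `/var/log/gpu-utilization.csv` to decide whether low utilization is chronic or just an idle moment between requests.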
Low GPU utilization is a symptom, not a disease — find and fix the actual bottleneck. Maximize throughput on your GPU server with continuous batching in vLLM using the production guide. Track utilization with our monitoring setup. See throughput baselines in our token benchmarks. Browse more benchmarks, infrastructure guides, and tutorials.
Maximize GPU ROI
GigaGPU dedicated servers with NVIDIA GPUs designed for sustained AI workloads. Full root access to tune every parameter for peak utilization.
Browse GPU Servers