
GPU Profiling with nvidia-smi & Nsight

Profile GPU workloads with nvidia-smi and Nsight tools. Covers utilization monitoring, kernel-level profiling, memory analysis, bottleneck identification, and actionable optimization for AI inference servers.

You Are Optimizing the Wrong Thing Without Profiling Data

You assumed inference was compute-bound and spent hours optimizing batch sizes. Profiling reveals 70% of time is spent in memory copies, not matrix multiplication. Without profiling, every optimization is a guess. nvidia-smi gives surface-level utilization numbers. Nsight Compute and Nsight Systems expose kernel-level behavior — exactly which operations consume time and why. Effective profiling on a dedicated GPU server turns guesswork into targeted optimization.

nvidia-smi: Quick Health and Utilization

Start with nvidia-smi for a system-level view before diving deeper:

# Snapshot view
nvidia-smi

# Continuous monitoring (every 1 second)
nvidia-smi dmon -s pucm -d 1
# p=power/temp, u=utilization, c=clocks, m=framebuffer memory
# Columns: pwr, gtemp, sm%, mem%, enc, dec, mclk, pclk

# Query specific metrics
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,\
memory.used,memory.total,power.draw,temperature.gpu,clocks.gr \
    --format=csv -l 1

# Process-level GPU usage
nvidia-smi pmon -d 1
# Shows per-process: SM%, memory%, encoder%, decoder%

# Identify which process uses which GPU
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory \
    --format=csv

# Key indicators from nvidia-smi:
# High SM%, high mem%  → GPU is working hard (good)
# High SM%, low mem%   → Compute-bound workload
# Low SM%, high mem%   → Bandwidth-bound (LLM decode)
# Low SM%, low mem%    → GPU is idle (pipeline stall)
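The four quadrants above can be turned into a quick triage helper. A minimal sketch in Python — the 60% threshold and the sample CSV line are illustrative, not official cutoffs:

```python
# Triage helper mirroring the nvidia-smi quadrants above.
# The 60% threshold and the sample line are illustrative assumptions.
def classify(sm_util: float, mem_util: float, busy: float = 60.0) -> str:
    """Map SM% and memory-controller% utilization to a likely bottleneck."""
    if sm_util >= busy and mem_util >= busy:
        return "healthy: GPU fully utilized"
    if sm_util >= busy:
        return "compute-bound"
    if mem_util >= busy:
        return "bandwidth-bound (typical of LLM decode)"
    return "idle: pipeline stall"

# One line of:
#   nvidia-smi --query-gpu=utilization.gpu,utilization.memory \
#       --format=csv,noheader,nounits
sample = "23, 81"
sm, mem = (float(v) for v in sample.split(","))
print(classify(sm, mem))  # bandwidth-bound (typical of LLM decode)
```

Fed from the `--query-gpu` loop shown earlier, this turns raw utilization numbers into a first hypothesis before reaching for Nsight.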

Nsight Systems: Timeline Profiling

Nsight Systems shows the complete timeline of CPU, GPU, and data transfer activity:

# Install Nsight Systems (packaged in the NVIDIA CUDA apt repository)
sudo apt install -y nsight-systems

# Profile an inference script
nsys profile --trace=cuda,nvtx,osrt \
    --output /tmp/inference-profile \
    python3 run_inference.py

# Profile vLLM serving (capture 30 seconds)
nsys profile --trace=cuda,nvtx \
    --duration 30 \
    --output /tmp/vllm-profile \
    python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000

# View results (generates .nsys-rep file)
# Transfer to a machine with Nsight Systems GUI, or use CLI:
nsys stats /tmp/inference-profile.nsys-rep

# Key things to look for in the timeline:
# 1. GPU idle gaps between kernels (pipeline stalls)
# 2. Long CPU sections between GPU launches (CPU bottleneck)
# 3. Large cudaMemcpy blocks (data transfer overhead)
# 4. Kernel launch latency (many small kernels = overhead)

# Export summary statistics
nsys stats --report cuda_gpu_kern_sum /tmp/inference-profile.nsys-rep
# Shows: kernel name, total time, count, average duration
# The longest kernels are your optimization targets
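When `nsys stats` is asked for CSV output (`--format csv`), the kernel summary is easy to post-process. A sketch that ranks kernels by total time — the column layout can vary between nsys versions, and the sample rows below are made up for illustration:

```python
import csv
import io

# Illustrative output of:
#   nsys stats --report cuda_gpu_kern_sum --format csv profile.nsys-rep
# (exact columns vary with the nsys version; these rows are made up)
sample = """Time (%),Total Time (ns),Instances,Avg (ns),Name
41.2,8123456,1200,6770,ampere_fp16_s16816gemm
22.7,4476112,2400,1865,elementwise_kernel
9.1,1794301,1200,1495,softmax_warp_forward
"""

rows = list(csv.DictReader(io.StringIO(sample)))
rows.sort(key=lambda r: float(r["Total Time (ns)"]), reverse=True)
for r in rows[:3]:
    print(f"{r['Name'][:40]:40s} {float(r['Time (%)']):5.1f}%  x{r['Instances']}")
```

The top entries of this ranking are the kernels worth taking into Nsight Compute for detailed analysis.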

Nsight Compute: Kernel-Level Analysis

# Nsight Compute profiles individual CUDA kernels in detail
# WARNING: 100-1000x slowdown — profile small workloads only

# Install (packaged in the NVIDIA CUDA apt repository)
sudo apt install -y nsight-compute

# Profile specific kernels
ncu --set full \
    --target-processes all \
    --launch-count 10 \
    --export /tmp/kernel-profile \
    python3 single_inference.py

# Quick summary without full analysis
ncu --metrics \
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
    python3 single_inference.py

# Key Nsight Compute metrics:
#
# sm__throughput: Compute utilization (% of peak TFLOPS)
#   > 60% → compute-bound, optimize arithmetic
#   < 30% → likely memory-bound or stalled
#
# dram__throughput: Memory bandwidth utilization
#   > 70% → bandwidth-bound, reduce data movement
#   < 30% → underutilizing memory, check access patterns
#
# Achieved occupancy: Warps active vs maximum
#   < 50% → not enough parallelism, increase batch size
#
# L2 cache hit rate:
#   Low → data doesn't fit in cache, review access patterns

# Filter for kernels whose name matches a regex
ncu --kernel-name "regex:gemm" --launch-count 5 python3 inference.py
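The compute-vs-bandwidth split these metrics describe can be sanity-checked with back-of-envelope arithmetic intensity: FLOPs per byte moved, compared against the GPU's FLOPS-to-bandwidth ratio. A sketch with illustrative RTX-class hardware numbers (assumed, not measured):

```python
def arithmetic_intensity_gemm(m: int, n: int, k: int,
                              bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an m x k @ k x n FP16 GEMM (reads A, B; writes C)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Illustrative machine balance: ~300 TFLOPS FP16 over ~1 TB/s of memory
# bandwidth gives ~300 FLOPs/byte where the roofline bends.
machine_balance = 300e12 / 1.0e12

ai = arithmetic_intensity_gemm(4096, 4096, 4096)
print(f"GEMM arithmetic intensity: {ai:.0f} FLOPs/byte")   # 1365
print("compute-bound" if ai > machine_balance else "bandwidth-bound")

# Decode-style GEMV (batch 1) for comparison: far below the balance point
print(f"GEMV intensity: {arithmetic_intensity_gemm(1, 4096, 4096):.1f}")  # 1.0
```

This is why large prefill GEMMs show high sm__throughput while batch-1 decode pins dram__throughput instead: the same weights, wildly different FLOPs per byte.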

PyTorch Built-in Profiler

import torch
from torch.profiler import profile, ProfilerActivity, schedule

# Profile inference with PyTorch profiler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=6, repeat=1),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step in range(10):
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
        prof.step()

# Print top operations by GPU time
print(prof.key_averages().table(
    sort_by="cuda_time_total",
    row_limit=20
))

# Export for Chrome trace viewer
prof.export_chrome_trace("/tmp/trace.json")
# Open chrome://tracing (or ui.perfetto.dev) and load the file

# Export stack traces for flame-graph tools (e.g. FlameGraph, speedscope)
prof.export_stacks("/tmp/profiler_stacks.txt", "self_cuda_time_total")

# Key columns in the table:
# Self CUDA Time: Time spent in this operation (excluding children)
# CUDA Time Total: Total time including child operations
# CPU Time: Time on CPU (scheduling, data movement)
# Calls: Number of invocations
# Input Shapes: Tensor dimensions (helps spot unaligned shapes)

From Profile to Optimization

# Profile finding → Action
#
# Large cudaMemcpy time:
# → Move data to GPU earlier, use pinned memory, reduce transfers
# → torch.cuda.Stream for overlapping compute and transfer
#
# Many small kernel launches:
# → Use torch.compile() to fuse operations
# → Enable CUDA Graphs (eliminates launch overhead)
#
# Low SM utilization with high memory throughput:
# → Bandwidth-bound: quantize model, reduce precision
# → Normal for LLM decode — focus on batching
#
# High SM utilization but low throughput:
# → Check for Tensor Core usage (FP16/BF16)
# → Verify matrix dimensions are multiples of 8
#
# CPU time between GPU kernels:
# → Python overhead: use torch.compile or C++ inference
# → Tokenization bottleneck: use fast tokenizers
#
# Memory allocation spikes:
# → Pre-allocate tensors, use memory pools
# → torch.cuda.memory.set_per_process_memory_fraction()
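The "many small kernel launches" finding above can be quantified with a toy overhead model; the 5 µs per-launch cost is an assumed ballpark, not a measured value:

```python
# Back-of-envelope model of kernel-launch overhead.
# The ~5 us per-launch cost is an assumed ballpark, not a measured value.
def launch_overhead_fraction(avg_kernel_us: float,
                             launch_us: float = 5.0) -> float:
    """Fraction of wall time spent launching kernels rather than computing."""
    return launch_us / (avg_kernel_us + launch_us)

# Chains of 10 us kernels lose a third of the wall time to launches:
print(f"{launch_overhead_fraction(10.0):.0%}")   # 33%
# Fused into 100 us kernels (torch.compile / CUDA Graphs), overhead shrinks:
print(f"{launch_overhead_fraction(100.0):.0%}")  # 5%
```

The model makes the fusion payoff concrete: overhead depends only on average kernel duration, so doubling kernel size halves the launch tax regardless of total kernel count.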

# Quick profiling one-liner for identifying bottleneck type
python3 -c "
import torch, time
x = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    y = torch.mm(x, x)
torch.cuda.synchronize()
compute_time = time.perf_counter() - start
print(f'MatMul throughput: {2*4096**3*100/compute_time/1e12:.1f} TFLOPS')
print('Expected RTX 6000 Pro FP16: ~300 TFLOPS')
print(f'Achieved/Expected: {2*4096**3*100/compute_time/1e12/300*100:.0f}%')
"

Profiling transforms GPU optimization from guesswork into engineering on your GPU server. Profile vLLM deployments set up with the production guide, baseline against our token benchmarks, and monitor production metrics with our GPU monitoring setup. Browse more benchmarks, infrastructure guides, and tutorials.

