Every dedicated GPU operator knows nvidia-smi. Most use only the default summary view. The tool has far more capability – process tracking, ECC error reporting, topology queries, and scriptable output. On our dedicated GPU hosting these less-used modes are worth knowing.
Process Listing
nvidia-smi pmon -c 1
Shows per-process GPU activity: PID, process name, and SM/memory utilisation, sampled once (-c 1). Useful for finding which vLLM replica is using which GPU on a multi-process server.
For machine-readable per-process output, including memory per process:
nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id,used_memory --format=csv
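The CSV output above is easy to post-process. A minimal sketch, grouping processes by GPU bus ID; the sample text and its PIDs are made-up examples, not output from a real host:

```python
import csv
import io

# Hypothetical sample of `nvidia-smi --query-compute-apps ... --format=csv`
# output; on a real host you would read this from the command instead.
sample = """pid, process_name, gpu_bus_id, used_memory [MiB]
41235, /usr/bin/python3, 00000000:17:00.0, 22314 MiB
41236, /usr/bin/python3, 00000000:65:00.0, 22310 MiB
"""

def procs_by_gpu(text):
    """Group compute processes by the GPU bus ID they run on."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    out = {}
    for row in reader:
        out.setdefault(row["gpu_bus_id"], []).append(
            (int(row["pid"]), row["used_memory [MiB]"])
        )
    return out

print(procs_by_gpu(sample))
```

The same pattern works for any --query-* mode: keep the header row, feed it to csv.DictReader, and index columns by name rather than position.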
Machine-Readable
For scripting or monitoring:
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
--format=csv,noheader,nounits
Output is parsable CSV. Good for one-off checks without setting up Prometheus.
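With noheader,nounits the fields come back in the order you listed them, so a single split is enough. A sketch, using a made-up sample line; the 90% headroom threshold is an arbitrary example, not a recommendation:

```python
# Hypothetical sample line from the query above (csv,noheader,nounits);
# field order matches the --query-gpu list.
line = "NVIDIA A100-SXM4-40GB, 41, 87, 30522, 40960"

name, temp, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
stats = {
    "name": name,
    "temperature_c": int(temp),
    "utilization_pct": int(util),
    "memory_used_mib": int(mem_used),
    "memory_total_mib": int(mem_total),
}

# Flag GPUs that are nearly out of memory (90% is an arbitrary example)
low_headroom = stats["memory_used_mib"] / stats["memory_total_mib"] > 0.9
```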
Topology
nvidia-smi topo -m
Shows how GPUs connect – PCIe root complex, NUMA node, interconnect type. Critical for multi-GPU tensor-parallel setups: two GPUs on the same NUMA node communicate faster than cross-socket pairs.
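The matrix cells use short labels for each link type (NV# for NVLink, PIX for same PCIe switch, NODE for same NUMA node, SYS for cross-socket). A sketch of how you might rank pairs by those labels when choosing tensor-parallel groups; the pairs and labels below are made-up examples, not real topo output:

```python
# Rough speed ranking of `nvidia-smi topo -m` link labels (lower = faster).
RANK = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

# Hypothetical GPU pairs and their link labels, as read off the matrix.
pairs = {("GPU0", "GPU1"): "NV4", ("GPU0", "GPU2"): "NODE", ("GPU0", "GPU3"): "SYS"}

def score(label):
    # NVLink entries appear as NV1, NV2, ...; strip the digit for ranking
    return RANK["NV"] if label.startswith("NV") else RANK[label]

# Pick the pair with the fastest interconnect
best = min(pairs, key=lambda p: score(pairs[p]))
print(best)
```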
Continuous Logging
nvidia-smi dmon -s u,m,p,t -c 300 > gpu-log.txt
Samples utilisation, memory, power and temperature, and PCIe throughput once per second (the default interval) for 300 samples. Note the output is space-aligned columns with # header rows, not CSV. Useful for post-mortem on a load test or identifying thermal throttling.
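A sketch of scanning such a log after the fact for the hottest sample. The log lines below are a made-up example; real dmon column order can vary by driver version, which is why the code indexes columns by header name rather than position:

```python
# Hypothetical dmon log excerpt: '#' header rows, then space-aligned data.
log = """\
# gpu   pwr  gtemp    sm   mem     fb  rxpci  txpci
# Idx     W      C     %     %    MB   MB/s   MB/s
    0   312     74    98    61  30522    812   1650
    0   318     79    99    63  30524    790   1702
"""

header = None
peak_temp = 0
for line in log.splitlines():
    if line.startswith("#"):
        if header is None:
            header = line.lstrip("#").split()  # first header row has names
        continue
    row = dict(zip(header, line.split()))
    peak_temp = max(peak_temp, int(row["gtemp"]))

print(peak_temp)  # hottest GPU temperature seen in the log
```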
Flags:
-s u: utilisation (SM, memory, encoder, decoder)
-s m: frame buffer and BAR1 memory
-s p: power draw and temperature
-s t: PCIe throughput
-s c: clocks
-s e: ECC and PCIe replay errors
GPU Server Tooling Ready
Preinstalled nvidia-smi, DCGM, and Prometheus on UK dedicated GPU hosting.
Browse GPU Servers. See DCGM Exporter and GPU power management.