When using a system with an Nvidia GPU, the nvidia-smi utility will likely be
installed. This program can be used to monitor and manage Nvidia devices.
By default (i.e. with no arguments) the command displays a summary of the
devices present, the driver and CUDA versions, and any running GPU processes.
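If only a few specific values are needed rather than the full summary, nvidia-smi
can also print selected fields in machine-readable form. For example (an
illustrative use of the --query-gpu and --format options; see the manpage for the
full list of queryable fields):

$ nvidia-smi --query-gpu=name,driver_version,memory.total,utilization.gpu,temperature.gpu --format=csv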
Using the dmon subcommand, nvidia-smi can also print selected metrics, including
GPU utilisation, GPU temperature and GPU memory utilisation, at regular
intervals.
$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    32    49     -     1     1     0     0  4006   974
    0    32    49     -     2     2     0     0  4006   974
The columns displayed, the output format and the sampling interval can all be
configured. The nvidia-smi manpage (man nvidia-smi) gives full details.
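For example, the following (an illustrative use of dmon's -s, -d, -c and -o
options) prints only the utilisation metrics every 5 seconds, stops after 10
samples, and prefixes each line with the time and date:

$ nvidia-smi dmon -s u -d 5 -c 10 -o TD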
Here is an example which could be incorporated into a SLURM script. Every 300
seconds the selected metrics are written to a file named using the SLURM array
job and task IDs, as discussed in the SLURM section. The monitoring process is
sent to the background and then stopped once $command has finished.
...
# Record power/temperature, utilisation, clock and PCIe throughput metrics
# every 300 seconds, prefixed with the time and date.
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
gpu_watch_pid=$!
$command
# Stop the background monitoring once the main command has finished.
kill $gpu_watch_pid
...
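For context, a minimal sketch of a complete SLURM array job script using this
pattern is shown below. The #SBATCH directives and the workload are placeholders
for illustration and will depend on your cluster:

#!/bin/bash
#SBATCH --job-name=gpu-monitor-example
#SBATCH --gres=gpu:1            # request one GPU (syntax may differ per cluster)
#SBATCH --array=1-4             # example array of four tasks
#SBATCH --time=01:00:00

command="python train.py"       # hypothetical workload; replace with your own

# Log GPU metrics every 300 seconds in the background.
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
gpu_watch_pid=$!

$command

kill $gpu_watch_pid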