When using a system with an Nvidia GPU, the nvidia-smi utility will likely be installed. This program can be used to monitor and manage Nvidia devices. By default (i.e. with no arguments) the command displays a summary of the devices, the driver and CUDA versions, and GPU processes. Using the dmon command, nvidia-smi can also print selected metrics, including GPU utilisation, GPU temperature and GPU memory utilisation, at regular intervals.
$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    32    49     -     1     1     0     0  4006   974
    0    32    49     -     2     2     0     0  4006   974
The columns displayed, the output format and the sampling interval can all be configured. The manpage of nvidia-smi gives full details (man nvidia-smi).
Here is an example which could be incorporated into a SLURM script. Every 300 seconds the selected metrics are written to a file named using the SLURM array job and task IDs, as discussed in the SLURM section. The monitoring process is sent to the background and stopped after $command has run.
...
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
gpu_watch_pid=$!
$command
kill $gpu_watch_pid
...
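As a sketch of how this fragment might sit inside a complete batch script, an illustrative Slurm array job is shown below. The job name, resource requests, array range and the contents of $command are assumptions and will need to be adapted to your cluster and workload.

#!/bin/bash
# Illustrative Slurm array job: the job name, resources, array range
# and workload below are assumptions and should be adapted.
#SBATCH --job-name=gpu-monitor-example
#SBATCH --gres=gpu:1
#SBATCH --array=1-4
#SBATCH --time=02:00:00

# Hypothetical placeholder for the actual workload
command="./run_workflow.sh"

# Start periodic GPU monitoring in the background and remember its PID
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
gpu_watch_pid=$!

# Run the workload, then stop the monitoring process
$command
kill $gpu_watch_pid

After the job completes, the dmon output file for each array task can be inspected to relate GPU utilisation to the phases of the workload.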
When running these workflows on HPC you will most likely use the Slurm scheduler to submit, monitor and manage your jobs. The Slurm website provides a user tutorial and documentation which cover Slurm and its commands in comprehensive detail; these are of particular interest to users.
This section does not aim to be a comprehensive guide to Slurm, or even a brief introduction. Instead, it is intended to provide suggestions and a template for running this project's workflows on a cluster with Slurm.