# HPC

## SLURM

## Nvidia SMI

When using a system with an Nvidia GPU, the `nvidia-smi` utility will likely be installed. This program can be used to monitor and manage Nvidia devices. By default (*i.e.* with no arguments) the command will display a summary of the devices, the driver and CUDA versions, and any GPU processes.

By using the `dmon` command, `nvidia-smi` can also print selected metrics, including GPU utilisation, GPU temperature and GPU memory utilisation, at regular intervals.

```bash
$ nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec   mclk   pclk
# Idx      W      C      C      %      %      %      %    MHz    MHz
    0     32     49      -      1      1      0      0   4006    974
    0     32     49      -      2      2      0      0   4006    974
```

The columns displayed, the output format and the sampling interval can all be configured. The `nvidia-smi` manpage gives full details (`man nvidia-smi`).

Here is an example which could be incorporated into a SLURM script. This will display:

- Time and date
- Power usage in Watts
- GPU and memory temperatures in °C
- Streaming multiprocessor, memory, encoder and decoder utilisation as a % of maximum
- Processor and memory clock speeds in MHz
- PCIe throughput input (Rx) and output (Tx) in MB/s

Every 300 seconds this information will be written to a file named using the SLURM array job and task IDs, as discussed in [the SLURM section](#slurm). The monitoring process is sent to the background and stopped after `$command` has run.

```bash
...
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.txt" &
gpu_watch_pid=$!

$command

kill $gpu_watch_pid
...
```
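
To show how this snippet might sit inside a complete job script, here is a minimal sketch of a SLURM array batch script. The job name, time limit, GPU request and the placeholder `$command` workload are illustrative assumptions, not part of the original example; adjust them for your site and application.

```bash
#!/bin/bash
# Minimal sketch of a SLURM array job wrapping the monitoring snippet above.
# The SBATCH values and the workload below are hypothetical placeholders.
#SBATCH --job-name=gpu-monitor-example
#SBATCH --array=0-3
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Hypothetical workload; replace with your real application.
command="./my_gpu_program --input data_${SLURM_ARRAY_TASK_ID}.dat"

# Start GPU monitoring in the background, logging selected metrics
# (power/temperature, utilisation, clocks, PCIe throughput) every 300 seconds.
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.txt" &
gpu_watch_pid=$!

# Run the workload, then stop the monitor.
$command
kill "$gpu_watch_pid"
```

Because `kill` is only reached once `$command` returns, the log covers the lifetime of the workload. If the job might be cancelled early, adding `trap 'kill $gpu_watch_pid' EXIT` after starting the monitor would make the cleanup more robust.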