|
@@ -0,0 +1,54 @@
|
|
|
|
+# HPC
|
|
|
|
+
|
|
|
|
+## SLURM
|
|
|
|
+
|
|
|
|
+## Nvidia SMI
|
|
|
|
+
|
|
|
|
+When using as system with an Nvidia GPU, the `nvidia-smi` utility will likely be
|
|
|
|
+installed. This program can be used to monitor and manage for Nvidia devices.
|
|
|
|
+By default (*i.e.* with no arguments) the command will display a summary of
|
|
|
|
+devices, driver and CUDA version and GPU processes.
|
|
|
|
+
|
|
|
|
+By using the `dmon` command `nvidia-smi` can also be used to periodically print
|
|
|
|
+selected metrics, include GPU utilisation, GPU temperature and GPU memory
|
|
|
|
+utilisation, at regular intervals.
|
|
|
|
+
|
|
|
|
+```bash
|
|
|
|
+$ nvidia-smi dmon
|
|
|
|
+# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
|
|
|
|
+# Idx W C C % % % % MHz MHz
|
|
|
|
+ 0 32 49 - 1 1 0 0 4006 974
|
|
|
|
+ 0 32 49 - 2 2 0 0 4006 974
|
|
|
|
+```
|
|
|
|
+
|
|
|
|
+The columns displayed, format and interval can all be configured. The manpage of
|
|
|
|
+`nvidia-smi` gives full details (`man nvidia-smi`).
|
|
|
|
+
|
|
|
|
+Here is an example which could be incorporated into a SLURM script. This will
|
|
|
|
+display
|
|
|
|
+
|
|
|
|
+- Time and date
|
|
|
|
+- Power usage in Watts
|
|
|
|
+- GPU and memory temperature in C
|
|
|
|
+- Streaming multiprocessor, memory, encoder and decoder utilisation as a % of
|
|
|
|
+ maximum
|
|
|
|
+- Processor and memory clock speeds in MHz
|
|
|
|
+- PCIe throughput input (Rx) and output (Tx) in MB/s
|
|
|
|
+
|
|
|
|
+Every 300 seconds this information will be saved to a file named using the SLURM
|
|
|
|
+array job and task IDs as discussed in [the SLURM section](#slurm)
|
|
|
|
+
|
|
|
|
+This job is sent to the background and stopped after the `$command` has run.
|
|
|
|
+
|
|
|
|
+```bash
|
|
|
|
+...
|
|
|
|
+
|
|
|
|
+nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
|
|
|
|
+gpu_watch_pid=$!
|
|
|
|
+
|
|
|
|
+$command
|
|
|
|
+
|
|
|
|
+kill $gpu_watch_pid
|
|
|
|
+
|
|
|
|
+...
|
|
|
|
+```
|