
Add nvidia-smi dmon hints

Jim Madge, 3 years ago
commit 68a935c534
1 changed file with 54 additions and 0 deletions

+ 54 - 0
docs/hpc.md

@@ -0,0 +1,54 @@
+# HPC
+
+## SLURM
+
+## Nvidia SMI
+
+When using a system with an Nvidia GPU, the `nvidia-smi` utility will likely be
+installed. This program can be used to monitor and manage Nvidia devices.
+By default (*i.e.* with no arguments) the command will display a summary of the
+devices, the driver and CUDA versions, and GPU processes.
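+
+For example, a bare invocation prints the one-shot summary:
+
+```bash
+# Print the summary: driver and CUDA versions, and per-GPU utilisation,
+# memory usage and running processes
+nvidia-smi
+```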
+
+By using the `dmon` command, `nvidia-smi` can also print selected metrics,
+including GPU utilisation, GPU temperature and GPU memory utilisation, at
+regular intervals.
+
+```bash
+$ nvidia-smi dmon
+# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
+# Idx     W     C     C     %     %     %     %   MHz   MHz
+    0    32    49     -     1     1     0     0  4006   974
+    0    32    49     -     2     2     0     0  4006   974
+```
+
+The columns displayed, the output format and the sampling interval can all be
+configured. The `nvidia-smi` manpage (`man nvidia-smi`) gives full details.
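+
+As a minimal sketch (the flag letters follow the manpage: `-s` selects the
+metric groups, `-d` sets the sampling interval in seconds, `-o T` prepends a
+time column and `-c` limits the number of samples):
+
+```bash
+# Sample utilisation (u) and clock (c) metrics every 10 seconds, with a
+# time column, stopping after 6 samples (roughly one minute in total)
+nvidia-smi dmon -s uc -d 10 -o T -c 6
+```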
+
+Here is an example which could be incorporated into a SLURM script. It will
+display:
+
+- Time and date
+- Power usage in Watts
+- GPU and memory temperature in °C
+- Streaming multiprocessor, memory, encoder and decoder utilisation as a % of
+  maximum
+- Processor and memory clock speeds in MHz
+- PCIe throughput input (Rx) and output (Tx) in MB/s
+
+Every 300 seconds this information will be appended to a file named using the
+SLURM array job and task IDs, as discussed in [the SLURM section](#slurm).
+
+The monitoring process is sent to the background and killed once `$command`
+has finished.
+
+```bash
+...
+
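+# Start periodic GPU monitoring in the background and record its PID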
+nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
+gpu_watch_pid=$!
+
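+# Run the job's main workload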
+$command
+
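+# Stop GPU monitoring once the workload has finished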
+kill $gpu_watch_pid
+
+...
+```
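+
+For context, here is a minimal sketch of how the fragment above might sit
+inside a SLURM array job script. The `#SBATCH` directives and the workload
+assigned to `$command` are illustrative assumptions, not part of the snippet
+above; the array directive is what makes `SLURM_ARRAY_JOB_ID` and
+`SLURM_ARRAY_TASK_ID` available.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=gpu-job
+#SBATCH --array=0-3   # provides SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID
+#SBATCH --gres=gpu:1
+
+# Hypothetical workload; replace with the real command for the job
+command="python train.py --run ${SLURM_ARRAY_TASK_ID}"
+
+# Record GPU metrics every 300 seconds while the workload runs
+nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".txt &
+gpu_watch_pid=$!
+
+$command
+
+kill $gpu_watch_pid
+```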