# HPC

## Nvidia SMI

On a system with an Nvidia GPU, the `nvidia-smi` utility will most likely be
installed. This program can be used to monitor and manage Nvidia devices. By
default (*i.e.* with no arguments) the command displays a summary of the
available devices, the driver and CUDA versions, and any GPU processes.

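If you only need a few specific properties, `nvidia-smi` can also report them
in a machine-readable form. The sketch below uses the `--query-gpu` and
`--format` options; the particular fields queried are chosen purely for
illustration.

```bash
# Print selected properties for every GPU as CSV
nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total --format=csv

# Drop the header and units if the output will be parsed by another tool
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
```
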
By using the `dmon` command, `nvidia-smi` can also be used to print selected
metrics, including GPU utilisation, GPU temperature and GPU memory utilisation,
at regular intervals.

```bash
$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    32    49     -     1     1     0     0  4006   974
    0    32    49     -     2     2     0     0  4006   974
```

The columns displayed, the output format and the sampling interval can all be
configured. The manpage of `nvidia-smi` gives full details (`man nvidia-smi`).

Here is an example which could be incorporated into a Slurm script. It will
display

- Time and date
- Power usage in Watts
- GPU and memory temperature in °C
- Streaming multiprocessor, memory, encoder and decoder utilisation as a % of
  maximum
- Processor and memory clock speeds in MHz
- PCIe throughput input (Rx) and output (Tx) in MB/s

Every 300 seconds this information is written to a file named using the Slurm
array job and task IDs, as discussed in [the Slurm
section](#parametrising-job-arrays).

The monitoring process is sent to the background and stopped after `$COMMAND`
has run.

```bash
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.txt" &
GPU_WATCH_PID=$!

$COMMAND

kill $GPU_WATCH_PID
```

## Slurm

When running these workflows on HPC you will most likely use the
[Slurm](https://www.schedmd.com/) scheduler to submit, monitor and manage your
jobs.

The Slurm website provides a user
[tutorial](https://slurm.schedmd.com/tutorials.html) and
[documentation](https://slurm.schedmd.com/documentation.html) which describe
Slurm and its commands in comprehensive detail.

Of particular interest to users are

- [Slurm command man pages](https://slurm.schedmd.com/man_index.html)
- [Slurm command summary cheat
  sheet](https://slurm.schedmd.com/pdfs/summary.pdf)
- [Array support overview](https://slurm.schedmd.com/job_array.html)

This section does not aim to be a comprehensive guide to Slurm, or even a brief
introduction. Instead, it is intended to provide suggestions and a template for
running this project's workflows on a cluster with Slurm.

### Requesting GPUs

To request GPUs for a job in Slurm you may use the [Generic Resource
(GRES)](https://slurm.schedmd.com/gres.html#Running_Jobs) plugin. The precise
details will depend on the cluster you are using (for example, requesting a
particular model of GPU); however, in most cases you will be able to request
`n` GPUs with the flag `--gres=gpu:n`. For example

```bash
# Run a program interactively on one GPU
$ srun --gres=gpu:1 my_program

# Submit a batch script requesting four GPUs
$ sbatch --gres=gpu:4 script.sh
Submitted batch job 43
```

Or in a batch script

```bash
#SBATCH --gres=gpu:1
```

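Some clusters define GRES types so that a particular GPU model can be requested
as `gpu:<type>:<count>`. The type name below (`v100`) is only a placeholder;
check your cluster's documentation for the names it actually uses.

```bash
# Request two GPUs of a specific (cluster-defined) type in a batch script
#SBATCH --gres=gpu:v100:2
```

The same form can be passed on the command line, for example
`sbatch --gres=gpu:v100:2 script.sh`.
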
### Benchmarking

A rudimentary way to monitor performance is to measure how long a given task
takes to complete. One way to achieve this, if the software you are running
provides no other way, is to run the `date` command before and after your
program.

```bash
date --iso-8601=seconds --utc
$COMMAND
date --iso-8601=seconds --utc
```

The flag and parameter `--iso-8601=seconds` ensures the output is in ISO 8601
format with precision up to and including seconds. The `--utc` flag means that
the time will be printed in Coordinated Universal Time.

The program's start and end times will then be recorded in the STDOUT file.

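If you would rather record the elapsed time directly, a little shell arithmetic
around the same command also works. This is a minimal sketch; `$COMMAND` stands
in for whatever you are running.

```bash
# Record wall-clock time around the command using seconds since the epoch
START=$(date +%s)
$COMMAND
END=$(date +%s)
echo "Elapsed time: $((END - START)) seconds"
```
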
### Repeated runs (job arrays)

If you are assessing a system's performance you will likely want to repeat the
same calculation a number of times until you are satisfied with your estimate
of mean performance. It would be possible to simply resubmit the same job
repeatedly, and many people are tempted to engineer their own scripts to do so.
However, Slurm provides a way to submit groups of jobs that you will most
likely find more convenient.

When submitting a job with `sbatch` you can specify the size of your job array
with the `--array=` flag using a range of numbers *e.g.* `0-9` or a comma
separated list *e.g.* `1,2,3`. You can use `:` with a range to specify a
stride, for example `1-5:2` is equivalent to `1,3,5`. You may also specify the
maximum number of jobs from an array that may run simultaneously using `%`
*e.g.* `0-31%4`.

Here are some examples

```bash
# Submit 10 jobs with indices 1,2,3,...,10
sbatch --array=1-10 script.sh

# Submit 4 jobs with indices 1, 5, 9, 13, with at most two of these running
# simultaneously
sbatch --array=1-16:4%2 script.sh
```

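The tasks of an array can be monitored like any other job, for example with
`squeue` while they are queued or running and `sacct` once they have finished.
The job ID `42` below is only a placeholder.

```bash
# List your queued and running jobs, including individual array tasks
squeue -u $USER

# Summarise the state and elapsed time of a finished array (42 is a placeholder)
sacct -j 42 --format=JobID,JobName,State,Elapsed
```
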
### Parametrising job arrays

One particularly powerful way to use job arrays is through parametrising the
individual tasks. For example, this could be used to sweep over a set of input
parameters or data sets. As with using job arrays for repeated runs, this will
likely be more convenient than implementing your own solution.

Within your batch script you will have access to the following environment
variables

| environment variable     | value                    |
|--------------------------|--------------------------|
| `SLURM_ARRAY_JOB_ID`     | job id of the first task |
| `SLURM_ARRAY_TASK_ID`    | current task index       |
| `SLURM_ARRAY_TASK_COUNT` | total number of tasks    |
| `SLURM_ARRAY_TASK_MAX`   | the highest index value  |
| `SLURM_ARRAY_TASK_MIN`   | the lowest index value   |

For example, if you submitted a job array with the command

```bash
$ sbatch --array=0-12:4 script.sh
Submitted batch job 42
```

then the job id of the first task is `42` and the four jobs will have
`SLURM_ARRAY_JOB_ID`, `SLURM_ARRAY_TASK_ID` pairs of

- 42, 0
- 42, 4
- 42, 8
- 42, 12

These environment variables can be used in your commands. For example, with

```bash
my_program -n $SLURM_ARRAY_TASK_ID -o output_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
```

and the same `sbatch` command as before, the following commands would be
executed (one in each job)

- `my_program -n 0 -o output_42_0`
- `my_program -n 4 -o output_42_4`
- `my_program -n 8 -o output_42_8`
- `my_program -n 12 -o output_42_12`

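The task index can also be mapped onto arbitrary parameter values or input
files. A minimal sketch, assuming a job submitted with `--array=0-3`, four
hypothetical data set names and a hypothetical `--input` option:

```bash
# Hypothetical list of data sets; submit with: sbatch --array=0-3 script.sh
DATASETS=(set_a set_b set_c set_d)

# Select the data set for this task using the array task index
INPUT=${DATASETS[$SLURM_ARRAY_TASK_ID]}

my_program --input "$INPUT" -o "output_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
```
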
### Using scratch space

Most HPC systems will offer some sort of fast, temporary and typically on-node
storage, such as NVMe SSDs. In calculations where reading or writing data is a
bottleneck, using this storage will be key to optimising performance.

The details of this scratch space will differ between HPC systems, and changes
will need to be made when transferring workflows between them. However, a
combination of templating and Singularity binds can make these adjustments less
tedious and more robust.

The following snippet shows how this may be done.

```bash
# Path to scratch disk on host
HOST_SCRATCH_PATH=/scratch
# Path to input data on host
INPUT_DATA=/path/to/input/data
# Get name of input data directory
INPUT_DIR=$(basename $INPUT_DATA)
# Path to place output data on host
OUTPUT_DIR=/path/to/output/dir

# Create a directory on scratch disk for this job
JOB_SCRATCH_PATH=$HOST_SCRATCH_PATH/${SLURM_JOB_NAME}_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
mkdir -p $JOB_SCRATCH_PATH

# Copy input data to scratch directory
cp -r $INPUT_DATA $JOB_SCRATCH_PATH

# Make output data directory
mkdir -p $JOB_SCRATCH_PATH/output

# Run the application
singularity run --bind $JOB_SCRATCH_PATH:/scratch_mount --nv my_container.sif --input /scratch_mount/$INPUT_DIR --output /scratch_mount/output/

# Copy output from scratch
cp -r $JOB_SCRATCH_PATH/output $OUTPUT_DIR/output_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}

# Clean up
rm -rf $JOB_SCRATCH_PATH
```

This example uses the array job ID and array task ID to reduce the possibility
of a name clash when creating the directory on the scratch disk and when
copying outputs back. Ideally each job will be given a scratch directory in a
unique namespace so there is no possibility of file or directory names clashing
between different jobs.

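Where the system does not provide such a namespace, one option is to let
`mktemp` create a uniquely named scratch directory rather than composing the
name by hand. A minimal sketch, reusing the variables defined above:

```bash
# Create a uniquely named scratch directory under the host scratch path
JOB_SCRATCH_PATH=$(mktemp -d "$HOST_SCRATCH_PATH/${SLURM_JOB_NAME:-job}_XXXXXX")

# ... copy data in, run the application and copy outputs out as above ...

# Clean up
rm -rf "$JOB_SCRATCH_PATH"
```
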
### Template

Collecting the above tips, here is a template batch script that can be adapted
to run these (or other) calculations on clusters with the Slurm scheduler.

```bash
#!/bin/bash

##########
# Slurm parameters
##########

# set the number of nodes
#SBATCH --nodes=...

# set max wallclock time
#SBATCH --time=...

# set name of job
#SBATCH --job-name=...

# set number of GPUs
#SBATCH --gres=gpu:...

##########
# Job parameters
##########

# Path to scratch disk on host
HOST_SCRATCH_PATH=...

# Path to input data on host
INPUT_DATA=...

# Get name of input data directory
INPUT_DIR=$(basename $INPUT_DATA)

# Path to place output data on host
OUTPUT_DIR=...

##########
# Prepare data and directories in scratch space
##########

# Create a directory on scratch disk for this job
JOB_SCRATCH_PATH=$HOST_SCRATCH_PATH/${SLURM_JOB_NAME}_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
mkdir -p $JOB_SCRATCH_PATH

# Copy input data to scratch directory
cp -r $INPUT_DATA $JOB_SCRATCH_PATH

# Make output data directory
mkdir -p $JOB_SCRATCH_PATH/output

# Define command to run, now that JOB_SCRATCH_PATH is set for the bind
COMMAND="singularity exec --nv --bind $JOB_SCRATCH_PATH:/scratch_mount ..."

##########
# Monitor and run the job
##########

# load modules (will be system dependent, may not be necessary)
module purge
module load singularity

# Monitor GPU usage
nvidia-smi dmon -o TD -s puct -d 300 > "dmon-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.txt" &
GPU_WATCH_PID=$!

# Run command
date --iso-8601=seconds --utc
$COMMAND
date --iso-8601=seconds --utc

##########
# Post job clean up
##########

# Stop nvidia-smi dmon process
kill $GPU_WATCH_PID

# Copy output from scratch
cp -r $JOB_SCRATCH_PATH/output $OUTPUT_DIR/output_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}

# Clean up
rm -rf $JOB_SCRATCH_PATH
```

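Since the template uses the array-related environment variables in its file and
directory names, it assumes submission as a job array once the `...`
placeholders have been filled in. A usage sketch (the script name and index
range are examples):

```bash
# Submit five repeats of the filled-in template as a job array
sbatch --array=0-4 template.sh
```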