* No CUDA enabled GPU on your laptop
* Don’t want your laptop to become a radiator
* Run parallel experiments and get results quicker
* Hyperparameter optimization
* BIG GPUs
* Free service
* Relatively high freedom to manage your own environment

* I have a huge GPU in my desktop; it’s enough for what I want to do
* Prefer to learn commercial cloud services (AWS, GCP, Azure)

## Getting access

First of all, it's worth checking whether your supervisor already has access to HPC services and, if so, requesting an account there.

If not, you can request an account yourself:

* CS HPC at https://hpc.cs.ucl.ac.uk/account-form/
* UCL Research Computing (RC) at https://signup.rc.ucl.ac.uk/computing/requests/new

This guide uses the CS HPC cluster, but the two are very similar. Worth noting that I have borrowed loads from the *much better* RC docs: https://www.rc.ucl.ac.uk/docs/.

The IHI is in the process of procuring its own HPC services, but we're not there yet :(.

## How does it work?

In short: you connect to a shared login node, describe what you want to run in a job script, and hand that script to the scheduler (Sun Grid Engine on the CS cluster), which runs it on compute nodes when resources become free.

## How do I use it?

Most people use something like the following workflow (a minimal end-to-end sketch follows the list):

 - connect to the cluster's "login nodes"
 - create a script of commands to run your programs
 - submit the script to the scheduler
 - wait for the scheduler to find available "compute nodes" and run the script
 - look at the results in the files the script created
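
Putting those steps together, a typical session looks something like the sketch below (the script and file names are placeholders for illustration):

```bash
# 1. Connect to a login node (see "Logging In" below)
ssh <your_UCL_user_id>@<login_node>.cs.ucl.ac.uk

# 2. Write a job script (see the serial job example below)
nano my_job.sh

# 3. Submit it to the scheduler
qsub my_job.sh

# 4. Check its status while it queues and runs
qstat

# 5. When it has finished, inspect the files it wrote
ls ~/Scratch/
```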

### Logging In

#### Simple way

You will need to either use the [UCL Virtual Private Network](https://www.ucl.ac.uk/isd/services/get-connected/ucl-virtual-private-network-vpn/) or first ssh in to the CS gateway `tails.cs.ucl.ac.uk`. From tails you can then ssh on to a login node:

```
ssh <your_UCL_user_id>@tails.cs.ucl.ac.uk
ssh <your_UCL_user_id>@<login_node>.cs.ucl.ac.uk
```

There are a few login nodes available, but it shouldn't really matter which one you use. `gamble` is the one I use.

#### Rapidos way

You can also set this up in your `~/.ssh/config` (swap `vauvelle` for your own username):

```
Host gamble
  HostName gamble.cs.ucl.ac.uk
  User vauvelle
  ProxyJump vauvelle@tails.cs.ucl.ac.uk
```
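
With that in place, a single command hops through tails for you, and file transfers use the same alias (the file name below is just an example):

```bash
# ssh now proxies through tails.cs.ucl.ac.uk automatically
ssh gamble

# scp and rsync pick up the same ~/.ssh/config entry
scp my_results.tar.gz gamble:~/Scratch/
```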

## Scheduler

### qsub
Submit a job to the scheduler with `qsub`:
```bash
qsub /path/to/submission/script.sh
```
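Resource requests and the job name can also be given on the command line instead of as `#$` lines inside the script; for example:

```bash
# Request one hour of wallclock time and set the job name at submission time
qsub -l h_rt=1:0:0 -N my_test_job /path/to/submission/script.sh
```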
### qstat
Get the status of your jobs with `qstat`. With no arguments it lists your queued (`qw`) and running (`r`) jobs; `qstat -j <job-ID>` prints detailed information about a single job:
```bash
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6506636 0.00000 testing jbloggs qw 21/12/2012 11:11:11 1
qstat -j <job-ID>
```

### qdel
Delete a job from the scheduler with `qdel`, using the job ID shown by `qsub` and `qstat`:
```bash
qdel <job-ID>
```

## Serial Job Script Example

The most basic type of job a user can submit is a serial job. These jobs run on a single processor (core) with a single thread.

Shown below is a simple job script that runs `/bin/date` (which prints the current date) on the compute node and puts the output into a file.

```bash
#!/bin/bash -l

# Batch script to run a serial job under SGE.

# Request ten minutes of wallclock time (format hours:minutes:seconds).
#$ -l h_rt=0:10:0

# Request 1 gigabyte of RAM (must be an integer followed by M, G, or T)
#$ -l mem=1G

# Request 15 gigabytes of TMPDIR space (default is 10 GB - remove if cluster is diskless)
#$ -l tmpfs=15G

# Set the name of the job.
#$ -N Serial_Job

# Set the working directory to somewhere in your scratch space.
# This is a necessary step as compute nodes cannot write to $HOME.
# Replace "<your_UCL_id>" with your UCL user ID.
#$ -wd /home/<your_UCL_id>/Scratch/workspace

# Your work should be done in $TMPDIR
cd $TMPDIR

# Run the application and put the output into a file called date.txt
/bin/date > date.txt

# Preferably, tar-up (archive) all output files onto the shared scratch area
tar -zcvf $HOME/Scratch/files_from_job_$JOB_ID.tar.gz $TMPDIR

# Make sure you have given enough time for the copy to complete!
```
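
Submit it with `qsub` and check on it with `qstat`; once it finishes, the scheduler leaves the job's stdout/stderr in files named after the job in the working directory (the script file name here is just an example):

```bash
qsub serial_job.sh        # the scheduler replies with the assigned job ID
qstat                     # state goes from qw (queued) to r (running), then the job disappears when done
ls ~/Scratch/workspace    # Serial_Job.o<job-ID> and Serial_Job.e<job-ID> hold stdout and stderr
```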

## Hyperopt

For hyperparameter optimization I use [hyperopt](http://hyperopt.github.io/hyperopt/) with its MongoDB-backed workers: a MongoDB instance stores the trials, and each cluster job runs `hyperopt-mongo-worker`, which pulls suggested trials, evaluates them, and writes the results back. My job script (an array of 10 tasks, at most 4 running at once):

```bash
# Request 16 GB of memory and 9 hours of wallclock time per task
#$ -l tmem=16G
#$ -l h_rt=9:0:0
# Request a GPU
#$ -l gpu=true
#$ -S /bin/bash
# Merge stderr into stdout
#$ -j y
#$ -N gpu_worker50
# Array job: 10 tasks, at most 4 running concurrently
#$ -t 1-10
#$ -tc 4

# Write job logs here
#$ -o /home/vauvelle/doctor_signature/jobs/logs
hostname
date
PROJECT_DIR='/home/vauvelle/doctor_signature/'
export PYTHONPATH=$PYTHONPATH:$PROJECT_DIR
cd $PROJECT_DIR || exit
# Load Python and CUDA, then activate the project environment
source /share/apps/source_files/python/python-3.7.0.source
source /share/apps/source_files/cuda/cuda-10.1.source
source .env
source ./.myenv/bin/activate
echo "Pulling any jobs with status 0"
# Each task becomes a worker that pulls trials from the hyperopt MongoDB
hyperopt-mongo-worker --mongo=bigtop:27017/hyperopt --poll-interval=0.1 --max-consecutive-failures=5
date
```
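
The workers need a MongoDB instance they can reach (here assumed to live on a machine called `bigtop`, as in the script above) and a driver process that defines the search space and submits trials to that database. A rough sketch of the setup, with paths and script names as placeholders:

```bash
# On the machine hosting the database (bigtop here), start MongoDB -
# the hyperopt scaleout docs linked below show the same pattern
mongod --dbpath "$HOME/hyperopt_db" --port 27017 &

# Run your hyperopt driver script (it defines the search space and calls fmin
# with MongoTrials pointing at mongo://bigtop:27017/hyperopt/jobs)
python run_search.py &

# Submit the worker array job above so trials get evaluated on the cluster
qsub hyperopt_worker.sh
```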

Hyperopt MongoDB worker docs: http://hyperopt.github.io/hyperopt/scaleout/mongodb/

Other solutions: https://optuna.org/ (uses a relational database instead of MongoDB)

## Helpful stuff

The file count quota on the CS HPC is 150k files; watch out, as venv/conda lib folders can be huge. Find a folder's file count quickly with:
```bash
rsync --stats --dry-run -ax /path/to/folder /tmp
```