Slurm reference and useful tips for running jobs on the lab's cluster. Posting it here for future reference, since it might be useful to other people.
Base slurm batch file
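A minimal sketch of a base batch file. The job name, resource amounts, and time limit are placeholder values rather than the cluster's real settings, and no partition is set because partition names are site-specific.

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder name shown in squeue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # placeholder resource request
#SBATCH --mem=8G                  # placeholder resource request
#SBATCH --time=01:00:00           # wall-clock limit, HH:MM:SS
#SBATCH --output=slurm-%j.out     # %j expands to the job ID
#SBATCH --error=slurm-%j.err

set -eu

# Slurm sets SLURM_JOB_ID inside a job; fall back so the script also runs locally.
job_id="${SLURM_JOB_ID:-local}"
node="$(uname -n)"
echo "Job ${job_id} running on ${node}"

# Replace this comment with the real workload, for example: srun ./my_program
```

Submit it with `sbatch job.sh` (the file name is arbitrary) and check its state with `squeue -u $USER`.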
Dump stdout and stderr to /dev/null
For large jobs that produce a lot of output, it is often not worth keeping the program's output when everything is working properly.
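One way to do this is to point sbatch's `--output` and `--error` options at /dev/null. The job name below is a placeholder.

```shell
#!/bin/bash
#SBATCH --job-name=quiet-job      # placeholder name
#SBATCH --output=/dev/null        # discard stdout
#SBATCH --error=/dev/null         # discard stderr

echo "this line is discarded when the script is submitted with sbatch"
```

While debugging, it is probably worth switching these back to real files (for example `slurm-%j.out`) so that error messages are not lost.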
Generate data in RAM drive
Each node has a /dev/shm folder where temporary data can be stored very quickly, since it lives in memory and no disk I/O is needed.
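A rough sketch of a job that writes intermediate data to /dev/shm and copies only the final result to persistent storage. The `<user>_<jobid>` directory naming, the `RESULTS_DIR` variable, and the `part.txt` file are illustrative assumptions, not cluster conventions.

```shell
#!/bin/bash
#SBATCH --job-name=shm-example    # placeholder name

set -eu

# Use the RAM drive; fall back to /tmp when /dev/shm is absent (e.g. testing off-cluster).
shm_root=/dev/shm
[ -d "$shm_root" ] || shm_root="${TMPDIR:-/tmp}"

# Hypothetical per-job scratch directory: <user>_<jobid> ($$ is the shell PID outside Slurm).
scratch="$shm_root/${USER:-$(id -u)}_${SLURM_JOB_ID:-$$}"
mkdir -p "$scratch"

# Clean up when the script exits normally or on error.
# A cancelled or killed job may never reach this trap.
trap 'rm -rf "$scratch"' EXIT

# Stand-in for the real workload: write intermediate data to memory.
printf 'intermediate data\n' > "$scratch/part.txt"

# Copy only the final result back to persistent storage (hypothetical destination).
results_dir="${RESULTS_DIR:-$PWD/results}"
mkdir -p "$results_dir"
cp "$scratch/part.txt" "$results_dir/"
```

Keep in mind that anything stored in /dev/shm counts against the node's memory.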
If a job is canceled, the folder will remain in memory. Run the following script to force a cleanup. Always use rm -rf with caution.
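A possible cleanup script, assuming scratch directories follow a hypothetical `<user>_<jobid>` naming scheme directly under /dev/shm. Adjust the pattern to whatever names your jobs actually use.

```shell
#!/bin/bash
set -eu

# Remove leftover scratch directories owned by the current user.
# Assumes the hypothetical <user>_<jobid> naming scheme.
cleanup_shm() {
    root="$1"
    uid="$(id -u)"
    name="${USER:-$uid}"
    [ -d "$root" ] || return 0
    # -maxdepth 1 keeps the search at the top level of the folder;
    # -user ensures only directories owned by the caller are removed.
    find "$root" -mindepth 1 -maxdepth 1 -type d -user "$uid" \
        -name "${name}_*" -exec rm -rf -- {} +
}

# Default to /dev/shm; pass another directory as the first argument to override.
cleanup_shm "${1:-/dev/shm}"
```

The script has to run on the node that holds the leftover data. One option is `srun --nodelist=<node> bash cleanup_shm.sh`, where `cleanup_shm.sh` is whatever name the script was saved under.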
Only perform tasks for missing files
Sometimes when performing large jobs, some of the tasks can fail. If the user only wants to rerun the job for the missing files, the following pattern can be used
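A sketch of the pattern: loop over the expected output files and run a task only when its file is missing or empty. The `output` directory, the `result_<i>.txt` naming, and the `run_task` stand-in are all illustrative assumptions.

```shell
#!/bin/bash
set -eu

out_dir="${OUT_DIR:-output}"      # hypothetical output directory
num_tasks="${NUM_TASKS:-10}"      # hypothetical number of tasks

# Stand-in for the real work. It writes to a temporary name and renames on
# success, so a task that dies midway does not leave a complete-looking file.
run_task() {
    printf 'result %s\n' "$1" > "$2.tmp"
    mv "$2.tmp" "$2"
}

# Run every task whose output file is missing or empty; skip the rest.
run_missing() {
    mkdir -p "$out_dir"
    i=0
    while [ "$i" -lt "$num_tasks" ]; do
        out_file="$out_dir/result_${i}.txt"
        if [ ! -s "$out_file" ]; then
            run_task "$i" "$out_file"
        fi
        i=$((i + 1))
    done
}

run_missing
```

On the cluster, the body of `run_task` would typically be replaced by the real command, for example an `srun` call. Running the script a second time only redoes the tasks whose files are still missing.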
Problems with modules not being loaded properly
If a specific module needs to be loaded, add the following to the list of tasks run by the job.
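A hedged sketch of what this might look like. Batch scripts often run in a non-login shell where the `module` command is not defined, so the example sources an init script first. Both the init script path and the module name are assumptions; check `module avail` and your cluster's documentation for the real ones.

```shell
#!/bin/bash
#SBATCH --job-name=with-modules   # placeholder name

# Define the `module` command if the shell has not done so already.
# The path is an assumption (a common location); it may differ on your cluster.
if ! command -v module >/dev/null 2>&1; then
    . /etc/profile.d/modules.sh
fi

module purge        # start from a clean environment
module load cuda    # hypothetical module name; see `module avail`
```

Another option that sometimes helps is starting the script with `#!/bin/bash -l`, so that it runs as a login shell and picks up the usual profile scripts.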
Request a specific GPU resource
The --gres argument can be used to request a specific type of resource, such as a particular GPU model
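For example, assuming the cluster defines a GPU type called `v100` (a placeholder; type names are site-specific), a request could look like this:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-job        # placeholder name
#SBATCH --gres=gpu:v100:1         # one GPU of type "v100" (placeholder type name)
# To accept any GPU type, use instead:  --gres=gpu:1

# Slurm typically exposes the assigned GPU(s) through CUDA_VISIBLE_DEVICES.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```

To see which resource types each node offers, `sinfo -o "%N %G"` prints the node names alongside their generic resources.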
Reservations
To view all reservations on the cluster
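Assuming the standard Slurm command-line tools are available on the login node, either of these lists the current reservations:

```shell
# Detailed view: name, start and end time, nodes, and allowed users.
scontrol show reservation

# Compact one-line-per-reservation summary.
sinfo -T
```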
Run a job using your reservation. Note that the job will only run on the nodes in your reservation and on none of the other nodes.
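A minimal sketch, where `my_reservation` is a placeholder for the reservation name reported by `scontrol show reservation`:

```shell
#!/bin/bash
#SBATCH --job-name=reserved-job          # placeholder name
#SBATCH --reservation=my_reservation     # placeholder reservation name

echo "Running in reservation on $(uname -n)"
```

The same option can also be given on the command line instead of in the script: `sbatch --reservation=my_reservation job.sh`.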