Slurm reference and useful snippets for running jobs on the lab's cluster. It might be useful to other people, so I am posting it here for future reference.

Base slurm batch file

#!/bin/bash
#SBATCH -N 1                      # This requests one node
#SBATCH -n 8                      # Request 8 tasks (one core each by default)
#SBATCH -t 0-10:00                # Walltime days-hours:minutes
#SBATCH --job-name=job_name       # Name of the job
#SBATCH --array=0-100             # Array job with 101 tasks (indices 0 to 100)
cd "$SLURM_SUBMIT_DIR"            # Change to the directory the job was submitted from
./ThingToRun "$SLURM_ARRAY_TASK_ID" # Executable that will be run for each array task

Dump stdout and stderr to /dev/null

For large jobs that produce a lot of output, it is often not worth keeping the program's output when everything is working properly.

#SBATCH -o /dev/null        # Dump std out to null device
#SBATCH -e /dev/null        # Dump std err to null device
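
A possible middle ground, sketched here as an assumption rather than something taken from the setup above, is to discard stdout but keep one stderr file per array task so that failures can still be diagnosed. The logs directory is a hypothetical name and has to exist before the job is submitted. %A and %a are Slurm's filename patterns for the array job ID and the array task index.

```shell
#SBATCH -o /dev/null            # Discard stdout
#SBATCH -e logs/err_%A_%a.txt   # Keep stderr, one file per array task
```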

Generate data in RAM drive

Each node has a /dev/shm folder that is backed by memory, so temporary data can be stored there very quickly without any disk I/O. Anything stored there uses the node's RAM.

# Create folder in RAM drive
mkdir -p "/dev/shm/$SLURM_JOBID/"
# Run code that converts data stored on disk to a different format and stores the result in memory
./convert "input_$SLURM_ARRAY_TASK_ID.dat" "/dev/shm/$SLURM_JOBID/output_$SLURM_ARRAY_TASK_ID.dat"
# Do something with the data in the RAM drive
./doStuff "/dev/shm/$SLURM_JOBID/output_$SLURM_ARRAY_TASK_ID.dat"
# IMPORTANT: delete the folder in memory when done
rm -rf "/dev/shm/$SLURM_JOBID"

If a job is canceled, the folder will remain in memory. Run the following script to force a cleanup. Always use rm -rf with caution.

#!/bin/bash
array=(node1 node2 node3 node4)
for i in "${array[@]}"
do
  ssh "$i" 'rm -rf /dev/shm/*'    # Remove all files in /dev/shm/ that you have write access to
  echo "$i"
done
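
To reduce how often the manual cleanup above is needed, the job script can remove its own folder whenever it exits. The following is a minimal sketch, not part of the original setup. run_in_scratch is a hypothetical helper name. The sketch assumes the batch shell receives SIGTERM when the job is canceled, which can depend on the cluster configuration and on how scancel is invoked. If the job is killed outright with SIGKILL, the trap cannot run and the manual cleanup is still required.

```shell
#!/bin/bash
# Hypothetical helper: create a scratch folder, run a command with the folder
# path appended as its last argument, and always delete the folder afterwards.
# The body runs in a subshell, so the traps do not leak into the calling script.
run_in_scratch() (
  scratch=$1
  shift
  trap 'rm -rf -- "$scratch"' EXIT   # Runs whenever the subshell exits
  trap 'exit 143' TERM INT           # Turn a signal into an exit so the EXIT trap fires
  mkdir -p -- "$scratch" || exit 1
  "$@" "$scratch"                    # Run the command, passing the scratch path
  status=$?
  exit "$status"                     # Preserve the exit status of the command
)
```

In a batch script, the convert and doStuff steps could be wrapped in a shell function, here called work as an example, that takes the folder as its first argument, and then run with run_in_scratch "/dev/shm/$SLURM_JOBID" work.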

Only perform tasks for missing files

When running large jobs, some of the tasks can fail. To rerun the job only for the missing files, the following pattern can be used:

# If the image does not exist, run the task
if [ ! -f "output/$SLURM_ARRAY_TASK_ID.png" ]; then
  # Perform task
  ./RenderImage "$SLURM_ARRAY_TASK_ID"
fi
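
The check above still starts every array task, even though most of them exit immediately. An alternative, sketched here under the assumption that outputs are named output/INDEX.png as above, is to work out the missing indices before submitting and pass only those to sbatch. missing_tasks is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the indices in [first, last] whose PNG is missing
# from the given directory, as a comma-separated list (empty if none are missing).
# The body runs in a subshell so its variables do not leak into the caller.
missing_tasks() (
  dir=$1
  i=$2
  last=$3
  list=""
  while [ "$i" -le "$last" ]; do
    if [ ! -f "$dir/$i.png" ]; then
      list="${list:+$list,}$i"   # Add a comma only between entries
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$list"
)
```

The result can be passed on the command line, for example sbatch --array="$(missing_tasks output 0 100)" slurm_job.sh, since options given on the sbatch command line take precedence over the #SBATCH lines in the script. If the list is empty there is nothing to resubmit, so check for that before calling sbatch.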

Problems with modules not being loaded properly

If a specific module needs to be loaded, add the following to the batch script before the commands that depend on it.

module load modulename/version

Request a specific GPU resource

The --gres argument can be used to request a specific type of generic resource, such as a particular GPU model.

#SBATCH --gres=gpu:titanx:1   # Request one titanx

Reservations

To view all reservations on the cluster

scontrol show res

Run a job using your reservation. Note that the job will only run on the nodes in your reservation and on none of the other nodes.

sbatch --reservation=RESERVATION_NAME slurm_job.sh