Batch Submission of Serial Jobs for Parallel Execution
Large numbers of serial jobs can be very inefficient and troublesome on mixed-mode HPC systems, and the SCW Slurm deployment limits the number of running and submitted jobs any single user may have.
However, there are ways to submit multiple jobs:
- Running background jobs on a single node using shell process control, with wait to hold the job open until all processes finish.
- Combining GNU Parallel and Slurm’s srun command, which handles such situations in a more controlled and efficient way than in the past. With this method, a single job is submitted that requests an allocation of X cores, and GNU Parallel uses all of those cores by launching the serial tasks via srun.
- Using Job Arrays for very similar tasks; see the Job Arrays documentation and the minimal sketch below.
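As an illustration only (a sketch, not taken from the Job Arrays documentation), a minimal array submission might look like the following; the executable name my_exec, the input/output file naming and the partition choice are assumptions:

#!/bin/bash
#SBATCH -J array_example
#SBATCH -o array_example.%A_%a.log   # %A = array job ID, %a = array task index
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#set the partition, use compute if running in Swansea
#SBATCH -p htc
#SBATCH --array=1-64                 # run 64 independent array tasks, indexed 1..64

# each array task processes its own (hypothetical) input file
my_exec < input${SLURM_ARRAY_TASK_ID}.csv > output${SLURM_ARRAY_TASK_ID}.log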
Shell process control
Here is an example of submitting 2 processes on a single node:
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
#SBATCH -o example.log.%J
#SBATCH -e example.err.%J
#SBATCH -J example
#set the partition, use compute if running in Swansea
#SBATCH -p htc
#SBATCH --time=1:00:00
#SBATCH --exclusive

time my_exec < input1.csv > input1.log.$SLURM_JOBID &
time my_exec < input2.csv > input2.log.$SLURM_JOBID &

# important to make sure the batch job won't exit before all the
# simultaneous runs are completed.
wait
The my_exec commands in this case would be multithreaded to use 32 cores between them.
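If, for example, each my_exec process were OpenMP-threaded, one way to split the node's 32 cores between the two runs is to set the thread count explicitly before launching them in the background. This is a sketch only, assuming OpenMP threading and the same hypothetical my_exec as above:

# give each of the two background processes half of the node's 32 cores
export OMP_NUM_THREADS=16
time my_exec < input1.csv > input1.log.$SLURM_JOBID &
time my_exec < input2.csv > input2.log.$SLURM_JOBID &
wait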
GNU Parallel and Slurm’s srun command
Here is the example job submission file serial_batch.sh, with comments:
#!/bin/bash --login
#SBATCH -n 40                  #Number of processors in our pool
#SBATCH -o output.%J           #Job output
#SBATCH -t 12:00:00            #Max wall time for entire job
#change the partition to compute if running in Swansea
#SBATCH -p htc                 #Use the High Throughput partition which is intended for serial jobs

module purge
module load hpcw
module load parallel

# Define srun arguments:
srun="srun -n1 -N1 --exclusive"
# --exclusive     ensures srun uses distinct CPUs for each job step
# -N1 -n1         allocates a single core to each task

# Define parallel arguments:
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_joblog --resume"
# -N 1              is number of arguments to pass to each job
# --delay .2        prevents overloading the controlling node on short jobs
# -j $SLURM_NTASKS  is the number of concurrent tasks parallel runs, so number of CPUs allocated
# --joblog name     parallel's log file of tasks it has run
# --resume          parallel can use a joblog and this to continue an interrupted run (job resubmitted)

# Run the tasks:
$parallel "$srun ./runtask arg1:{1}" ::: {1..64}
# in this case, we are running a script named runtask, and passing it a single argument
# {1} is the first argument
# parallel uses ::: to separate options. Here {1..64} is a shell expansion defining the values for
# the first argument, but could be any shell command
#
# so parallel will run the runtask script for the numbers 1 through 64, with a max of 40 running
# at any one time
#
# as an example, the first job will be run like this:
# srun -N1 -n1 --exclusive ./runtask arg1:1
So, in the above we are requesting an allocation from Slurm of 40 processors, but we have 64 tasks to run. Parallel starts new tasks as soon as space on our allocation becomes available (i.e. as earlier tasks finish). Because this avoids the overhead of setting up a new full Slurm job for each task, it is far more efficient.
A simple ‘runtask’ script that demonstrates the principle by logging helpful text is included here, courtesy of the University of Chicago Research Computing Centre:
#!/bin/sh

# this script echoes some useful output so we can see what parallel
# and srun are doing

# pick a random sleep time between 10 and 19 seconds
sleepsecs=$(( (RANDOM % 10) + 10 ))s

# $1 is arg1:{1} from parallel.
# $PARALLEL_SEQ is a special variable from parallel. It is the actual sequence
# number of the job regardless of the arguments given.
# We output the sleep time, hostname, and date for more info.
echo task $1 seq:$PARALLEL_SEQ sleep:$sleepsecs host:$(hostname) date:$(date)

# sleep a random amount of time
sleep $sleepsecs
So, one would simply submit the job script as per normal:
$ sbatch serial_batch.sh
And we then see output in the Slurm job output file like this:
...
task arg1:34 seq:34 sleep:11s host:ccs0132 date:Fri 29 Jun 09:37:26 BST 2018
srun: Exclusive so allocate job details
task arg1:38 seq:38 sleep:12s host:ccs0132 date:Fri 29 Jun 09:37:27 BST 2018
srun: Exclusive so allocate job details
task arg1:45 seq:45 sleep:11s host:ccs0132 date:Fri 29 Jun 09:37:29 BST 2018
srun: Exclusive so allocate job details
task arg1:41 seq:41 sleep:12s host:ccs0132 date:Fri 29 Jun 09:37:28 BST 2018
srun: Exclusive so allocate job details
task arg1:47 seq:47 sleep:11s host:ccs0132 date:Fri 29 Jun 09:37:29 BST 2018
srun: Exclusive so allocate job details
...
The parallel job log also records completed tasks:
Seq Host Starttime       JobRuntime Send Receive Exitval Signal Command
8   :    1530261102.040  11.088     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:8
9   :    1530261102.248  11.088     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:9
5   :    1530261101.385  12.088     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:5
12  :    1530261102.897  12.105     0    77      0       0      srun -n1 -N1 --exclusive ./runtask arg1:12
1   :    1530261100.475  17.082     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:1
2   :    1530261100.695  17.091     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:2
3   :    1530261100.926  17.088     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:3
10  :    1530261102.450  16.088     0    77      0       0      srun -n1 -N1 --exclusive ./runtask arg1:10
6   :    1530261101.589  17.082     0    75      0       0      srun -n1 -N1 --exclusive ./runtask arg1:6
...
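Because the submission script passes --joblog and --resume to parallel, a run that hits the wall-time limit (or is otherwise interrupted) can simply be resubmitted: parallel reads the job log, skips the tasks already recorded there, and runs only the remainder. For example:

$ sbatch serial_batch.sh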
So, by tweaking a few simple commands in the job script and having a ‘runtask’ script that does something useful, we can accomplish a neat, efficient serial batch system.
Multi-Threaded Tasks
The above technique and scripts can, with a very small modification, also run multi-threaded or otherwise intra-node parallel tasks. We achieve this by changing the SBATCH directive specifying the processor requirement (#SBATCH -n …) in the submission script to the following form:
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=4
In this case, parallel will launch across 3 nodes, running 3 tasks of 4 processors each per node (9 concurrent tasks on 36 cores in total).
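The srun and parallel definitions in the submission script would then typically be adjusted so each job step receives its 4 CPUs. The following is a sketch only, assuming the tasks are OpenMP-threaded; if SLURM_NTASKS is not set for your allocation, replace it with the explicit task count (9 here):

# Define srun arguments: each job step now gets the allocated CPUs per task rather than 1
srun="srun -n1 -N1 --cpus-per-task=$SLURM_CPUS_PER_TASK --exclusive"

# tell each task how many threads it may use (assumes OpenMP tasks)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Define parallel arguments: with this allocation SLURM_NTASKS should be 9
# (3 nodes x 3 tasks per node), so at most 9 tasks run concurrently
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_joblog --resume"

# Run the tasks as before:
$parallel "$srun ./runtask arg1:{1}" ::: {1..64}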