Advanced Use: Interactive, X Forwarding, Job Arrays, Task Geometry, Parallel Batch Submission
Interactive Use
In order to use an HPC system interactively – i.e. sitting at the terminal and working live with the allocated resources – there is a simple two-stage process in Slurm.
Firstly, we must create an allocation – that is, a reservation of the specific amount of resources we need. This is done using the salloc command, like this:
[test.user@cstl001 imb]$ salloc -n 8 --ntasks-per-node=1
salloc: Granted job allocation 134
Now that an allocation has been granted, we have access to those specified resources. Note that the resource specification takes exactly the same parameters as for batch use – so in this case we have asked for 8 tasks (processes), distributed at one per node.
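If we want to confirm which nodes the allocation covers before running anything, we can inspect the environment variables salloc exports into the new shell (the output shown is illustrative):

[test.user@cstl001 imb]$ echo $SLURM_JOB_ID $SLURM_JOB_NODELIST
134 cst[001-008]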
Now that we are ‘inside’ an allocation, we can use the srun command to execute against the allocated resources, for example:
[test.user@cstl001 imb]$ srun hostname
cst004
cst003
cst002
cst008
cst006
cst007
cst001
cst005
The above output shows how, by default, srun executes a command on all allocated processors. Arguments can be passed to srun to operate differently, for example:
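A minimal sketch, restricting the step to two of the allocated tasks (exactly which hosts respond, and the hostnames shown, are illustrative):

[test.user@cstl001 imb]$ srun -n 2 hostname
cst001
cst002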
We could also launch an MPI job here if we wished. We would load the software modules as we do in a batch script and call mpirun in the same way. This can be useful during code debugging.
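A sketch of what that might look like inside the allocation (the module name and the mpi_test binary are the same illustrative ones used in the task geometry example further down):

[test.user@cstl001 imb]$ module purge
[test.user@cstl001 imb]$ module load mpi/intel/5.1
[test.user@cstl001 imb]$ mpirun -n 8 ./mpi_test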
It is also possible to use srun to launch an interactive shell process for some heavy processing on a compute node, for example:
srun -n 2 --pty bash
This would move us to a shell on a compute node.
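When the interactive work is complete, exiting the shell ends the job step, and exiting the salloc session releases the allocation (prompts and job ID are illustrative):

[test.user@cst001 imb]$ exit
[test.user@cstl001 imb]$ exit
salloc: Relinquishing job allocation 134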
X11 Forwarding Interactively
Once a resource allocation is granted as per the above, we can use srun to provide X11 graphical forwarding all the way from the compute nodes to our desktop using srun --x11 <application>.
For example, to run an X terminal:
srun --x11 xterm
Note that the user must have X11 forwarded to the login node for this to work – this can be checked by running xclock at the command line.
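Typically this means connecting to the login node with X11 forwarding enabled, for example (the address is a placeholder for your login node):

ssh -X test.user@<login-node-address>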
Additionally, the --x11 argument can be augmented as --x11=[batch|first|last|all], with the following effects:
- --x11=first This is the default, and provides X11 forwarding to the first of the compute hosts allocated.
- --x11=last This provides X11 forwarding to the last of the compute hosts allocated.
- --x11=all This provides X11 forwarding from all allocated compute hosts, which can be quite resource heavy and is an extremely rare use-case.
- --x11=batch This supports use in a batch job submission (see the sketch below), and will provide X11 forwarding to the first node allocated to a batch job. The user must leave open the X11-forwarded login node session from which they submitted the job.
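A minimal sketch of the batch case, assuming a hypothetical X11-capable application named my_gui_app on the PATH:

#!/bin/bash --login
#SBATCH -n 1
#SBATCH -o xgui.%J.out

# my_gui_app is a placeholder for whatever graphical application is needed
srun --x11=batch my_gui_app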
Job Arrays
Submission
Job arrays operate in Slurm much as they do in other batch systems. They enable a potentially huge number of similar jobs to be launched very quickly and simply, with the value of a runtime-assigned array index then being used to vary slightly what each particular job iteration does. Array jobs are declared using the --array argument to sbatch, which can (as with all arguments to sbatch) be given inside a job script as an #SBATCH declaration or passed as a direct argument to sbatch. There are a number of ways to declare an array:
[test.user@cstl001 hello_world]$ sbatch --array=0-64 sbatch_sub.sh
…declares an array with iteration indexes from 0 to 64.
[test.user@cstl001 hello_world]$ sbatch --array=0,4,8,12 sbatch_sub.sh
…declares an array with iteration indexes specifically identified as 0, 4, 8 and 12.
[test.user@cstl001 hello_world]$ sbatch --array=0-12:3 sbatch_sub.sh
…declares an array with iteration indexes from 0 to 12 with a stepping of 3, i.e. 0,3,6,9,12
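As an illustration of how the index is typically consumed inside the script, here is a minimal sketch in which each iteration processes a different input file (the my_program executable and the input file naming scheme are hypothetical):

#!/bin/bash --login
#SBATCH -J array_demo
#SBATCH --array=0-12:3
#SBATCH -n 1

# each iteration selects its own (hypothetical) input file via the array index
./my_program input_${SLURM_ARRAY_TASK_ID}.dat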
Monitoring
When a job array is running, the output of squeue shows the parent task and the currently running iteration indexes:
[test.user@cstl001 hello_world]$ squeue
      JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
 143_[6-64]       all  hello test.use PD  0:00     4 (Resources)
      143_4       all  hello test.use  R  0:00     4 cst[005-008]
      143_5       all  hello test.use  R  0:00     4 cst[005-008]
      143_0       all  hello test.use  R  0:03     4 cst[001-004]
      143_1       all  hello test.use  R  0:03     4 cst[001-004]
[test.user@cstl001 hello_world]$ squeue
      JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
143_[15-64]       all  hello test.use PD  0:00     4 (Resources)
     143_14       all  hello test.use  R  0:00     4 cst[001-004]
     143_10       all  hello test.use  R  0:02     4 cst[005-008]
     143_11       all  hello test.use  R  0:02     4 cst[005-008]
      143_1       all  hello test.use  R  0:07     4 cst[001-004]
IDs and Variables
Each iteration in an array assumes its own job ID in Slurm. In addition to SLURM_JOB_ID, which stores the particular iteration's own job ID, Slurm creates two new environment variables that can be used in the script.
SLURM_ARRAY_JOB_ID stores the value of the parent job submission – i.e. the ID reported in the output from sbatch when submitted.
SLURM_ARRAY_TASK_ID stores the value of the array index.
Additionally, when specifying a job’s STDOUT and STDERR files using the -o and -e directives to sbatch, the reference %A will take on the parent job ID and the reference %a will take on the iteration index. In summary:
BASH Environment Variable | SBATCH Field Code | Description |
---|---|---|
$SLURM_JOB_ID | %J | Job identifier |
$SLURM_ARRAY_JOB_ID | %A | Array parent job identifier |
$SLURM_ARRAY_TASK_ID | %a | Array job iteration index |
$SLURM_ARRAY_TASK_COUNT | | Number of indexes (tasks) in the job array |
$SLURM_ARRAY_TASK_MAX | | Maximum array index |
$SLURM_ARRAY_TASK_MIN | | Minimum array index |
And so, with this example script:
#!/bin/bash
#SBATCH -J arraytest
#SBATCH --array=0-4
#SBATCH -o output-%A_%a-%J.o
#SBATCH -n 1

echo SLURM_JOB_ID $SLURM_JOB_ID
echo SLURM_ARRAY_JOB_ID $SLURM_ARRAY_JOB_ID
echo SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_ID
We can submit the script:
[test.user@cstl001 sbatch]$ sbatch array.sh
Submitted batch job 231
Resulting in the following output files:
output-231_0-232.o
output-231_1-233.o
output-231_2-234.o
output-231_3-235.o
output-231_4-231.o
Each of these contained variables as follows:
output-231_0-232.o: SLURM_JOB_ID 232 SLURM_ARRAY_JOB_ID 231 SLURM_ARRAY_TASK_ID 0
output-231_1-233.o: SLURM_JOB_ID 233 SLURM_ARRAY_JOB_ID 231 SLURM_ARRAY_TASK_ID 1
output-231_2-234.o: SLURM_JOB_ID 234 SLURM_ARRAY_JOB_ID 231 SLURM_ARRAY_TASK_ID 2
output-231_3-235.o: SLURM_JOB_ID 235 SLURM_ARRAY_JOB_ID 231 SLURM_ARRAY_TASK_ID 3
output-231_4-231.o: SLURM_JOB_ID 231 SLURM_ARRAY_JOB_ID 231 SLURM_ARRAY_TASK_ID 4
More advanced job array information is available in the Slurm documentation.
Task Geometry
If you need to run an MPI (or hybrid MPI/OpenMP) job that requires a custom task geometry, perhaps because one task requires a larger amount of memory than the others, then this can easily be achieved with Slurm.
To do this, rather than specifying the number of processors required, one can specify the number of nodes (#SBATCH --nodes=X) plus the number of tasks per node (#SBATCH --ntasks-per-node=X). The geometry can then be defined via the SLURM_TASKS_PER_NODE environment variable at runtime. As long as there are enough nodes to match the geometry, Slurm will allocate parallel tasks to the MPI runtime following the geometry specification.
For example:
#!/bin/bash --login
#SBATCH --job-name geom_test
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 16
#SBATCH --time 00:10:00
#SBATCH --output geom_test.%J.out

module purge
module load mpi/intel/5.1

export SLURM_TASKS_PER_NODE='1,16(x2),6'
mpirun ./mpi_test
In this case, we are requesting 4 nodes and all 16 processors on each of those nodes. Therefore, a maximum job size of 64 parallel tasks (to match the number of allocated processors) would apply. However, we override the SLURM_TASKS_PER_NODE environment variable to place just a single task on the first node, fill the next two allocated nodes, and place just six parallel tasks on the final allocated node – so, in this case, a total of 1+16+16+6=39 parallel processes. mpirun will automatically pick this up from the Slurm-allocated runtime environment.
Parallel Batch Submission of Serial Jobs
Large numbers of serial jobs can become incredibly inefficient and troublesome on mixed-mode HPC systems. The HPCW Slurm deployment limits the number of running & submitted jobs any single user may have, in contrast to the unlimited submission possible under the previous deployment of LSF.
However, combining GNU Parallel and Slurm's srun command allows us to handle such situations in a more controlled and efficient way than in the past. Using this method, a single job is submitted that requests an allocation of X cores, and the GNU parallel command enables us to utilise all of those cores by launching the serial tasks using the srun command.
Here is the example, commented, job submission file serial_batch.sh:
#!/bin/bash --login
#SBATCH -n 12                 #Number of processors in our pool
#SBATCH -o output.%J          #Job output
#SBATCH -t 12:00:00           #Max wall time for entire job

module purge
module load parallel

# Define srun arguments:
srun="srun -n1 -N1 --exclusive"
# --exclusive  ensures srun uses distinct CPUs for each job step
# -N1 -n1      allocates a single core to each task

# Define parallel arguments:
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_joblog --resume"
# -N 1               is number of arguments to pass to each job
# --delay .2         prevents overloading the controlling node on short jobs
# -j $SLURM_NTASKS   is the number of concurrent tasks parallel runs, so number of CPUs allocated
# --joblog name      parallel's log file of tasks it has run
# --resume           parallel can use a joblog and this to continue an interrupted run (job resubmitted)

# Run the tasks:
$parallel "$srun ./runtask arg1:{1}" ::: {1..32}
# in this case, we are running a script named runtask, and passing it a single argument
# {1} is the first argument
# parallel uses ::: to separate options. Here {1..32} is a shell expansion defining the values for
# the first argument, but could be any shell command
#
# so parallel will run the runtask script for the numbers 1 through 32, with a max of 12 running
# at any one time
#
# as an example, the first job will be run like this:
# srun -N1 -n1 --exclusive ./runtask arg1:1
So, in the above we are requesting an allocation from Slurm of 12 processors, but we have 32 tasks to run. Parallel will execute the jobs as soon as space in our allocation becomes available (i.e. as tasks finish). As this does not have the overhead of setting up a new full job for each task, it is more efficient.
A simple ‘runtask’ script that demonstrates the principle by logging helpful text is included here, courtesy of the University of Chicago Research Computing Centre:
#!/bin/sh
# this script echoes some useful output so we can see what parallel
# and srun are doing
sleepsecs=$[ ( $RANDOM % 10 ) + 10 ]s

# $1 is arg1:{1} from parallel.
# $PARALLEL_SEQ is a special variable from parallel. It is the actual sequence
# number of the job regardless of the arguments given
# We output the sleep time, hostname, and date for more info
echo task $1 seq:$PARALLEL_SEQ sleep:$sleepsecs host:$(hostname) date:$(date)

# sleep a random amount of time
sleep $sleepsecs
So, one would simply submit the job script as per normal:
$ sbatch serial_batch.sh
And we then see output in the Slurm job output file like this:
...
task arg1:9 seq:9 sleep:10s host:bwc048 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:7 seq:7 sleep:11s host:bwc047 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:10 seq:10 sleep:11s host:bwc048 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:8 seq:8 sleep:14s host:bwc047 date:Thu Jan 28 17:28:14 GMT 2016
...
Also the parallel job log records completed tasks:
Seq Host Starttime      JobRuntime Send Receive Exitval Signal Command
9   :    1454002094.231     10.588    0      74       0      0 srun -n1 -N1 --exclusive ./runtask arg1:9
7   :    1454002093.809     11.602    0      74       0      0 srun -n1 -N1 --exclusive ./runtask arg1:7
10  :    1454002094.435     11.384    0      76       0      0 srun -n1 -N1 --exclusive ./runtask arg1:10
8   :    1454002094.023     14.388    0      74       0      0 srun -n1 -N1 --exclusive ./runtask arg1:8
...
So, by tweaking a few simple commands in the job script and having a ‘runtask’ script that does something useful, we can accomplish a neat, efficient serial batch system.