Advanced Use: Interactive, X Forwarding, Job Arrays, Task Geometry, Parallel Batch Submission

Interactive Use

To use an HPC system interactively, i.e. whilst sat at a terminal interacting live with allocated resources, Slurm provides a simple two-stage process.

First, we must create an allocation, that is, reserve the specific amount of resources we need. This is done using the salloc command, like this:

[test.user@cstl001 imb]$ salloc -n 8 --ntasks-per-node=1
 salloc: Granted job allocation 134

Now that an allocation has been granted, we have access to the specified resources. Note that the resource specification takes exactly the same parameters as for batch use, so in this case we have asked for 8 tasks (processes), distributed at one per node.

Now that we are ‘inside’ an allocation, we can use the srun command to execute against the allocated resources, for example:

[test.user@cstl001 imb]$ srun hostname
cst004
cst003
cst002
cst008
cst006
cst007
cst001
cst005

The above output shows how, by default, srun executes a command once for every allocated task. Arguments can be passed to srun to make it operate differently.

We could also launch an MPI job here if we wished. We would load the software modules as we do in a batch script and call mpirun in the same way. This can be useful during code debugging.

It is also possible to use srun to launch an interactive shell process for some heavy processing on a compute node, for example:

srun -n 2 --pty bash

This would move us to a shell on a compute node.

X11 Forwarding Interactively

Once a resource allocation is granted as per the above, we can use srun to provide X11 graphical forwarding all the way from the compute nodes to our desktop using srun --x11 <application>.

For example, to run an X terminal:

srun --x11 xterm

Note that the user must have X11 forwarded to the login node for this to work – this can be checked by running xclock at the command line.

Additionally, the --x11 argument can be augmented in the form --x11=[batch|first|last|all], with the following effects:

--x11=first This is the default, and provides X11 forwarding to the first of the compute hosts allocated.
--x11=last This provides X11 forwarding to the last of the compute hosts allocated.
--x11=all This provides X11 forwarding from all allocated compute hosts, which can be quite resource heavy and is an extremely rare use-case.
--x11=batch This supports use in a batch job submission, and will provide X11 forwarding to the first node allocated to a batch job. The user must leave open the X11 forwarded login node session where they submitted the job.
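
As a quick non-graphical complement to the xclock check mentioned above, the following sketch tests whether the DISPLAY variable is set in the login node session before srun --x11 is attempted. The function name x11_ready is hypothetical, and a set DISPLAY is only a necessary condition for working forwarding, not a guarantee of it.

```shell
#!/bin/bash
# x11_ready is a hypothetical helper, not part of Slurm.
# It only checks that DISPLAY is set in the current session; it does not
# prove that the X11 connection itself works (xclock remains the real test).
x11_ready() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set: log in to the login node again with X11 forwarding (e.g. ssh -X)" >&2
        return 1
    fi
    echo "DISPLAY is set to $DISPLAY"
}

# Report the state of the current session. Inside an allocation one might
# instead write: x11_ready && srun --x11 xterm
x11_ready || true
```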

Job Arrays

Submission

Job arrays operate in Slurm much as they do in other batch systems. They enable a potentially huge number of similar jobs to be launched very quickly and simply, with the value of a runtime-assigned array ID then being used to make each job iteration vary slightly in what it does. Array jobs are declared using the --array argument to sbatch, which can (as with all arguments to sbatch) be placed inside a job script as an #SBATCH declaration or passed as a direct argument to sbatch. There are a number of ways to declare an array:

[test.user@cstl001 hello_world]$ sbatch --array=0-64 sbatch_sub.sh

…declares an array with iteration indexes from 0 to 64.

[test.user@cstl001 hello_world]$ sbatch --array=0,4,8,12 sbatch_sub.sh

…declares an array with iteration indexes specifically identified as 0, 4, 8 and 12.

[test.user@cstl001 hello_world]$ sbatch --array=0-12:3 sbatch_sub.sh

…declares an array with iteration indexes from 0 to 12 with a stepping of 3, i.e. 0,3,6,9,12
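
To check which iteration indexes a given specification would produce before submitting, the three forms above can be expanded with a small shell function. This is only a sketch: expand_array_spec is a hypothetical helper, not a Slurm command, and it handles only ranges, comma-separated lists and the :step suffix (not, for instance, a % concurrency limit).

```shell
#!/bin/bash
# expand_array_spec is a hypothetical helper, not part of Slurm.
# It prints the indexes described by an --array style specification.
expand_array_spec() {
    local spec="$1" out="" part range step first last i old_ifs
    old_ifs=$IFS
    IFS=,
    set -- $spec            # split the specification on commas
    IFS=$old_ifs
    for part in "$@"; do
        step=1
        range=$part
        case $part in
            *:*) step=${part##*:}; range=${part%%:*} ;;   # "0-12:3" -> range 0-12, step 3
        esac
        first=${range%%-*}   # a single index such as "4" gives first=last=4
        last=${range##*-}
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out${out:+ }$i"
            i=$((i + step))
        done
    done
    echo "$out"
}

expand_array_spec 0-12:3
expand_array_spec 0,4,8,12
```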

Monitoring

When a job array is running, the output of squeue shows the parent task and the currently running iteration indexes:

[test.user@cstl001 hello_world]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        143_[6-64]       all    hello test.use PD       0:00      4 (Resources)
             143_4       all    hello test.use  R       0:00      4 cst[005-008]
             143_5       all    hello test.use  R       0:00      4 cst[005-008]
             143_0       all    hello test.use  R       0:03      4 cst[001-004]
             143_1       all    hello test.use  R       0:03      4 cst[001-004]
[test.user@cstl001 hello_world]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       143_[15-64]       all    hello test.use PD       0:00      4 (Resources)
            143_14       all    hello test.use  R       0:00      4 cst[001-004]
            143_10       all    hello test.use  R       0:02      4 cst[005-008]
            143_11       all    hello test.use  R       0:02      4 cst[005-008]
             143_1       all    hello test.use  R       0:07      4 cst[001-004]

IDs and Variables

Each iteration in an array assumes its own job ID in Slurm, which is stored as usual in SLURM_JOB_ID. In addition, Slurm creates two new environment variables that can be used in the script:

SLURM_ARRAY_JOB_ID stores the value of the parent job submission – i.e. the ID reported in the output from sbatch when submitted.

SLURM_ARRAY_TASK_ID stores the value of the array index.

Additionally, when specifying a job’s STDOUT and STDERR files using the -o and -e directives to sbatch, the reference %A will take on the parent job ID and the reference %a will take on the iteration index. In summary:

BASH Environment Variable	SBATCH Field Code	Description
$SLURM_JOB_ID	%J	Job identifier
$SLURM_ARRAY_JOB_ID	%A	Array parent job identifier
$SLURM_ARRAY_TASK_ID	%a	Array job iteration index
$SLURM_ARRAY_TASK_COUNT		Number of indexes (tasks) in the job array
$SLURM_ARRAY_TASK_MAX		Maximum array index
$SLURM_ARRAY_TASK_MIN		Minimum array index

And so, with this example script:

#!/bin/bash

#SBATCH -J arraytest
#SBATCH --array=0-4
#SBATCH -o output-%A_%a-%J.o
#SBATCH -n 1

echo SLURM_JOB_ID $SLURM_JOB_ID
echo SLURM_ARRAY_JOB_ID $SLURM_ARRAY_JOB_ID
echo SLURM_ARRAY_TASK_ID $SLURM_ARRAY_TASK_ID

We can submit the script:

[test.user@cstl001 sbatch]$ sbatch array.sh 
Submitted batch job 231

Resulting in the following output files:

output-231_0-232.o
output-231_1-233.o
output-231_2-234.o
output-231_3-235.o
output-231_4-231.o

Each iteration of which contained variables as follows:

output-231_0-232.o:
SLURM_JOB_ID 232
SLURM_ARRAY_JOB_ID 231
SLURM_ARRAY_TASK_ID 0

output-231_1-233.o:
SLURM_JOB_ID 233
SLURM_ARRAY_JOB_ID 231
SLURM_ARRAY_TASK_ID 1

output-231_2-234.o:
SLURM_JOB_ID 234
SLURM_ARRAY_JOB_ID 231
SLURM_ARRAY_TASK_ID 2

output-231_3-235.o:
SLURM_JOB_ID 235
SLURM_ARRAY_JOB_ID 231
SLURM_ARRAY_TASK_ID 3

output-231_4-231.o:
SLURM_JOB_ID 231
SLURM_ARRAY_JOB_ID 231
SLURM_ARRAY_TASK_ID 4
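
A typical way to make each iteration "vary slightly what it does" is to use SLURM_ARRAY_TASK_ID to pick one entry from a list of inputs. The sketch below assumes a five-entry list to match --array=0-4; the parameter names and the select_param function are hypothetical, and the fallback to index 0 exists only so the script can be tried outside Slurm.

```shell
#!/bin/bash
# select_param is a hypothetical helper: it prints the entry of a
# (made-up) parameter list that corresponds to a given array index.
select_param() {
    local index="$1"
    # Array indexes start at 0, sed line numbers start at 1.
    printf '%s\n' alpha beta gamma delta epsilon | sed -n "$((index + 1))p"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 0 elsewhere.
task_id=${SLURM_ARRAY_TASK_ID:-0}
echo "iteration $task_id uses parameter $(select_param "$task_id")"
```

In practice the list would more likely live in a file with one line per iteration, read with the same sed expression.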

More advanced job array information is available in the official Slurm job array documentation.

Task Geometry

If you need to run an MPI (/OpenMP) task that requires a custom task geometry, perhaps because one task requires a larger amount of memory than the others, then this can easily be achieved with Slurm.

To do this, rather than specifying the number of processors required, one can specify the number of nodes (#SBATCH --nodes=X) plus the number of tasks per node (#SBATCH --ntasks-per-node=X). The geometry can then be assigned to the SLURM_TASKS_PER_NODE environment variable at runtime. As long as there are enough nodes to match the geometry, Slurm will allocate parallel tasks to the MPI runtime following the geometry specification.

For example:

#!/bin/bash --login

#SBATCH --job-name geom_test
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 16
#SBATCH --time 00:10:00
#SBATCH --output geom_test.%J.out

module purge
module load mpi/intel/5.1

export SLURM_TASKS_PER_NODE='1,16(x2),6'
mpirun ./mpi_test

In this case, we are requesting 4 nodes and (all) 16 processors on those 4 nodes. Therefore, a maximum job size of 64 parallel tasks (to match the number of allocated processors) would apply. However, we override the SLURM_TASKS_PER_NODE environment variable to be just a single task on the first node, then fill the next two allocated nodes, and then place just six parallel tasks on the final allocated node. So, in this case, a total of 1+16+16+6=39 parallel processes. ‘mpirun’ will automatically pick this up from the Slurm allocated runtime environment.
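
The arithmetic above can be checked for any geometry string with a short shell function. This is a sketch only: count_tasks is a hypothetical helper, and it assumes the compact "count(xrepeat)" notation used in the example.

```shell
#!/bin/bash
# count_tasks is a hypothetical helper, not part of Slurm.
# It sums the tasks described by a SLURM_TASKS_PER_NODE style string,
# where "16(x2)" means 16 tasks on each of 2 consecutive nodes.
count_tasks() {
    local spec="$1" total=0 entry count reps old_ifs
    old_ifs=$IFS
    IFS=,
    set -- $spec            # split the geometry on commas
    IFS=$old_ifs
    for entry in "$@"; do
        count=${entry%%\(*}  # tasks per node: text before any "("
        reps=1
        case $entry in
            *"(x"*) reps=${entry##*\(x}; reps=${reps%\)} ;;   # repeat count
        esac
        total=$((total + count * reps))
    done
    echo "$total"
}

count_tasks '1,16(x2),6'
```

Comparing the result with nodes multiplied by tasks per node (here 4 x 16 = 64) is a simple way to confirm that a custom geometry fits inside the allocation.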

Parallel Batch Submission of Serial Jobs

Large numbers of serial jobs can become incredibly inefficient and troublesome on mixed-mode HPC systems. The HPCW Slurm deployment limits the number of running & submitted jobs any single user may have, in contrast to the unlimited submission possible under the previous deployment of LSF.

However, combining GNU Parallel and Slurm’s srun command allows us to handle such situations in a more controlled and efficient way than in the past. Using this method, a single job is submitted that requests an allocation of X cores, and the GNU parallel command enables us to utilise all of those cores by launching the serial tasks using the srun command.

Here is the example, commented, job submission file serial_batch.sh:

#!/bin/bash --login
#SBATCH -n 12                     #Number of processors in our pool
#SBATCH -o output.%J              #Job output
#SBATCH -t 12:00:00               #Max wall time for entire job

module purge
module load parallel

# Define srun arguments:
srun="srun -n1 -N1 --exclusive"
# --exclusive     ensures srun uses distinct CPUs for each job step
# -N1 -n1         allocates a single core to each task

# Define parallel arguments:
parallel="parallel -N 1 --delay .2 -j $SLURM_NTASKS --joblog parallel_joblog --resume"
# -N 1              is number of arguments to pass to each job
# --delay .2        prevents overloading the controlling node on short jobs
# -j $SLURM_NTASKS  is the number of concurrent tasks parallel runs, so number of CPUs allocated
# --joblog name     parallel's log file of tasks it has run
# --resume          parallel can use a joblog and this to continue an interrupted run (job resubmitted)

# Run the tasks:
$parallel "$srun ./runtask arg1:{1}" ::: {1..32}
# in this case, we are running a script named runtask, and passing it a single argument
# {1} is the first argument
# parallel uses ::: to separate the command from its input values. Here {1..32} is a shell expansion defining the values for the first argument, but it could be the output of any shell command
#
# so parallel will run the runtask script for the numbers 1 through 32, with a max of 12 running at any one time
#
# as an example, the first job will be run like this:
# srun -n1 -N1 --exclusive ./runtask arg1:1

So, in the above we are requesting an allocation from Slurm of 12 processors, but we have 32 tasks to run. Parallel will execute the jobs as soon as space on our allocation becomes available (i.e. tasks finish). As this does not have the overhead of setting up a new full job, it is more efficient.
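
To see exactly which command lines are built from the {1} placeholder, the substitution can be imitated with a plain loop that prints the commands without running anything. This is an illustrative sketch, with preview_commands as a hypothetical helper; where GNU Parallel is installed, its own --dry-run option prints the commands it would run.

```shell
#!/bin/bash
# Same srun prefix as in serial_batch.sh.
srun="srun -n1 -N1 --exclusive"

# preview_commands is a hypothetical helper: it prints, but does not run,
# one command line per argument value from first to last.
preview_commands() {
    local first="$1" last="$2" i
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$srun ./runtask arg1:$i"
        i=$((i + 1))
    done
}

# Show the first three of the 32 command lines.
preview_commands 1 3
```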

A simple ‘runtask’ script that demonstrates the principle by logging helpful text is included here, courtesy of the University of Chicago Research Computing Centre:

#!/bin/bash

# this script echoes some useful output so we can see what parallel
# and srun are doing

# pick a random sleep time between 10 and 19 seconds
sleepsecs=$(( (RANDOM % 10) + 10 ))s

# $1 is arg1:{1} from parallel.
# $PARALLEL_SEQ is a special variable from parallel. It is the actual sequence
# number of the job regardless of the arguments given
# We output the sleep time, hostname, and date for more info
echo task "$1" seq:"$PARALLEL_SEQ" sleep:"$sleepsecs" host:"$(hostname)" date:"$(date)"

# sleep a random amount of time
sleep "$sleepsecs"
So, one would simply submit the job script as per normal:

$ sbatch serial_batch.sh

And we then see output in the Slurm job output file like this:

...
task arg1:9 seq:9 sleep:10s host:bwc048 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:7 seq:7 sleep:11s host:bwc047 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:10 seq:10 sleep:11s host:bwc048 date:Thu Jan 28 17:28:14 GMT 2016
task arg1:8 seq:8 sleep:14s host:bwc047 date:Thu Jan 28 17:28:14 GMT 2016
...

Also the parallel job log records completed tasks:

Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
9       :       1454002094.231      10.588      0       74      0       0       srun -n1 -N1 --exclusive ./runtask arg1:9
7       :       1454002093.809      11.602      0       74      0       0       srun -n1 -N1 --exclusive ./runtask arg1:7
10      :       1454002094.435      11.384      0       76      0       0       srun -n1 -N1 --exclusive ./runtask arg1:10
8       :       1454002094.023      14.388      0       74      0       0       srun -n1 -N1 --exclusive ./runtask arg1:8
...
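
Because the job log is a tab-separated table, it is easy to query after a run. The sketch below lists the sequence numbers of tasks that exited with a non-zero status or were killed by a signal. The failed_seqs function is a hypothetical helper, and the demonstration log contains one invented failing row (Seq 11) purely for illustration; the first row is copied from the log above.

```shell
#!/bin/bash
# failed_seqs is a hypothetical helper: it prints the Seq column of every
# joblog row whose Exitval (column 7) or Signal (column 8) is non-zero.
failed_seqs() {
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $1 }' "$1"
}

# Build a small demonstration log. The Seq 11 row is invented for this example.
demo_log=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$demo_log"
printf '9\t:\t1454002094.231\t10.588\t0\t74\t0\t0\tsrun -n1 -N1 --exclusive ./runtask arg1:9\n' >> "$demo_log"
printf '11\t:\t1454002105.000\t12.000\t0\t0\t1\t0\tsrun -n1 -N1 --exclusive ./runtask arg1:11\n' >> "$demo_log"

failed_seqs "$demo_log"
rm -f "$demo_log"
```

On a real run the function would be pointed at the parallel_joblog file named in the job script.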

So, by tweaking a few simple commands in the job script and having a ‘runtask’ script that does something useful, we can accomplish a neat, efficient serial batch system.