Submitting, Monitoring & Cancelling Jobs

Submitting a Job

Running jobs on SCW clusters with Slurm is straightforward. You need a job script, which first defines the resource requirements for the job and then lists the commands to run. The resource requirements and other job parameters are given on lines beginning with a #SBATCH directive, so they are treated as comments by everything except the Slurm scheduler. The commands themselves are written just as you would type them at the command line.

For example, a simple script might look like:

#!/bin/bash --login
###
# job name
#SBATCH --job-name=imb_bench
# job stdout file
#SBATCH --output=bench.out.%J
# job stderr file
#SBATCH --error=bench.err.%J
# maximum job runtime in D-HH:MM
#SBATCH --time=0-00:20
# number of parallel processes (tasks) you are requesting - maps to MPI processes
#SBATCH --ntasks=80
# memory per process in MB
#SBATCH --mem-per-cpu=4000
# tasks to run per node (change for hybrid OpenMP/MPI)
#SBATCH --ntasks-per-node=40
###

#now run normal batch commands 
module load compiler/intel mpi/intel

#run Intel MPI Benchmarks with mpirun - will automatically pick up Slurm parallel environment
mpirun $MPI_HOME/intel64/bin/IMB-MPI1

The directives to Slurm are clear and self-descriptive. Of particular note is the memory specification: Slurm is very good at scheduling around, and subsequently enforcing, job memory usage. Too low a memory request can cause a job to crash or be cancelled, while too high a value can leave a job queued for longer than necessary.
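If per-process memory is awkward to estimate, sbatch also accepts a per-node request. A minimal sketch of the two standard (and mutually exclusive) styles:

```shell
# Request memory per node rather than per process (use one style, not both):
#SBATCH --mem=8000          # total MB per node
# or:
#SBATCH --mem-per-cpu=4000  # MB per allocated CPU/task
```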

Once this is saved in a file, say called bench.sh, running the job is as simple as:

sbatch bench.sh

Slurm will return a job number, which can be used to track, account & cancel the job.
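When submitting from a script, the job number can be captured directly. A sketch using sbatch's --parsable flag (which prints only the job ID) and a hypothetical follow-up script, postprocess.sh:

```shell
# Submit and capture just the numeric job ID:
jobid=$(sbatch --parsable bench.sh)

# Submit a dependent job that starts only if the first finishes successfully
# (postprocess.sh is a hypothetical example script):
sbatch --dependency=afterok:$jobid postprocess.sh
```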

Monitoring Jobs

To see your current submitted & running jobs, we use the command squeue.

For example:

[test.user@cl1 imb]$ squeue 
                   JOBID PARTITION     NAME      USER ST TIME NODES NODELIST(REASON)
                   109     compute imb_benc test.user  R  0:49    2 ccs[0121-0122]
                   110     compute imb_benc test.user  R  3:29    8 cst[001-008]
                   113     compute imb_benc test.use  PD  0:00    8 (Resources)

In this case, there are three jobs present, two are running (109 and 110) and one is queued/pending (113) awaiting resources.
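On a busy system it can help to restrict squeue to your own jobs and choose the columns shown. A sketch using standard squeue format codes:

```shell
# Show only your own jobs, with job ID, partition, name, state,
# elapsed time, node count and node list/pending reason:
squeue -u $USER --format="%.10i %.9P %.12j %.2t %.10M %.6D %R"
```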

When will my job run?

SCW systems deploy a Fair Share scheduling policy to share resources fairly between different users and groups. On the previous LSF-based systems it was possible to view all queued and running jobs from all users, and thus get a ‘feel’ for how busy the systems were and when a job might run. This is no longer possible with Slurm, as job data is private, so one cannot get such a view of total system state.

However, Slurm is very good at planning the future state of the system by assessing user fair shares, job time limits and node availability – and so it can predict job start times with a good degree of accuracy. There are a number of ways to view this information, but we have installed an additional tool called ‘slurmtop’, which will present a view of your currently queued and running jobs, including a countdown to start for queued jobs.

Example ‘slurmtop’ output:

[screenshot: example ‘slurmtop’ output]

We see in the ‘slurmtop’ output:

  • the first line shows the current status of our usage of the cluster.
  • the second line shows the overall status of the compute nodes of the cluster.
  • the grid shows – with one character per processor – the compute nodes of the cluster, and will locate our jobs when they are running.
  • the job list shows our queued and running jobs, plus any that have completed within the last five minutes. Note the ‘Elapsed’ time column, which shows a negative ‘count-down’ to start time for queued jobs.

It’s worth noting that predicted start times are somewhat fluid, both because of fair share and because jobs typically do not run to their maximum configured duration; nevertheless, this is a good indication of system and job status.
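Slurm itself can also report its current estimate of when pending jobs will start, via squeue's --start option:

```shell
# Show estimated start times for your pending jobs:
squeue -u $USER --start
```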

 

Viewing Running Jobs

To get more detailed information on a running job, one can use sstat <jobid>.

By default this gives a verbose set of data. A more succinct output targeting memory usage can be obtained using some simple output formatting arguments:

sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <jobid>

Example output:
[test.user@cstl001 imb]$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 113
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
113.0               8         cst[001-008]    464196K    982928K    300810K    851119K

Many different formatting options can be specified, see the man page for details.

Slurm writes the standard output and standard error files in near real time, so you can follow a job's progress by inspecting the stdout and stderr files specified in the job script while it runs.
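For example, with the output file name used in the script above (--output=bench.out.%J, where %J expands to the job ID), the output of a running job could be followed live with tail:

```shell
# Follow the stdout file of job 110 as it is written:
tail -f bench.out.110
```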

Killing a Job

If you have a job running that you wish to cancel, it is easily terminated using the job ID that is returned at submission and shown in squeue output. Slurm is particularly robust at removing running jobs.

[test.user@cstl001 imb]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               122   compute imb_benc test.use PD       0:00      8 ccs[001-008]
               120   compute imb_benc test.use  R       0:17      8 ccs[001-008]
               121   compute imb_benc test.use  R       0:17      8 ccs[001-008]

[test.user@cstl001 imb]$ scancel 122

[test.user@cstl001 imb]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               120   compute imb_benc test.use  R       0:17      8 ccs[001-008]
               121   compute imb_benc test.use  R       0:17      8 ccs[001-008]

If you wish to cancel all your running and queued jobs, then use:
scancel -u username
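scancel can also target jobs more selectively. A sketch using its standard filtering options:

```shell
# Cancel only your pending jobs, leaving running ones alone:
scancel -u username --state=PENDING

# Cancel all of your jobs with a particular name:
scancel -u username --name=imb_bench
```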

Completed Jobs

Once a job has completed – it is no longer visible in the output from squeue and its output files are complete – we can use a different command, sacct, to get job statistics:

[test.user@cstl001 imb]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
104           imb_bench    compute    scw1000         32  COMPLETED      0:0 
104.batch         batch               scw1000         32  COMPLETED      0:0 
104.0         pmi_proxy               scw1000          2  COMPLETED      0:0 
105           imb_bench    compute    scw1000         32    TIMEOUT      1:0 
105.batch         batch               scw1000         32  CANCELLED     0:15 
105.0         pmi_proxy               scw1000          8 CANCELLED+      0:9 
106           imb_bench    compute    scw1000         32 CANCELLED+      0:0 
106.batch         batch               scw1000         32  CANCELLED     0:15 
106.0         pmi_proxy               scw1000          8  COMPLETED      0:0

In this case, we see three separate completed jobs. Job 104 completed successfully, job 105 ran over its time limit, and job 106 was cancelled by the user.

We also see that one submitted job has resulted in three accounted task steps – the different parts executed by the job. If a single job were to call mpirun multiple times, for example to divide a job allocation in two or to run one parallel task after another, then we would see multiple parallel task steps. This is because MPI interacts directly with Slurm to take advantage of faster task launching.
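A minimal sketch of a job script that would produce multiple accounted parallel steps (the benchmark selections PingPong and Allreduce are illustrative choices of IMB benchmarks):

```shell
#!/bin/bash --login
#SBATCH --job-name=two_steps
#SBATCH --ntasks=80
#SBATCH --time=0-00:40

module load compiler/intel mpi/intel

# Each mpirun launch is accounted as its own task step in sacct
# output (e.g. <jobid>.0 and <jobid>.1):
mpirun $MPI_HOME/intel64/bin/IMB-MPI1 PingPong
mpirun $MPI_HOME/intel64/bin/IMB-MPI1 Allreduce
```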

We can also format the output of sacct in a very similar way to sstat:

[test.user@cstl001 imb]$ sacct --format JobID,jobname,NTasks,AllocCPUS,CPUTime,Start,End
       JobID    JobName   NTasks  AllocCPUS    CPUTime               Start                 End 
------------ ---------- -------- ---------- ---------- ------------------- ------------------- 
104           imb_bench                  32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
104.batch         batch        1         32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
104.0         pmi_proxy        2          2   00:08:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
105           imb_bench                  32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05 
105.batch         batch        1         32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05 
105.0         pmi_proxy        8          8   01:00:00 2015-07-21T11:15:45 2015-07-21T11:23:15 
106           imb_bench                  32   00:38:24 2015-07-21T11:40:53 2015-07-21T11:42:05 
106.batch         batch        1         32   00:39:28 2015-07-21T11:40:53 2015-07-21T11:42:07 
106.0         pmi_proxy        8          8   00:09:52 2015-07-21T11:40:54 2015-07-21T11:42:08


Again, the man pages for the Slurm commands should be referenced for a full set of possible output fields.
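sacct can also summarise historical jobs over a time window, which is useful once jobs have aged out of the default reporting period. A sketch using the date from the examples above:

```shell
# Summarise all of your jobs that started since a given date:
sacct --starttime 2015-07-21 --format JobID,JobName,Elapsed,State,ExitCode
```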

Specifying which project is running the job

If you are a member of multiple projects, use the -A option to sbatch to choose which project the job runs under. This helps ensure that accounting statistics are recorded against the correct project.

If you are only in one project then you don’t have to do this.

sbatch -A scw1000 bench.sh
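Equivalently, the project can be fixed inside the job script itself with the long form of the same option:

```shell
# Charge this job to project scw1000 (long form of -A):
#SBATCH --account=scw1000
```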

You can find a list of your project codes on the “Project Memberships” page on MySCW.

Example Jobs

Please see here for further information on the training tarball that provides a wide variety of example jobs.