Submitting, Monitoring & Cancelling Jobs

Submitting a Job

Running jobs on SCW clusters with Slurm is straightforward. You need a job script, which first defines the resource requirements for the job and then lists the commands to run as a batch script. The resource requirements & other job parameters are specified on lines that start with a #SBATCH directive, so they are treated as comments by anything other than the Slurm scheduler. The commands that follow are written just as if you were typing them at the command line.

For example, a simple script might look like:

#!/bin/bash --login
###
#job name
#SBATCH --job-name=imb_bench
#job stdout file
#SBATCH --output=bench.out.%J
#job stderr file
#SBATCH --error=bench.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:20
#number of parallel processes (tasks) you are requesting - maps to MPI processes
#SBATCH --ntasks=80 
#memory per CPU in MB (one CPU per task here, so effectively per process)
#SBATCH --mem-per-cpu=4000 
#tasks to run per node (change for hybrid OpenMP/MPI) 
#SBATCH --ntasks-per-node=40
###

#now run normal batch commands 
module load compiler/intel mpi/intel

#run Intel MPI Benchmarks with mpirun - will automatically pick up Slurm parallel environment
mpirun $MPI_HOME/intel64/bin/IMB-MPI1

The directives to Slurm are quite clear and self-descriptive. Of particular note is the memory specification – Slurm is very good at scheduling around and subsequently controlling job memory usage. Too low a memory request can result in a job crashing or being cancelled, but too high a value can result in a job waiting for longer than necessary.
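As a rough check of what the example script asks for, the totals can be worked out from its #SBATCH values. The following is a minimal sketch using plain shell arithmetic; the variable names are illustrative only and are not part of Slurm, and it assumes one CPU per task as in the example.

```shell
#!/bin/bash
# Values copied from the #SBATCH lines in the example script above.
ntasks=80
ntasks_per_node=40
mem_per_cpu_mb=4000

# Nodes needed, rounding up if the tasks do not divide evenly.
nodes=$(( (ntasks + ntasks_per_node - 1) / ntasks_per_node ))

# Memory requested across the whole job and on each full node,
# assuming one CPU per task.
total_mem_mb=$(( ntasks * mem_per_cpu_mb ))
mem_per_node_mb=$(( ntasks_per_node * mem_per_cpu_mb ))

echo "nodes=${nodes} total_mem_mb=${total_mem_mb} mem_per_node_mb=${mem_per_node_mb}"
```

The per-node figure is worth comparing against the memory actually available on the nodes of the partition you intend to use.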

Once this is saved in a file, say called bench.sh, running the job is as simple as:

sbatch bench.sh

Slurm will return a job number, which can be used to track, account & cancel the job.
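If the job number is needed later in a script, it can be captured at submission. The sketch below assumes the usual "Submitted batch job <id>" message that sbatch prints; the helper name extract_job_id is hypothetical, and the job number in the demonstration is made up.

```shell
#!/bin/bash
# Hypothetical helper: pull the job number out of the usual
# "Submitted batch job <id>" message printed by sbatch.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration on a sample message (no scheduler needed).
echo "Submitted batch job 109" | extract_job_id

# On a cluster this would typically be used as:
#   jobid=$(sbatch bench.sh | extract_job_id)
# or with the --parsable option of sbatch, which prints the id
# (followed by ";cluster" on multi-cluster setups):
#   jobid=$(sbatch --parsable bench.sh | cut -d';' -f1)
```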

Monitoring Jobs

To see your currently submitted & running jobs, use the squeue command.

For example:

[test.user@cl1 imb]$ squeue 
                   JOBID PARTITION     NAME      USER ST TIME NODES NODELIST(REASON)
                   109     compute imb_benc test.user  R  0:49    2 ccs[0121-0122]
                   110     compute imb_benc test.user  R  3:29    8 cst[001-008]
                   113     compute imb_benc test.user PD  0:00    8 (Resources)

In this case, three jobs are present: two are running (109 and 110) and one is queued/pending (113), awaiting resources.
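With many jobs in the queue, a per-state summary can be easier to read than the full listing. The following is a minimal sketch that counts jobs by the ST column; the helper name count_job_states is hypothetical, and it assumes the default squeue column layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state (the ST column, field 5)
# in default-format squeue output read from standard input.
count_job_states() {
    awk 'NR > 1 && NF { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample squeue output copied from the example above.
sample='JOBID PARTITION     NAME      USER ST TIME NODES NODELIST(REASON)
109     compute imb_benc test.user  R  0:49    2 ccs[0121-0122]
110     compute imb_benc test.user  R  3:29    8 cst[001-008]
113     compute imb_benc test.user PD  0:00    8 (Resources)'

printf '%s\n' "$sample" | count_job_states

# On a cluster:
#   squeue -u "$USER" | count_job_states
# To list only pending jobs, squeue can also filter by state:
#   squeue -u "$USER" -t PD
```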

When will my job run?

SCW systems deploy a Fair Share scheduling policy to share resources in a fair manner between different users & groups. Slurm is very good at planning the future state of the system by assessing user fair shares, job time limits and node availability – and so it can predict job start times with a good degree of accuracy.

You can ask Slurm for an estimate of when your queued jobs will start running. First use the squeue command to list all your running and queued jobs. Select a job that is currently pending, indicated by the PD state, and copy its JOBID. You can then use the following command to print some information about that job. In this example the JOBID is 7000393, but substitute your own JOBID:

[s.a.user@sl1 ~]$ scontrol show job 7000393
JobId=7000393 JobName=ssd_model_a
   UserId=s.a.user(5000000) GroupId=x.g.p.01(5000000) MCS_label=N/A
   Priority=8747 Nice=0 Account=scw1000 QOS=normal
   JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-11-17T19:27:01 EligibleTime=2021-11-17T19:27:01
   AccrueTime=2021-11-17T19:27:01
   StartTime=2021-11-18T20:35:27 EndTime=2021-11-20T06:35:27 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-17T19:36:18
   Partition=compute AllocNode:Sid=sl1:249377
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=15 NumCPUs=600 NumTasks=600 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=600,mem=2400000M,node=1,billing=600
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:* CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/s.a.user/test_data/submit.sh
   WorkDir=/home/s.a.user/test_data/
   StdErr=/home/s.a.user/test_data/ssd_model_a.err.%J
   StdIn=/dev/null
   StdOut=/home/s.a.user/test_data/ssd_model_a.out.%J
   Power=

The StartTime value is the estimate Slurm makes of when the resources will become available to start your job. In this example it is given as 2021-11-18T20:35:27. However, this estimate is somewhat fluid, both because of fair share and because jobs typically do not run to their maximum configured duration. The job could therefore start earlier or later than estimated, and the estimate will change with current system usage. It does, however, provide a good indication of system and job status.
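To pick a single value such as StartTime out of the scontrol output, the Key=Value pairs can be split and filtered. This is a minimal sketch; the helper name job_field is hypothetical, and it assumes values contain no spaces, which holds for the fields shown in the example.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value field from
# "scontrol show job" output read on standard input.
job_field() {
    local key="$1"
    # Put each Key=Value pair on its own line, then match on the key.
    tr -s ' ' '\n' | awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }'
}

# Two lines copied from the example output above.
sample='   JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
   StartTime=2021-11-18T20:35:27 EndTime=2021-11-20T06:35:27 Deadline=N/A'

printf '%s\n' "$sample" | job_field StartTime

# On a cluster:
#   scontrol show job 7000393 | job_field StartTime
# squeue can also report estimated start times directly:
#   squeue --start -j 7000393
```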

Viewing Running Jobs

To get more detailed information on a running job, one can use sstat <jobid>.

By default this gives a verbose set of data. A more succinct output targeting memory usage can be obtained using some simple output formatting arguments:

sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <jobid>

Example output:

[test.user@cstl001 imb]$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 113
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
113.0               8         cst[001-008]    464196K    982928K    300810K    851119K

Many different formatting options can be specified; see the sstat man page for details.
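The MaxRSS figure is useful for checking a memory request. The sketch below converts the value from the example output to MB and sets it beside a request of 4000 MB per CPU. It assumes the job was submitted with --mem-per-cpu=4000 as in the earlier script; the helper name kib_to_mb is hypothetical, and it handles only values with a K suffix (sstat may use other suffixes for larger values).

```shell
#!/bin/bash
# Hypothetical helper: convert a sstat memory value such as "464196K"
# to whole megabytes. Only the K suffix is handled.
kib_to_mb() {
    local value="${1%K}"    # strip the trailing K
    echo $(( value / 1024 ))
}

max_rss="464196K"      # MaxRSS from the example output above
requested_mb=4000      # assumed --mem-per-cpu value

used_mb=$(kib_to_mb "$max_rss")
echo "peak ${used_mb} MB used of ${requested_mb} MB requested per CPU"
```

In this example the peak usage is a small fraction of the request, which suggests the request could be lowered to help the job schedule sooner.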

Slurm writes the standard output and standard error files in close to real time, so you can follow the progress of a job by looking at the stdout and stderr files specified in the job script while the job runs.

Killing a Job

If you have a job that you wish to cancel, it is easy to terminate using the job ID that is returned at submission and shown in squeue output. Slurm is particularly robust at removing running jobs.

[test.user@cstl001 imb]$ squeue 
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   122   compute imb_benc test.use PD       0:00      8 ccs[001-008]
                   120   compute imb_benc test.use  R       0:17      8 ccs[001-008]
                   121   compute imb_benc test.use  R       0:17      8 ccs[001-008]

[test.user@cstl001 imb]$ scancel 122

[test.user@cstl001 imb]$ squeue 
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                   120   compute imb_benc test.use  R       0:17      8 ccs[001-008]
                   121   compute imb_benc test.use  R       0:17      8 ccs[001-008]

If you wish to cancel all your running and queued jobs, then use:

scancel -u username
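scancel can also restrict what it removes, for example to pending jobs only. The following is a sketch of a cautious wrapper that prints the command rather than running it unless DRY_RUN=0 is set; the function name cancel_pending and the DRY_RUN variable are hypothetical, while --user and --state are standard scancel options.

```shell
#!/bin/bash
# Hypothetical wrapper: cancel only the pending jobs of a user.
# By default it prints the scancel command so the selection can be
# checked; set DRY_RUN=0 to actually run it.
cancel_pending() {
    local user="$1"
    local cmd=(scancel --user="$user" --state=PENDING)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "${cmd[@]}"
    else
        "${cmd[@]}"
    fi
}

# Dry run for the example user: prints the command only.
cancel_pending test.user
```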

Completed Jobs

Once a job has completed – it is no longer visible in the output from squeue and the output files are completed – we can use a different command to get job statistics:

[test.user@cstl001 imb]$ sacct
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    104           imb_bench    compute    scw1000     32      COMPLETED      0:0
    104.batch         batch               scw1000     32      COMPLETED      0:0
    104.0         pmi_proxy               scw1000      2      COMPLETED      0:0
    105           imb_bench    compute    scw1000     32        TIMEOUT      1:0
    105.batch         batch               scw1000     32      CANCELLED     0:15
    105.0         pmi_proxy               scw1000      8      CANCELLED+     0:9
    106           imb_bench    compute    scw1000     32      CANCELLED+     0:0
    106.batch         batch               scw1000     32      CANCELLED     0:15
    106.0         pmi_proxy               scw1000      8      COMPLETED      0:0

In this case, we see three separate completed jobs. Job 104 completed successfully. Job 105 ran over its time limit. Job 106 was cancelled by the user.
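To pick out the jobs that did not complete successfully, sacct output can be filtered. This sketch assumes the pipe-separated layout produced by the --parsable2 option; the sample lines are hand-written in that layout from the table above (real output may spell states out more fully), and the helper name failed_jobs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: from pipe-separated "JobID|State|ExitCode"
# lines, print top-level jobs (no ".step" suffix) whose state is
# anything other than COMPLETED.
failed_jobs() {
    awk -F'|' '$1 !~ /\./ && $2 !~ /^COMPLETED/ { print $1, $2, $3 }'
}

# Sample lines rewritten by hand from the table above.
sample='104|COMPLETED|0:0
104.batch|COMPLETED|0:0
105|TIMEOUT|1:0
105.batch|CANCELLED|0:15
106|CANCELLED+|0:0
106.batch|CANCELLED|0:15'

printf '%s\n' "$sample" | failed_jobs

# On a cluster:
#   sacct --noheader --parsable2 --format=JobID,State,ExitCode | failed_jobs
```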

We also see that each submitted job has resulted in three accounted entries: the job itself, its batch step, and one parallel task step, each a different part executed by the job. If a single job were to call mpirun multiple times, for example to divide a job allocation in two or to run one parallel task after another, we would see multiple parallel task steps. This is because MPI interacts directly with Slurm to take advantage of faster task launching.

We can also format the output of sacct in a very similar way to sstat:

[test.user@cstl001 imb]$ sacct --format JobID,jobname,NTasks,AllocCPUS,CPUTime,Start,End
       JobID    JobName   NTasks  AllocCPUS    CPUTime               Start         End 
------------ ---------- -------- ---------- ---------- ------------------- ------------------- 
104           imb_bench       32              02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
104.batch         batch        1         32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
104.0         pmi_proxy        2          2   00:08:40 2015-07-21T11:03:14 2015-07-21T11:07:34 
105           imb_bench       32              10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05 
105.batch         batch        1         32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05 
105.0         pmi_proxy        8          8   01:00:00 2015-07-21T11:15:45 2015-07-21T11:23:15 
106           imb_bench       32              00:38:24 2015-07-21T11:40:53 2015-07-21T11:42:05 
106.batch         batch        1         32   00:39:28 2015-07-21T11:40:53 2015-07-21T11:42:07 
106.0         pmi_proxy        8          8   00:09:52 2015-07-21T11:40:54 2015-07-21T11:42:08

Again, the man pages for the Slurm commands should be referenced for a full set of possible output fields.
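The CPUTime column is the elapsed time multiplied by the number of allocated CPUs, which explains why it is much larger than the wall-clock duration. The figures for job 104 above can be reproduced with a short calculation; the helper name to_hms is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: format a number of seconds as HH:MM:SS.
to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Job 104 from the table above: Start 11:03:14, End 11:07:34, 32 CPUs.
start=$(( 11*3600 + 3*60 + 14 ))
end=$(( 11*3600 + 7*60 + 34 ))
elapsed=$(( end - start ))    # wall-clock seconds
cpus=32

to_hms "$elapsed"                 # elapsed wall-clock time
to_hms $(( elapsed * cpus ))      # CPUTime as reported by sacct
```

The same calculation with 2 CPUs gives the 00:08:40 shown for step 104.0.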

Specifying which project is running the job

If you are a member of multiple projects, use the -A option to sbatch to choose which project the job runs under. This will help ensure that accounting statistics are correct for each project.

If you are only in one project then you don’t have to do this.

sbatch -A scw1000 bench.sh

You can find a list of your project codes on the “Project Memberships” page on MySCW.
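The project can alternatively be recorded in the job script itself, so that it does not need to be given on every submission. The fragment below is a sketch using the long form of the -A option, --account, with the example project code scw1000; substitute your own project code.

```shell
#!/bin/bash --login
#project (account) to charge the job to
#SBATCH --account=scw1000
```

An -A option given on the sbatch command line overrides the value set in the script.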

Example Jobs

Please see here for further information on the training tarball that provides a wide variety of example jobs.