Submitting a Job
Running jobs on SCW clusters using Slurm is a simple process. You need a job script, which first defines the resource requirements for the job and then lists the commands to run. The resource requirements and other job parameters are specified on lines that start with an #SBATCH directive, so they are treated as comments by everything other than the Slurm scheduler. The commands themselves are written just as if you were typing them at the command line.
For example, a simple script might look like:
#!/bin/bash --login
###
#job name
#SBATCH --job-name=imb_bench
#job stdout file
#SBATCH --output=bench.out.%J
#job stderr file
#SBATCH --error=bench.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:20
#number of parallel processes (tasks) you are requesting - maps to MPI processes
#SBATCH --ntasks=80
#memory per process in MB
#SBATCH --mem-per-cpu=4000
#tasks to run per node (change for hybrid OpenMP/MPI)
#SBATCH --ntasks-per-node=40
###

#now run normal batch commands
module load compiler/intel mpi/intel
#run Intel MPI Benchmarks with mpirun - will automatically pick up Slurm parallel environment
mpirun $MPI_HOME/intel64/bin/IMB-MPI1
The directives to Slurm are quite clear and self-descriptive. Of particular note is the memory specification: Slurm is very good at scheduling around, and subsequently controlling, job memory usage. Too low a memory request can result in a job crashing or being cancelled, while too high a value can leave a job waiting longer than necessary.
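When choosing a value for --mem-per-cpu, it can help to check what the request adds up to across the whole job. A minimal sketch, using the figures from the example script above (80 tasks, 4000 MB per CPU):

```shell
# Total memory a job requests is (number of tasks) x (memory per CPU),
# using the values from the example script above.
ntasks=80
mem_per_cpu_mb=4000
total_mb=$((ntasks * mem_per_cpu_mb))
echo "Total memory requested: ${total_mb} MB ($((total_mb / 1024)) GB)"
# prints: Total memory requested: 320000 MB (312 GB)
```

Spread over the 2 nodes implied by --ntasks-per-node=40, that is 160000 MB per node, which must fit within the memory of the nodes in the chosen partition.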
Once this is saved in a file, say called bench.sh, running the job is as simple as:

sbatch bench.sh
Slurm will return a job number, which can be used to track, account for, and cancel the job.
To see your currently submitted and running jobs, use the squeue command.
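In scripts it is often useful to capture that job number at submission time. A small sketch; here the sbatch output line is hard-coded for illustration, but in practice it would come from running sbatch on your script:

```shell
# sbatch prints a line like "Submitted batch job 12345"; the job ID is the
# fourth word. In real use: submit_output=$(sbatch bench.sh)
submit_output="Submitted batch job 12345"
jobid=$(echo "$submit_output" | awk '{print $4}')
echo "Job ID: ${jobid}"
# prints: Job ID: 12345
```

The captured ID can then be passed to squeue, sstat, or scancel. Alternatively, sbatch's --parsable option prints just the job ID, so jobid=$(sbatch --parsable bench.sh) achieves the same without any parsing.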
[test.user@cl1 imb]$ squeue
 JOBID PARTITION     NAME      USER ST  TIME  NODES NODELIST(REASON)
   109   compute imb_benc test.user  R  0:49      2 ccs[0121-0122]
   110   compute imb_benc test.user  R  3:29      8 cst[001-008]
   113   compute imb_benc test.user PD  0:00      8 (Resources)
When will my job run?
SCW systems deploy a fair-share scheduling policy to share resources fairly between different users and groups. On the previous LSF-based systems it was possible to view all queued and running jobs from all users, and thus get a feel for how busy the systems were and when a job might run. This is no longer the case with Slurm, as job data is private, so such a view of total system state is not available.
However, Slurm is very good at planning the future state of the system by assessing user fair shares, job time limits and node availability, and so it can predict job start times with a good degree of accuracy. A broad overview of the system's free and in-use resources can be displayed using the slurmtop command, which also displays all currently queued and running jobs.
Example ‘slurmtop’ output:
slurmtop - Wed Nov 17 18:44:44 2021 - 22 users, 2 starving
Jobs: 104 total, 65 running, 39 waiting, 0 suspended
Nodes: 138 total, 118 allocated, 13 idle, 7 down, 48438 watts, 516750914 joules consumed
CPUs: 5648 total, 4449 allocated load 78.77%
Memory: 26361878m allocated, 51026598m free

USER      ACCOUNTS  NB JOBS  NB NODES  PARTITIONS
s.a.user  scw1000         5        40  compute
x.t.u.01  scw1001         4        19  compute
x.t.u.31  scw1070         1         1  gpu
s.999111  scw1398         2         2  gpu
s.999111  scw1398         3   pending  compute
x.t.u.01  scw1001        18   pending  compute
We see in the ‘slurmtop’ output:
- Jobs: shows the total jobs in the queue, with a breakdown of their state (running/waiting/suspended).
- Nodes: shows the total number of nodes on the system, with a breakdown of their state (allocated/idle/down).
- CPUs: shows the number of CPUs on the system, the number that are allocated, and the current load on the system.
- Memory: shows the amount of allocated versus free memory on the system.
- Following this is a list of jobs currently running or pending, showing:
- USER: who submitted the job.
- ACCOUNTS: the project ID the job will be registered against, in the form scwXXXX.
- NB JOBS: the number of jobs the user has in that state (running or pending).
- NB NODES: the number of nodes the user is using across all their running jobs.
- PARTITIONS: the partition where the running/pending job will be executed, typically compute or gpu, but there are various partitions available to users.
You can ask Slurm for an estimate of when your queued jobs will start running. First use the squeue command to list all your running and queued jobs. Select a job that is currently pending, indicated by the PD state, and copy its JOBID. You can then use the following command to print information about that job. In this example the JOBID is 7000393; change this to your own JOBID:
[s.a.user@sl1 ~]$ scontrol show job 7000393
JobId=7000393 JobName=ssd_model_a
   UserId=s.a.user(5000000) GroupId=x.g.p.01(5000000) MCS_label=N/A
   Priority=8747 Nice=0 Account=scw1000 QOS=normal
   JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-11-17T19:27:01 EligibleTime=2021-11-17T19:27:01
   AccrueTime=2021-11-17T19:27:01
   StartTime=2021-11-18T20:35:27 EndTime=2021-11-20T06:35:27 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-11-17T19:36:18
   Partition=compute AllocNode:Sid=sl1:249377
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=15 NumCPUs=600 NumTasks=600 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=600,mem=2400000M,node=1,billing=600
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:* CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/s.a.user/test_data/submit.sh
   WorkDir=/home/s.a.user/test_data/
   StdErr=/home/s.a.user/test_data/ssd_model_a.err.%J
   StdIn=/dev/null
   StdOut=/home/s.a.user/test_data/ssd_model_a.out.%J
   Power=
The StartTime value provides an estimate of when Slurm thinks the resources will become available to start your job. In this example it is given as 2021-11-18T20:35:27. However, this estimate is somewhat fluid, both because of fair-share scheduling and because jobs typically do not run to their maximum configured duration. The job could therefore start earlier or later than estimated, and the estimate will change with current system usage. It does, however, provide a good indication of system and job status.
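If you only want the estimate itself rather than the full scontrol report, the StartTime field can be extracted from the output. A small sketch; the line below is hard-coded from the example output above, but in practice it would come from scontrol show job <jobid>:

```shell
# Extract the StartTime value from a line of scontrol output.
# In real use: line=$(scontrol show job <jobid> | grep StartTime)
line="   StartTime=2021-11-18T20:35:27 EndTime=2021-11-20T06:35:27 Deadline=N/A"
start=$(echo "$line" | tr ' ' '\n' | grep '^StartTime=' | cut -d= -f2)
echo "Estimated start: ${start}"
# prints: Estimated start: 2021-11-18T20:35:27
```

Alternatively, squeue has a --start option that lists expected start times for all your pending jobs in one go.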
Viewing Running Jobs
To get more detailed information on a running job, one can use sstat <jobid>.
By default this gives a verbose set of data. A more succinct output targeting memory usage can be obtained using some simple output formatting arguments:
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <jobid>
[test.user@cstl001 imb]$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 113
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
113.0               8         cst[001-008]    464196K    982928K    300810K    851119K
Many different formatting options can be specified, see the man page for details.
Slurm writes the standard output and standard error files in near real time. Thus, you can monitor job progress at runtime by looking at the stdout and stderr files specified in the job script.
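For example, with the --output=bench.out.%J pattern from the example script, a job with ID 12345 writes to bench.out.12345. A small sketch, using a locally created stand-in file in place of real job output:

```shell
# Create a stand-in for a job's stdout file (a real one would be written
# by Slurm as the job runs) and read its most recent line.
printf 'step 1 done\nstep 2 done\n' > bench.out.12345
tail -n 1 bench.out.12345
# prints: step 2 done
rm bench.out.12345
```

To follow a real output file live while the job runs, use tail -f bench.out.<jobid> and press Ctrl-C to stop.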
Killing a Job
If you have a running job that you wish to cancel for some reason, it is very easy to terminate it using the job ID that is returned at submission and shown in squeue output. Slurm is particularly robust at removing running jobs.
[test.user@cstl001 imb]$ squeue
 JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
   122   compute imb_benc test.use PD  0:00      8 ccs[001-008]
   120   compute imb_benc test.use  R  0:17      8 ccs[001-008]
   121   compute imb_benc test.use  R  0:17      8 ccs[001-008]
[test.user@cstl001 imb]$ scancel 122
[test.user@cstl001 imb]$ squeue
 JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
   120   compute imb_benc test.use  R  0:17      8 ccs[001-008]
   121   compute imb_benc test.use  R  0:17      8 ccs[001-008]
If you wish to cancel all your running and queued jobs, then use:
scancel -u username
Once a job has completed, that is, it is no longer visible in the output from squeue and its output files are complete, we can use a different command to get job statistics:
[test.user@cstl001 imb]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
104           imb_bench    compute    scw1000         32  COMPLETED      0:0
104.batch         batch               scw1000         32  COMPLETED      0:0
104.0         pmi_proxy               scw1000          2  COMPLETED      0:0
105           imb_bench    compute    scw1000         32    TIMEOUT      1:0
105.batch         batch               scw1000         32  CANCELLED     0:15
105.0         pmi_proxy               scw1000          8 CANCELLED+      0:9
106           imb_bench    compute    scw1000         32 CANCELLED+      0:0
106.batch         batch               scw1000         32  CANCELLED     0:15
106.0         pmi_proxy               scw1000          8  COMPLETED      0:0
In this case, we see three separate complete jobs. Job 104 completed successfully. Job 105 ran over its time limit. Job 106 was cancelled by the user.
We also see that a single submitted job results in three accounted task steps, i.e. the different parts executed by the job. If a single job were to call mpirun multiple times, for example when dividing a job allocation in two or running one parallel task after another, then we would see multiple parallel task steps. This is because MPI interacts directly with Slurm to take advantage of faster task launching.
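To illustrate, a hypothetical job script along the following lines would produce two parallel task steps in sacct (steps <jobid>.0 and <jobid>.1), alongside the usual <jobid> and <jobid>.batch entries. The program names are placeholders, not real binaries:

```shell
#!/bin/bash --login
#SBATCH --job-name=two_step
#SBATCH --ntasks=80
#SBATCH --time=0-00:20

module load compiler/intel mpi/intel

# Each mpirun invocation is accounted as its own task step.
mpirun ./first_parallel_task     # becomes step <jobid>.0
mpirun ./second_parallel_task    # becomes step <jobid>.1
```

This is a Slurm job-script fragment, so it is only meaningful when submitted to a cluster with sbatch.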
We can also format the output of sacct in a very similar way to sstat:
[test.user@cstl001 imb]$ sacct --format JobID,jobname,NTasks,AllocCPUS,CPUTime,Start,End
       JobID    JobName   NTasks  AllocCPUS    CPUTime               Start                 End
------------ ---------- -------- ---------- ---------- ------------------- -------------------
104           imb_bench                  32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34
104.batch         batch        1         32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34
104.0         pmi_proxy        2          2   00:08:40 2015-07-21T11:03:14 2015-07-21T11:07:34
105           imb_bench                  32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05
105.batch         batch        1         32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05
105.0         pmi_proxy        8          8   01:00:00 2015-07-21T11:15:45 2015-07-21T11:23:15
106           imb_bench                  32   00:38:24 2015-07-21T11:40:53 2015-07-21T11:42:05
106.batch         batch        1         32   00:39:28 2015-07-21T11:40:53 2015-07-21T11:42:07
106.0         pmi_proxy        8          8   00:09:52 2015-07-21T11:40:54 2015-07-21T11:42:08
Again, the man pages for the Slurm commands should be referenced for a full set of possible output fields.
Specifying which project is running the job
If you are a member of multiple projects, use the -A option to sbatch to specify which project the job runs under. This helps ensure that accounting statistics are correct for each project.
If you are only in one project then you don’t have to do this.
sbatch -A scw1000 bench.sh
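Instead of passing -A on the command line each time, the account can also be set inside the job script itself with an #SBATCH directive, like the other job parameters. A sketch of the top of such a script:

```shell
#!/bin/bash --login
#SBATCH --job-name=imb_bench
#SBATCH --account=scw1000    # equivalent to: sbatch -A scw1000 bench.sh
```

This is a job-script fragment; the remaining directives and commands would follow as in the example at the top of this page.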
Please see here for further information on the training tarball that provides a wide variety of example jobs.