Submitting a Job
Running jobs on SCW clusters with Slurm is straightforward. You need a job script, which first defines the resource requirements for the job and then lists the batch of commands to run. The resource requirements and other job parameters are specified on lines that start with a #SBATCH directive, so anything other than the Slurm scheduler treats them as comments. The commands to run are written just as if you were typing them at the command line.
For example, a simple script might look like:
#!/bin/bash --login
###
#job name
#SBATCH --job-name=imb_bench
#job stdout file
#SBATCH --output=bench.out.%J
#job stderr file
#SBATCH --error=bench.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:20
#number of parallel processes (tasks) you are requesting - maps to MPI processes
#SBATCH --ntasks=80
#memory per process in MB
#SBATCH --mem-per-cpu=4000
#tasks to run per node (change for hybrid OpenMP/MPI)
#SBATCH --ntasks-per-node=40
###
#now run normal batch commands
module load compiler/intel mpi/intel
#run Intel MPI Benchmarks with mpirun - will automatically pick up Slurm parallel environment
mpirun $MPI_HOME/intel64/bin/IMB-MPI1
The directives to Slurm are quite clear and self-descriptive. Of particular note is the memory specification – Slurm is very good at scheduling around and subsequently controlling job memory usage. Too low a memory request can result in a job crashing or being cancelled, but too high a value can result in a job waiting for longer than necessary.
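As a rough check on the memory request, the per-node total implied by the example script can be worked out from its own directives. The sketch below only does the arithmetic. The variable names are illustrative, it assumes one CPU per task, and the memory actually available on a node is not stated here, so compare the result against your cluster's documentation.

```shell
# Values copied from the #SBATCH lines in the example script
ntasks=80
ntasks_per_node=40
mem_per_cpu_mb=4000

# Nodes needed when each node runs ntasks_per_node tasks (rounded up)
nodes=$(( (ntasks + ntasks_per_node - 1) / ntasks_per_node ))

# Memory requested on each node, assuming one CPU per task
mem_per_node_mb=$(( ntasks_per_node * mem_per_cpu_mb ))

echo "nodes=$nodes mem_per_node_mb=$mem_per_node_mb"
```

If the per-node figure exceeds what a node can offer, the job is likely to wait in the queue rather than start, so lowering --mem-per-cpu or --ntasks-per-node would be the first things to try.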
Once this is saved in a file, say bench.sh, running the job is as simple as:
sbatch bench.sh
Slurm will return a job number, which can be used to track, account for and cancel the job.
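When submitting from a script it can be handy to capture that job number in a variable. The following is a minimal sketch, assuming sbatch prints its usual confirmation line of the form "Submitted batch job <id>"; the helper name extract_job_id is hypothetical and a sample line stands in for real sbatch output.

```shell
# Pull the job ID out of sbatch's confirmation line (assumed format:
# "Submitted batch job <id>"); reads the line from standard input.
extract_job_id() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Sample line standing in for real sbatch output
job_id=$(printf '%s\n' 'Submitted batch job 113' | extract_job_id)
echo "$job_id"

# On a cluster this would be: job_id=$(sbatch bench.sh | extract_job_id)
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler where it suits your workflow.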
To see your currently submitted and running jobs, use the command squeue.
[test.user@cl1 imb]$ squeue
JOBID PARTITION     NAME      USER ST  TIME NODES NODELIST(REASON)
  109   compute imb_benc test.user  R  0:49     2 ccs[0121-0122]
  110   compute imb_benc test.user  R  3:29     8 cst[001-008]
  113   compute imb_benc  test.use PD  0:00     8 (Resources)
In this case, three jobs are present: two are running (109 and 110) and one is queued/pending (113), awaiting resources.
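The same tally can be produced automatically. This is a small sketch, assuming the default squeue column layout shown in the listing above, where the state (ST) is the fifth column; count_state is a hypothetical helper and the sample text is copied from that listing.

```shell
# Count jobs in a given state (e.g. R or PD) from squeue output on stdin.
# Assumes the default column layout, where ST is the fifth column.
count_state() {
    awk -v st="$1" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

# Sample output copied from the squeue listing
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
109 compute imb_benc test.user R 0:49 2 ccs[0121-0122]
110 compute imb_benc test.user R 3:29 8 cst[001-008]
113 compute imb_benc test.use PD 0:00 8 (Resources)'

running=$(printf '%s\n' "$sample" | count_state R)
pending=$(printf '%s\n' "$sample" | count_state PD)
echo "running=$running pending=$pending"

# On a cluster: squeue | count_state PD
```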
When will my job run?
SCW systems use a Fair Share scheduling policy to share resources fairly between different users and groups. With the previous LSF-based systems it was possible to view all queued and running jobs from all users and thus get a ‘feel’ for how busy the systems were and when a job might run. With Slurm, job data is private, so this view of total system state is no longer available.
However, Slurm is very good at planning the future state of the system by assessing user fair shares, job time limits and node availability – and so it can predict job start times with a good degree of accuracy. There are a number of ways to view this information, but we have installed an additional tool called ‘slurmtop’, which will present a view of your currently queued and running jobs, including a countdown to start for queued jobs.
Example ‘slurmtop’ output:
We see in the ‘slurmtop’ output:
- the first line shows the current status of our usage of the cluster.
- the second line shows the overall status of the compute nodes of the cluster.
- the grid shows – with one character per processor – the compute nodes of the cluster, and will locate our jobs when they are running.
- the job list shows our queued and running jobs, plus any that have completed within the last five minutes. Note that the ‘Elapsed’ time column shows a negative ‘count-down’ to start time for queued jobs.
It’s worth noting that the prediction of future runs is somewhat fluid, both because of fair share and because jobs typically do not run to their maximum configured duration. Nevertheless, it gives a good indication of system and job status.
Viewing Running Jobs
To get more detailed information on a running job, one can use sstat <jobid>.
By default this gives a verbose set of data. A more succinct output targeting memory usage can be obtained using some simple output formatting arguments:
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <jobid>
[test.user@cstl001 imb]$ sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 113
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
113.0               8         cst[001-008]    464196K    982928K    300810K    851119K
Many different formatting options can be specified, see the man page for details.
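The memory figures are most useful when set against what the job requested. As a rough sketch, the MaxRSS value from the sstat output above can be converted to megabytes and compared with the --mem-per-cpu value from the example job script. The helper kb_to_mb is hypothetical and assumes the value carries a K suffix as shown; other suffixes are not handled.

```shell
# Convert an sstat memory value such as "464196K" to whole megabytes.
# Assumes a trailing K suffix, as in the sstat output shown above.
kb_to_mb() {
    kb=${1%K}                # strip the trailing K
    echo $(( kb / 1024 ))    # integer division, rounds down
}

max_rss_mb=$(kb_to_mb 464196K)
mem_per_cpu_mb=4000          # from --mem-per-cpu in the example job script

echo "MaxRSS is ${max_rss_mb} MB of ${mem_per_cpu_mb} MB requested per process"
```

Here the largest task used roughly a ninth of its request, which suggests the request could be lowered for similar runs, leaving some headroom.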
Slurm writes the standard error and standard output files in near real time, so you can follow job progress at runtime by looking at the stdout and stderr files specified in the job script.
Killing a Job
If you have a job that you wish to cancel for some reason, you can terminate it with scancel and the job ID, which is returned at submission and also shown in squeue output. Slurm is particularly robust at removing running jobs.
[test.user@cstl001 imb]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  122   compute imb_benc test.use PD  0:00     8 ccs[001-008]
  120   compute imb_benc test.use  R  0:17     8 ccs[001-008]
  121   compute imb_benc test.use  R  0:17     8 ccs[001-008]
[test.user@cstl001 imb]$ scancel 122
[test.user@cstl001 imb]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  120   compute imb_benc test.use  R  0:17     8 ccs[001-008]
  121   compute imb_benc test.use  R  0:17     8 ccs[001-008]
If you wish to cancel all your running and queued jobs, then use:
scancel -u username
Once a job has completed – it is no longer visible in the output from squeue and the output files are completed – we can use a different command to get job statistics:
[test.user@cstl001 imb]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
104           imb_bench    compute    scw1000         32  COMPLETED      0:0
104.batch         batch               scw1000         32  COMPLETED      0:0
104.0         pmi_proxy               scw1000          2  COMPLETED      0:0
105           imb_bench    compute    scw1000         32    TIMEOUT      1:0
105.batch         batch               scw1000         32  CANCELLED     0:15
105.0         pmi_proxy               scw1000          8 CANCELLED+      0:9
106           imb_bench    compute    scw1000         32 CANCELLED+      0:0
106.batch         batch               scw1000         32  CANCELLED     0:15
106.0         pmi_proxy               scw1000          8  COMPLETED      0:0
In this case, we see three separate complete jobs. Job 104 completed successfully. Job 105 ran over its time limit. Job 106 was cancelled by the user.
We also see that one submitted job has resulted in three accounted task steps, which are the different parts executed by the job. If a single job were to call mpirun multiple times, for example to divide a job allocation in two or to run one parallel task after another, then we would see multiple parallel task steps. This is because MPI interacts directly with Slurm to take advantage of faster task launching.
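With many jobs it helps to pick out only those that did not finish cleanly. The sketch below filters sacct output for top-level jobs whose State is anything other than COMPLETED. It assumes the default columns shown in the sacct listing above; not_completed is a hypothetical helper, and the sample is an excerpt of that listing.

```shell
# List top-level jobs whose State is not COMPLETED, from sacct output on stdin.
# Step lines (JobID containing a dot) are skipped, so State is always the
# sixth field on the lines that remain. The first two lines are headers.
not_completed() {
    awk 'NR > 2 && $1 !~ /\./ && $6 != "COMPLETED" { print $1, $6 }'
}

# Excerpt of the sacct listing
sample='JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
104 imb_bench compute scw1000 32 COMPLETED 0:0
104.batch batch scw1000 32 COMPLETED 0:0
105 imb_bench compute scw1000 32 TIMEOUT 1:0
105.batch batch scw1000 32 CANCELLED 0:15
106 imb_bench compute scw1000 32 CANCELLED+ 0:0'

result=$(printf '%s\n' "$sample" | not_completed)
echo "$result"

# On a cluster: sacct | not_completed
```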
We can also format the output of sacct in a very similar way to sstat:
[test.user@cstl001 imb]$ sacct --format JobID,jobname,NTasks,AllocCPUS,CPUTime,Start,End
       JobID    JobName   NTasks  AllocCPUS    CPUTime               Start                 End
------------ ---------- -------- ---------- ---------- ------------------- -------------------
104           imb_bench                  32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34
104.batch         batch        1         32   02:18:40 2015-07-21T11:03:14 2015-07-21T11:07:34
104.0         pmi_proxy        2          2   00:08:40 2015-07-21T11:03:14 2015-07-21T11:07:34
105           imb_bench                  32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05
105.batch         batch        1         32   10:51:12 2015-07-21T11:15:44 2015-07-21T11:36:05
105.0         pmi_proxy        8          8   01:00:00 2015-07-21T11:15:45 2015-07-21T11:23:15
106           imb_bench                  32   00:38:24 2015-07-21T11:40:53 2015-07-21T11:42:05
106.batch         batch        1         32   00:39:28 2015-07-21T11:40:53 2015-07-21T11:42:07
106.0         pmi_proxy        8          8   00:09:52 2015-07-21T11:40:54 2015-07-21T11:42:08
Again, the man pages for the Slurm commands should be referenced for a full set of possible output fields.
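To make the CPUTime column less opaque: it is the elapsed wall-clock time multiplied by the number of allocated CPUs. The sketch below reproduces the figure for job 104 from its Start and End times in the sacct output above; to_hms is a hypothetical helper and the elapsed time is entered by hand rather than parsed.

```shell
# Format a number of seconds as HH:MM:SS
to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Job 104: Start 11:03:14, End 11:07:34, so 4 min 20 s elapsed on 32 CPUs
elapsed_s=$(( 4 * 60 + 20 ))
alloc_cpus=32

# CPUTime = elapsed time x allocated CPUs
cpu_time=$(to_hms $(( elapsed_s * alloc_cpus )))
echo "$cpu_time"
```

The result matches the 02:18:40 reported for job 104, which shows that CPUTime reflects the size of the allocation and not how busy the processors actually were.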
Specifying which project is running the job
If you are a member of multiple projects, use the -A option to sbatch to choose which project the job runs under. This will help ensure that accounting statistics are correct for each project.
If you are only in one project then you don’t have to do this.
sbatch -A scw1000 bench.sh
You can find a list of your project codes on the “Project Memberships” page on MySCW.
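As an alternative to passing -A on the command line each time, the project can be recorded in the job script itself using the long form of the same option, --account. The fragment below is a sketch of how the top of the example script might look; scw1000 is the example project code used on this page and should be replaced with one of your own.

```shell
#!/bin/bash --login
#job name
#SBATCH --job-name=imb_bench
#project (account) to charge; scw1000 is the example code from this page
#SBATCH --account=scw1000
```

An option given on the sbatch command line takes precedence over the same option in the script, so a script default can still be overridden per submission.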
Please see here for further information on the training tarball that provides a wide variety of example jobs.