{"id":38,"date":"2015-07-21T10:59:47","date_gmt":"2015-07-21T09:59:47","guid":{"rendered":"https:\/\/portal.supercomputing.wales\/?page_id=38"},"modified":"2018-09-21T12:05:13","modified_gmt":"2018-09-21T11:05:13","slug":"migrating-jobs","status":"publish","type":"page","link":"https:\/\/portal.supercomputing.wales\/index.php\/index\/slurm\/migrating-jobs\/","title":{"rendered":"More On Slurm Jobs &#038; migration"},"content":{"rendered":"<h3>Job Runtime Environment in Slurm<\/h3>\n<p>When a job is submitted, Slurm will store all environment variables in-place at submission time and replicate that environment on the first allocated node where the batch script actually runs.<\/p>\n<p>Additionally, at run time, Slurm will set a number of shell environment variables that relate to the job itself and can be used in the job run. The Slurm documentation&#8217;s manpage on\u00a0<em>sbatch<\/em> provides an exhaustive guide, but we highlight some useful ones here.<\/p>\n<h4><\/h4>\n<h4>#SBATCH Directives<\/h4>\n<p>In line with most batch schedulers, Slurm uses directives in submission scripts to specify job requirements and parameters for a job &#8211; the #SBATCH directives. Thus for an MPI task we might typically have:<br \/>\n<pre class=\"preserve-code-formatting\">#SBATCH -p compute\n#SBATCH -o runout.%J\n#SBATCH -e runerr.%J\n#SBATCH --job-name=mpijob\n#SBATCH -n 80\n#SBATCH --tasks-per-node=40\n#SBATCH --exclusive\n#SBATCH -t 0-12:00\n#SBATCH --mem-per-cpu=4000<\/pre><br \/>\nWalking through these:<\/p>\n\n<table id=\"tablepress-5\" class=\"tablepress tablepress-id-5\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Slurm #SBATCH directive<\/th><th class=\"column-2\">Description<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">#SBATCH --partition=compute<br \/>\nor<br \/>\n#SBATCH -p compute<\/td><td class=\"column-2\">In Slurm, jobs are submitted to 'partitions'.  
Other schedulers call these 'queues'; despite the naming difference, the concept is the same.<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">#SBATCH --output=runout.%J<br \/>\nor<br \/>\n#SBATCH -o runout.%J<\/td><td class=\"column-2\">File in which to store the job's STDOUT. Slurm replaces '%J' with the job number.<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">#SBATCH --error=runerr.%J<br \/>\nor<br \/>\n#SBATCH -e runerr.%J<\/td><td class=\"column-2\">File in which to store the job's STDERR. Slurm replaces '%J' with the job number.<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">#SBATCH --job-name=mpijob<\/td><td class=\"column-2\">Job name, useful for monitoring and setting up inter-job dependencies.<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">#SBATCH --ntasks=80<br \/>\nor<br \/>\n#SBATCH -n 80<\/td><td class=\"column-2\">Total number of tasks (processes) required for the job.<\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-1\">#SBATCH --tasks-per-node=40<\/td><td class=\"column-2\">The number of tasks (processes) to run on each node.<\/td>\n<\/tr>\n<tr class=\"row-8\">\n\t<td class=\"column-1\">#SBATCH --exclusive<\/td><td class=\"column-2\">Exclusive job allocation, i.e. no other users on the allocated nodes.<\/td>\n<\/tr>\n<tr class=\"row-9\">\n\t<td class=\"column-1\">#SBATCH --time=0-12:00<br \/>\nor<br \/>\n#SBATCH -t 0-12:00<\/td><td class=\"column-2\">Maximum runtime of the job, in days-hours:minutes format (here 0 days and 12 hours). It is beneficial to specify a realistic value rather than the partition maximum, as it improves the chances of the scheduler 'back-filling' the job and running it earlier.<\/td>\n<\/tr>\n<tr class=\"row-10\">\n\t<td class=\"column-1\">#SBATCH --mem-per-cpu=4000<\/td><td class=\"column-2\">Memory required per allocated CPU, in megabytes by default. Slurm schedules memory as a consumable resource, so accurate values help jobs start sooner. 
<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-5 from cache -->\n<p>&nbsp;<\/p>\n<h4>Environment Variables<\/h4>\n<p>Once an allocation has been scheduled and a job script is started (on the first node of the allocation), Slurm sets a number of shell environment variables that can be used in the script at runtime. Below is a summary of some of the most useful:<\/p>\n\n<table id=\"tablepress-6\" class=\"tablepress tablepress-id-6\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Slurm<\/th><th class=\"column-2\">Description<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-striping row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">$SLURM_JOBID<\/td><td class=\"column-2\">Job ID.<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">$SLURM_JOB_NODELIST<\/td><td class=\"column-2\">List of the nodes allocated to the job, i.e. those with at least one task on them.<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">$SLURM_ARRAY_TASK_ID<\/td><td class=\"column-2\">For an array job, the index of the current array task.<\/td>\n<\/tr>\n<tr class=\"row-5\">\n\t<td class=\"column-1\">$SLURM_JOB_NAME<\/td><td class=\"column-2\">Job name.<\/td>\n<\/tr>\n<tr class=\"row-6\">\n\t<td class=\"column-1\">$SLURM_JOB_PARTITION<\/td><td class=\"column-2\">Partition that the job was submitted to.<\/td>\n<\/tr>\n<tr class=\"row-7\">\n\t<td class=\"column-1\">$SLURM_JOB_NUM_NODES<\/td><td class=\"column-2\">Number of nodes allocated to this job. 
<\/td>\n<\/tr>\n<tr class=\"row-8\">\n\t<td class=\"column-1\">$SLURM_NTASKS<\/td><td class=\"column-2\">Number of tasks (processes) allocated to this job.<\/td>\n<\/tr>\n<tr class=\"row-9\">\n\t<td class=\"column-1\">$SLURM_NTASKS_PER_NODE<br \/>\n(Only set if the --ntasks-per-node option is specified)<\/td><td class=\"column-2\">Number of tasks (processes) per node.<\/td>\n<\/tr>\n<tr class=\"row-10\">\n\t<td class=\"column-1\">$SLURM_SUBMIT_DIR<\/td><td class=\"column-2\">Directory in which job was submitted.<\/td>\n<\/tr>\n<tr class=\"row-11\">\n\t<td class=\"column-1\">$SLURM_SUBMIT_HOST<\/td><td class=\"column-2\">Host on which job was submitted.<\/td>\n<\/tr>\n<tr class=\"row-12\">\n\t<td class=\"column-1\">$SLURM_PROC_ID<\/td><td class=\"column-2\">The process (task) ID within the job. This will start from zero and go up to $SLURM_NTASKS-1.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-6 from cache -->\n<p>&nbsp;<\/p>\n<h3>System Queues &amp; Partitions in Slurm<\/h3>\n<p>Please use the <em>sinfo <\/em>command to see the names of the partitions (queues) to use in your job scripts. If not specified, the default partition will be used for job submissions. <em>sinfo -s <\/em>will give a more succinct partition list. Please see the <a href=\"https:\/\/portal.supercomputing.wales\/index.php\/about-hawk\/\">Hawk<\/a> and <a href=\"https:\/\/portal.supercomputing.wales\/index.php\/about-sunbird\/\">Sunbird<\/a> pages for a list of partitions and their descriptions on each system.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Job Runtime Environment in Slurm When a job is submitted, Slurm will store all environment variables in-place at submission time and replicate that environment on the first allocated node where the batch script actually runs. 
Additionally, at run time, Slurm will set a number of shell environment variables that relate to the job itself and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":33,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"page-nosidebar.php","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"class_list":["post-38","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/38","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/comments?post=38"}],"version-history":[{"count":30,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/38\/revisions"}],"predecessor-version":[{"id":873,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/38\/revisions\/873"}],"up":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/33"}],"wp:attachment":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/media?parent=38"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}