{"id":830,"date":"2018-09-06T11:03:08","date_gmt":"2018-09-06T10:03:08","guid":{"rendered":"https:\/\/portal.supercomputing.wales\/?page_id=830"},"modified":"2021-11-17T18:44:22","modified_gmt":"2021-11-17T18:44:22","slug":"using-gpus","status":"publish","type":"page","link":"https:\/\/portal.supercomputing.wales\/index.php\/using-gpus\/","title":{"rendered":"Using GPUs"},"content":{"rendered":"<h3>GPU Usage<\/h3>\n<p>Slurm controls access to the GPUs on a node such that access is only granted when the resource is requested specifically.&nbsp; Slurm models GPUs as a Generic Resource (GRES), which is requested at job submission time via the following additional directive:<\/p>\n<pre>#SBATCH --gres=gpu:2<\/pre>\n<p>This directive instructs Slurm to allocate two GPUs per allocated node, to place the job only on nodes that have GPUs, and to grant the job access to them.&nbsp; SCW GPU nodes have two GPUs each.<\/p>\n<p>Jobs must also be submitted to the desired GPU-enabled nodes queue:<\/p>\n<pre>#SBATCH -p gpu # to request P100 GPUs<\/pre>\n<p>Or<\/p>\n<pre>#SBATCH -p gpu_v100 # to request V100 GPUs<\/pre>\n<p>It is then possible to use CUDA-enabled applications or the CUDA toolkit modules themselves, for example:<\/p>\n<pre>module load CUDA\/9.1<\/pre>\n<pre>module load gromacs\/2018.2-single-gpu<\/pre>\n<h3>CUDA Versions &amp; Hardware Differences<\/h3>\n<p>Multiple versions of the CUDA libraries are installed on SCW systems, as can always be seen by:<\/p>\n<pre>[b.iss03c@cl1 ~]$ module avail CUDA\n---- \/apps\/modules\/libraries ----\nCUDA\/10.0 CUDA\/10.1 CUDA\/11.2 CUDA\/11.3 CUDA\/11.4 CUDA\/8.0 CUDA\/9.0 CUDA\/9.1 CUDA\/9.2<\/pre>\n<p>The GPU nodes always run the latest nVidia driver to support the latest installed version of CUDA, and also offer backwards-compatibility with prior versions.<\/p>\n<p>However, <em>Pascal<\/em> generation nVidia Tesla cards (present in <strong>Hawk<\/strong>) are supported in all installed versions of CUDA, but&nbsp;<em>Volta<\/em> generation 
nVidia Tesla cards (present in <strong>Hawk<\/strong> and <strong>Sunbird<\/strong>) are only supported in CUDA 9+.&nbsp; Codes that require CUDA 8, such as Amber 16, will not run on the <em>Volta<\/em> cards.<\/p>\n<p>Some important differences between <em>Pascal<\/em> and <em>Volta<\/em> nVidia Tesla cards:<\/p>\n\n<table id=\"tablepress-25\" class=\"tablepress tablepress-id-25\">\n<thead>\n<tr class=\"row-1\">\n\t<th class=\"column-1\">Characteristic<\/th><th class=\"column-2\">Volta<\/th><th class=\"column-3\">Pascal<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2\">\n\t<td class=\"column-1\">Tensor Cores<\/td><td class=\"column-2\">640<\/td><td class=\"column-3\">0<\/td>\n<\/tr>\n<tr class=\"row-3\">\n\t<td class=\"column-1\">CUDA Cores<\/td><td class=\"column-2\">5120<\/td><td class=\"column-3\">3584<\/td>\n<\/tr>\n<tr class=\"row-4\">\n\t<td class=\"column-1\">Memory (GB)<\/td><td class=\"column-2\">16<\/td><td class=\"column-3\">16<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-25 from cache -->\n<p>Tensor cores are a new type of programmable core, exclusive to GPUs based on the Volta architecture, that runs alongside the standard CUDA cores. Tensor cores can accelerate mixed-precision matrix multiply and accumulate calculations in a single operation. 
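<\/p>\n<p>As a purely illustrative sketch (not an SCW-supplied code), the fused operation a tensor core performs is D = A &times; B + C on small matrix tiles, with FP16 inputs and (optionally) FP32 accumulation. In CUDA this is exposed through the WMMA warp matrix API, which requires a <em>Volta<\/em> or later card (compute capability 7.0+):<\/p>\n<pre>\/\/ Hypothetical example using the CUDA WMMA API (CUDA 9+, compile with -arch=sm_70).\n\/\/ One warp computes a 16x16x16 tile: C (FP32) = A (FP16) x B (FP16) + C.\n#include &lt;cuda_fp16.h&gt;\n#include &lt;mma.h&gt;\nusing namespace nvcuda;\n\n__global__ void wmma_tile(const half *a, const half *b, float *c) {\n    wmma::fragment&lt;wmma::matrix_a, 16, 16, 16, half, wmma::row_major&gt; a_frag;\n    wmma::fragment&lt;wmma::matrix_b, 16, 16, 16, half, wmma::col_major&gt; b_frag;\n    wmma::fragment&lt;wmma::accumulator, 16, 16, 16, float&gt; c_frag;\n    wmma::load_matrix_sync(a_frag, a, 16);                      \/\/ load A tile\n    wmma::load_matrix_sync(b_frag, b, 16);                      \/\/ load B tile\n    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major); \/\/ load C tile\n    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             \/\/ D = A*B + C in one op\n    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);\n}<\/pre>\n<p>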
This capability is especially significant for AI\/DL\/ML applications that rely on large matrix operations.<\/p>\n<h3>GPU Compute Modes<\/h3>\n<p>nVidia GPU cards can be operated in a number of Compute Modes.&nbsp; In short, the difference is whether multiple processes (and, theoretically, users) can access (share) a GPU or whether a GPU is exclusively bound to a single process.&nbsp; It is typically application-specific whether one or the other mode is needed, so please pay particular attention to example job scripts.&nbsp; GPUs on SCW systems default to &#8216;shared&#8217; mode.<\/p>\n<p>Users are able to set the Compute Mode of GPUs allocated to their job through a pair of helper scripts that should be called in a job script in the following manner:<\/p>\n<p>To set exclusive mode:<\/p>\n<pre>clush -w $SLURM_NODELIST \"sudo \/apps\/slurm\/gpuset_3_exclusive\"<\/pre>\n<p>And to set shared mode (although this is the default at the start of any job):<\/p>\n<pre>clush -w $SLURM_NODELIST \"sudo \/apps\/slurm\/gpuset_0_shared\"<\/pre>\n<p>To query the Compute Mode:<\/p>\n<pre>clush -w $SLURM_NODELIST \"nvidia-smi -q|grep Compute\"<\/pre>\n<p>In all cases above, sensible output will appear in the job output file.<\/p>\n<p>Additionally, as Slurm models the GPUs as a consumable resource that must be requested in their own right (i.e. 
not implicitly with processor\/node count), the scheduler will by default <em>not<\/em> allocate the same GPU to multiple users or jobs &#8211; it would take some manual work to achieve this.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>GPU Usage Slurm controls access to the GPUs on a node such that access is only granted when the resource is requested specifically.&nbsp; Slurm models GPUs as a Generic Resource (GRES), which is requested at job submission time via the following additional directive: #SBATCH &#8211;gres=gpu:2 This directive requires Slurm to allocate two GPUs per allocated [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_lmt_disableupdate":"no","_lmt_disable":"","footnotes":""},"class_list":["post-830","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/830","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/comments?post=830"}],"version-history":[{"count":10,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/830\/revisions"}],"predecessor-version":[{"id":1308,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/830\/revisions\/1308"}],"wp:attachment":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/media?parent=830"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}