{"id":926,"date":"2020-04-23T13:06:54","date_gmt":"2020-04-23T12:06:54","guid":{"rendered":"https:\/\/portal.supercomputing.wales\/?page_id=926"},"modified":"2020-05-15T18:25:43","modified_gmt":"2020-05-15T17:25:43","slug":"checkpointing-with-dmtcp","status":"publish","type":"page","link":"https:\/\/portal.supercomputing.wales\/index.php\/checkpointing-with-dmtcp\/","title":{"rendered":"Checkpointing with DMTCP"},"content":{"rendered":"<p>Download from <a href=\"https:\/\/github.com\/dmtcp\/dmtcp\/releases\/download\/2.6.0\/dmtcp-2.6.0.zip\">https:\/\/github.com\/dmtcp\/dmtcp\/releases\/download\/2.6.0\/dmtcp-2.6.0.zip<\/a><br \/>\n<pre class=\"preserve-code-formatting\">wget https:\/\/github.com\/dmtcp\/dmtcp\/releases\/download\/2.6.0\/dmtcp-2.6.0.zip\nunzip dmtcp-2.6.0.zip\ncd dmtcp-2.6.0\n.\/configure --prefix=$HOME\/.local\nmake\nmake install\nmake clean\n#if 32bit support is needed, most people won&#039;t need to do this and can stop at this point\n.\/configure --prefix=$HOME\/.local --enable-m32\nmake\nmake install<\/pre><\/p>\n<h2>Launch a program with dmtcp<\/h2>\n<p>dmtcp_launch -i &lt;interval seconds&gt; &lt;program name&gt;<\/p>\n<p>This will save checkpoint files every &lt;interval seconds&gt; seconds. These will be called something like ckpt_&lt;program name&gt;_21ab6b541c71fec-40000-1a25dc019d9573.dmtcp<\/p>\n<h2>Relaunch a program<\/h2>\n<p>A script for automatic restarting called dmtcp_restart_script.sh will also be created. 
This script checks that you are on the same host as before, so it isn&#8217;t well suited to HPC use.<\/p>\n<p>A program can also be restarted by running:<\/p>\n<p>dmtcp_restart -i &lt;interval seconds&gt; &lt;checkpoint file&gt;<\/p>\n<p>where &lt;checkpoint file&gt; is the .dmtcp file that was saved.<\/p>\n<p>This method allows relaunching on a different host.<\/p>\n<h2>Integrating into a Slurm script<\/h2>\n<p><pre class=\"preserve-code-formatting\">#!\/bin\/bash --login\n###\n#job name\n#SBATCH --job-name=count\n#job stdout file\n#SBATCH --output=count.out.%J\n#job stderr file\n#SBATCH --error=count.err.%J\n#maximum job time in D-HH:MM\n#SBATCH --time=0-00:01\n#maximum memory of 10 megabytes\n#SBATCH --mem-per-cpu=10\n#SBATCH --ntasks=1\n#SBATCH --nodes=1\n###\n\n#cleanup any old dmtcp files from previous jobs (-f avoids an error if none exist)\nrm -f *.dmtcp\nrm -f dmtcp_restart_script*.sh\n\n#launch the program count with dmtcp, checkpointing every 10 seconds\ndmtcp_launch -i 10 .\/count<\/pre><br \/>\nSubmit this with sbatch; after about one minute it will be terminated by Slurm.<\/p>\n<p>A separate relaunch script is required:<br \/>\n<pre class=\"preserve-code-formatting\">#!\/bin\/bash --login\n###\n#job name\n#SBATCH --job-name=count\n#job stdout file\n#SBATCH --output=count.out.%J\n#job stderr file\n#SBATCH --error=count.err.%J\n#maximum job time in D-HH:MM\n#SBATCH --time=0-00:01\n#maximum memory of 10 megabytes\n#SBATCH --mem-per-cpu=10\n#SBATCH --ntasks=1\n#SBATCH --nodes=1\n###\n\ndmtcp_restart -i 10 ckpt_*.dmtcp<\/pre><br \/>\nNow submit the relaunch script with sbatch and the program will run for a further minute.<\/p>\n<p>It can be relaunched repeatedly if needed.<\/p>\n<h2>Problems when processing continues after the last checkpoint<\/h2>\n<p>Some programs write data to output files between the last checkpoint and the moment the process is terminated by the Slurm time limit. When the process is relaunched there can be a discrepancy between what is written on disk and what&#8217;s in the memory of the program. 
To work around this, the checkpoint needs to be taken at the same moment the process exits.<\/p>\n<p>Change the dmtcp_launch command to:<br \/>\n<pre class=\"preserve-code-formatting\">dmtcp_coordinator --exit-after-ckpt --daemon\n\ndmtcp_launch -i 259190 .\/count<\/pre><br \/>\nChange the dmtcp_restart command to:<br \/>\n<pre class=\"preserve-code-formatting\">dmtcp_coordinator --exit-after-ckpt --daemon\n\ndmtcp_restart -i 259190 ckpt_*.dmtcp<\/pre><br \/>\nThis will take a checkpoint after 2 days, 23 hours, 59 minutes and 50 seconds (259190 seconds, which is 10 seconds short of three days) and then terminate the process. You can then restart the program, and the data on disk and in memory\/snapshot will be consistent.<\/p>\n<h2>Relaunching MPI applications<\/h2>\n","protected":false},"excerpt":{"rendered":"<p>Download from https:\/\/github.com\/dmtcp\/dmtcp\/releases\/download\/2.6.0\/dmtcp-2.6.0.zip wget https:\/\/github.com\/dmtcp\/dmtcp\/releases\/download\/2.6.0\/dmtcp-2.6.0.zip unzip dmtcp-2.6.0.zip cd dmtcp-2.6.0 .\/configure &#8211;prefix=$HOME\/.local make make install make clean #if 32bit support is needed, most people won&#039;t need to do this and can stop at this point .\/configure &#8211;prefix=$HOME\/.local &#8211;enable-m32 make make install Launch a program with dmtcp dmtcp_launch -i &lt;interval seconds&gt; &lt;program name&gt; This will save checkpoint 
[&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_lmt_disableupdate":"no","_lmt_disable":"","footnotes":""},"class_list":["post-926","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/926","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/comments?post=926"}],"version-history":[{"count":4,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/926\/revisions"}],"predecessor-version":[{"id":1160,"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/pages\/926\/revisions\/1160"}],"wp:attachment":[{"href":"https:\/\/portal.supercomputing.wales\/index.php\/wp-json\/wp\/v2\/media?parent=926"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}