Checkpointing with DMTCP
Download from https://github.com/dmtcp/dmtcp/releases/download/2.6.0/dmtcp-2.6.0.zip
wget https://github.com/dmtcp/dmtcp/releases/download/2.6.0/dmtcp-2.6.0.zip
unzip dmtcp-2.6.0.zip
cd dmtcp-2.6.0
./configure --prefix=$HOME/.local
make
make install
make clean
# If 32-bit support is needed (most people won't need this and can stop at this point):
./configure --prefix=$HOME/.local --enable-m32
make
make install
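Because the build installs under $HOME/.local, the DMTCP commands are only usable if the matching bin directory is on your PATH. The following is a minimal check, assuming that "make install" placed the binaries in $HOME/.local/bin; the helper name check_dmtcp_tools is made up for this example and is not part of DMTCP.

```shell
# Assumption: "make install" put the DMTCP binaries in $HOME/.local/bin.
export PATH="$HOME/.local/bin:$PATH"

# Hypothetical helper: report whether each DMTCP command used on this page
# can be found on PATH. Returns non-zero if any of them is missing.
check_dmtcp_tools() {
    missing=0
    for tool in dmtcp_launch dmtcp_restart dmtcp_coordinator; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_dmtcp_tools || echo "Add \$HOME/.local/bin to PATH in your shell startup file."
```

If any tool is reported as missing, the export line probably needs to go into your shell startup file (and into any Slurm script that calls DMTCP).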
Launch a program with dmtcp
dmtcp_launch -i <interval seconds> <program name>
This will save checkpoint files every <interval seconds> seconds. These will be called something like ckpt_<program name>_21ab6b541c71fec-40000-1a25dc019d9573.dmtcp
Relaunch a program
A script for automatic restarting, called dmtcp_restart_script.sh, will also be created. This script checks that it is running on the same host as before, so it is not well suited to HPC use, where a restarted job may be placed on a different node.
A program can also be restarted by running:
dmtcp_restart -i <interval seconds> <checkpoint file>
where <checkpoint file> is the .dmtcp file that was saved.
This method allows relaunching on a different host.
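If several checkpoint files have accumulated in a directory, one way to choose the file to restart from is by modification time. This is only a sketch: it assumes a single-process program (a multi-process job writes one checkpoint file per process, and those belong together), and the helper name latest_checkpoint is hypothetical.

```shell
# Hypothetical helper: print the most recently modified ckpt_*.dmtcp file
# in the given directory (default: current directory). Prints nothing if
# no checkpoint file exists.
latest_checkpoint() {
    dir="${1:-.}"
    # ls -t sorts newest first; errors from an unmatched glob are discarded.
    ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

ckpt=$(latest_checkpoint .)
if [ -n "$ckpt" ]; then
    echo "newest checkpoint: $ckpt"
else
    echo "no checkpoint files found"
fi
```

The printed path can then be used as the <checkpoint file> argument of the restart command.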
Integrating into a Slurm script
#!/bin/bash --login
###
#job name
#SBATCH --job-name=count
#job stdout file
#SBATCH --output=count.out.%J
#job stderr file
#SBATCH --error=count.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:01
#maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
#SBATCH --ntasks=1
#SBATCH --nodes=1
###
#clean up any old dmtcp files from previous jobs (-f: no error if none exist)
rm -f *.dmtcp
rm -f dmtcp_restart_script*.sh
#launch the program count with dmtcp
dmtcp_launch -i 10 ./count
Submit this with sbatch. After about one minute it will be terminated by Slurm.
A separate relaunch script is required:
#!/bin/bash --login
###
#job name
#SBATCH --job-name=count
#job stdout file
#SBATCH --output=count.out.%J
#job stderr file
#SBATCH --error=count.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:01
#maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
#SBATCH --ntasks=1
#SBATCH --nodes=1
###
dmtcp_restart -i 10 ckpt_*.dmtcp
Now submit the relaunch script with sbatch and the program will run for a further minute.
It can be subsequently relaunched again and again if needed.
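The launch and relaunch scripts differ only in their final command, so a single script could choose between the two. The sketch below assumes that any ckpt_*.dmtcp file in the working directory belongs to the current run, which means the cleanup rm lines would have to be dropped from such a script. The helper name choose_dmtcp_command is hypothetical, and it only prints the command rather than running it.

```shell
# Hypothetical helper: print the DMTCP command line that should be run.
# $1 = checkpoint interval in seconds
# $2 = program to start when no checkpoint exists yet
choose_dmtcp_command() {
    interval="$1"
    program="$2"
    # Reuse the positional parameters to hold the glob result. If nothing
    # matches, the pattern stays unexpanded and the -e test below fails.
    set -- ckpt_*.dmtcp
    if [ -e "$1" ]; then
        echo "dmtcp_restart -i $interval $*"
    else
        echo "dmtcp_launch -i $interval $program"
    fi
}

choose_dmtcp_command 10 ./count
```

Removing the echo from both branches would run the chosen command instead of printing it.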
Problems when processing continued after the last checkpoint
Some programs write data to output files between the last checkpoint and the moment the process is terminated by the Slurm time limit. When the process is relaunched, what is written on disk can then disagree with what is in the memory of the program. To work around this, the checkpoint needs to be taken at the same moment the process exits.
Change the dmtcp_launch command to:
dmtcp_coordinator --exit-after-ckpt --daemon
dmtcp_launch -i 259190 ./count
Change the dmtcp_restart command to:
dmtcp_coordinator --exit-after-ckpt --daemon
dmtcp_restart -i 259190 ckpt_*.dmtcp
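This will take a checkpoint after 2 days, 23 hours, 59 minutes and 50 seconds (259190 seconds) and then terminate the process. The interval should be slightly shorter than the job's Slurm time limit, so that the checkpoint completes before Slurm kills the job. You can then restart the program, and the data on disk and in the memory snapshot will be consistent.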
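The value 259190 is ten seconds short of three days (3 x 86400 = 259200), which suggests a Slurm time limit of 3-00:00. That limit is an assumption here, as is the ten-second margin, which may be too short for programs with large checkpoint files. A small sketch of the arithmetic for other limits, using a hypothetical helper named ckpt_interval:

```shell
# Hypothetical helper: convert a D-HH:MM time limit into seconds and
# subtract a safety margin, giving a value for the -i option.
# Arguments: days hours minutes margin_seconds
# Pass plain integers without leading zeros (write 8, not 08).
ckpt_interval() {
    days="$1"
    hours="$2"
    minutes="$3"
    margin="$4"
    echo $(( days * 86400 + hours * 3600 + minutes * 60 - margin ))
}

# Assumed 3-day limit with a 10-second margin.
ckpt_interval 3 0 0 10   # prints 259190
```

The result can be passed directly to the -i option of dmtcp_launch or dmtcp_restart.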