Checkpointing with DMTCP

Download from https://github.com/dmtcp/dmtcp/releases/download/2.6.0/dmtcp-2.6.0.zip

wget https://github.com/dmtcp/dmtcp/releases/download/2.6.0/dmtcp-2.6.0.zip
unzip dmtcp-2.6.0.zip
cd dmtcp-2.6.0
./configure --prefix=$HOME/.local
make
make install
make clean
# Optional: rebuild with 32-bit support. Most people will not need this and can stop before this step.
./configure --prefix=$HOME/.local --enable-m32
make
make install
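
After installing, it can be worth confirming that the DMTCP commands are visible to the shell. The following is a minimal sketch; missing_commands is a hypothetical helper name, not part of DMTCP.

```shell
# Print the name of each given command that is not found on PATH.
# Returns non-zero if any command is missing.
# (missing_commands is an illustrative helper, not a DMTCP tool.)
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            status=1
        fi
    done
    return "$status"
}
```

For example, missing_commands dmtcp_launch dmtcp_restart dmtcp_coordinator prints nothing when all three are found. If any are reported, the likely cause with the prefix used above is that $HOME/.local/bin is not on PATH.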

Launch a program with dmtcp

dmtcp_launch -i <interval seconds> <program name>

This writes a checkpoint every <interval seconds> seconds. The checkpoint files have names of the form ckpt_<program name>_21ab6b541c71fec-40000-1a25dc019d9573.dmtcp
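
Before attempting a restart it can help to check that checkpoint files for the program actually exist. The sketch below relies only on the ckpt_<program name>_ naming pattern described above; checkpoints_for is a hypothetical helper name.

```shell
# List the checkpoint files for a given program name in a directory
# (default: current directory). Returns non-zero if none are found.
# (checkpoints_for is an illustrative helper, not a DMTCP tool.)
checkpoints_for() {
    prog="$1"
    dir="${2:-.}"
    found=1
    for f in "$dir"/ckpt_"$prog"_*.dmtcp; do
        # An unmatched glob is left as literal text, so test for existence.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

For example, checkpoints_for count lists the checkpoint files written for a program called count, and fails if there are none.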

Relaunch a program

A script for automatic restarting, called dmtcp_restart_script.sh, will also be created. This script checks that you are on the same host as before, so it isn’t well suited to HPC use.

A program can also be restarted by running:

dmtcp_restart -i <interval seconds> <checkpoint file>

where <checkpoint file> is the .dmtcp file that was saved.

This method allows relaunching on a different host.

Integrating into a Slurm script

#!/bin/bash --login
###
#job name
#SBATCH --job-name=count
#job stdout file
#SBATCH --output=count.out.%J
#job stderr file
#SBATCH --error=count.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:01
#maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
#SBATCH --ntasks=1
#SBATCH --nodes=1
###

#cleanup any old dmtcp files from previous jobs
rm -f *.dmtcp
rm -f dmtcp_restart_script*.sh

#launch the program count with dmtcp
dmtcp_launch -i 10 ./count

Submit this script with sbatch. After about one minute, Slurm will terminate the job.

A separate relaunch script is required:

#!/bin/bash --login
###
#job name
#SBATCH --job-name=count
#job stdout file
#SBATCH --output=count.out.%J
#job stderr file
#SBATCH --error=count.err.%J
#maximum job time in D-HH:MM
#SBATCH --time=0-00:01
#maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
#SBATCH --ntasks=1
#SBATCH --nodes=1
###

dmtcp_restart -i 10 ckpt_*.dmtcp

Now submit the relaunch script with sbatch. The program will resume from the last checkpoint and run for a further minute.

It can be subsequently relaunched again and again if needed.
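
If keeping two scripts is inconvenient, one possible approach is a single script that launches on the first run and restarts on later runs, depending on whether checkpoint files are present. The following is a sketch only; next_dmtcp_command is a hypothetical helper that prints the command it would choose rather than running it.

```shell
# Print the DMTCP command appropriate for this run:
# dmtcp_restart if checkpoint files exist in the current directory,
# otherwise dmtcp_launch for a fresh start.
# (next_dmtcp_command is an illustrative helper, not a DMTCP tool.)
next_dmtcp_command() {
    interval="$1"
    program="$2"
    ckpts=""
    for f in ckpt_*.dmtcp; do
        # An unmatched glob is left as literal text, so test for existence.
        if [ -e "$f" ]; then
            ckpts="$ckpts $f"
        fi
    done
    if [ -n "$ckpts" ]; then
        echo "dmtcp_restart -i $interval$ckpts"
    else
        echo "dmtcp_launch -i $interval $program"
    fi
}
```

A script built this way must not delete old .dmtcp files at the start, since those files are what it uses to decide between launching and restarting.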

Problems when processing continued after the last checkpoint

Some programs write to output files between the last checkpoint and the moment Slurm terminates the job at its time limit. When the program is restarted, what is on disk can then disagree with what’s in the memory of the program. To work around this, the checkpoint needs to be taken at the same moment the process exits.

Change the dmtcp_launch command to:

dmtcp_coordinator --exit-after-ckpt --daemon

dmtcp_launch -i 259190 ./count

Change the dmtcp_restart command to:

dmtcp_coordinator --exit-after-ckpt --daemon

dmtcp_restart -i 259190 ckpt_*.dmtcp

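This will take a checkpoint after 2 days, 23 hours, 59 minutes and 50 seconds (259190 seconds) and then terminate the process. The interval should fall just short of the job’s Slurm time limit, so the --time setting needs to be slightly longer than the interval (for example, 3 days). You can then restart the program, and the data on disk and in the memory snapshot will be consistent.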
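
The value 259190 appears to be a three-day limit less ten seconds. For other time limits the same arithmetic can be done in the shell. This is a minimal sketch; ckpt_interval is a hypothetical helper, and the default 10-second margin is an assumption taken from the example above rather than a DMTCP recommendation.

```shell
# Convert a time limit given as DAYS HOURS MINUTES into a checkpoint
# interval in seconds, leaving a safety margin (default 10 seconds).
# Arguments must be plain integers without leading zeros.
# (ckpt_interval is an illustrative helper, not a DMTCP tool.)
ckpt_interval() {
    days="$1"
    hours="$2"
    minutes="$3"
    margin="${4:-10}"
    echo $(( days * 86400 + hours * 3600 + minutes * 60 - margin ))
}
```

For example, ckpt_interval 3 0 0 gives 259190, which can then be passed to the -i option.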

Relaunching MPI applications