Distributed Checkpointing: Analysis and Benchmarks

Gustavo M. D. Vieira, Luiz E. Buzato

This work proposes a metric for the analysis and benchmarking of checkpointing algorithms through simulation; the results show that the metric is a good indicator of checkpoint overhead. The metric is implemented in ChkSim, a simulator that has been used to compare 18 quasi-synchronous checkpointing algorithms. A survey of previous analyses of checkpointing shows our study to be the most comprehensive comparison carried out so far. ChkSim is easy to use and ensures that the algorithms are compared fairly by subjecting all of them to exactly the same simulation events. The information summarized here can be used to guide the construction of practical quasi-synchronous checkpoint-restart toolkits for modern clusters.

