Basic Article Information

Title: Optimizing Checkpoint Restart with Data Deduplication
Authors: Zhengyu Chen; Jianhua Sun; Hao Chen et al.
Journal: Scientific Programming
Print ISSN: 1058-9244
Publication year: 2016
Volume: 2016
DOI: 10.1155/2016/9315493
Publisher: Hindawi Publishing Corporation
Abstract: As computer systems grow in size and complexity, hardware and software faults occur more frequently, so fault-tolerance techniques have become an essential component of high-performance computing systems. Checkpoint restart is a typical and widely used method for tolerating runtime faults. However, the exploding sizes of checkpoint files that must be saved to external storage pose a major scalability challenge, necessitating efficient approaches to reducing the amount of checkpoint data. In this paper, we first motivate the need for redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on this analysis, we apply inline data deduplication to reduce checkpoint size. We use DMTCP, an open-source checkpoint restart package, to validate our method. Our experiments show that our method reduces checkpoint file size by 20% for single-computer programs and by 47% for distributed programs.
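
Illustrative note: the abstract does not describe how the deduplication is implemented, so the following is only a minimal sketch of the general idea of inline deduplication, not the authors' method and not DMTCP's API. It assumes fixed-size 4096-byte chunks, SHA-256 fingerprints, and an in-memory dictionary as the chunk store; the names `CHUNK_SIZE`, `dedup_write`, and `dedup_read` are hypothetical.

```python
import hashlib

# Assumption: fixed-size chunks of one 4 KiB page; the paper's chunking may differ.
CHUNK_SIZE = 4096


def dedup_write(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns the recipe: the ordered list of chunk digests needed to
    rebuild the original data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        # Inline deduplication: consult the index before writing the chunk,
        # so a duplicate chunk is never stored a second time.
        if digest not in store:
            store[digest] = chunk
        recipe.append(digest)
    return recipe


def dedup_read(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Synthetic stand-in for a checkpoint image: six identical zero pages
    # followed by two distinct pages.
    image = bytes(CHUNK_SIZE) * 6 + b"\x01" * CHUNK_SIZE + b"\x02" * CHUNK_SIZE
    store = {}
    recipe = dedup_write(image, store)
    stored = sum(len(chunk) for chunk in store.values())
    assert dedup_read(recipe, store) == image
    print(f"original: {len(image)} bytes, stored: {stored} bytes")
```

In this sketch the savings come entirely from repeated chunks within the synthetic image; the 20% and 47% reductions reported in the abstract come from the authors' own experiments and are not reproduced here.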