Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: The processes of the distributed system considered in this paper use loosely synchronized clocks. The paper describes a method by which such processes take checkpoints in a truly distributed manner, that is, without a global checkpoint coordinator. Each process takes checkpoints according to its own clock at predetermined checkpoint instants. To avoid the domino effect, these asynchronously taken checkpoints must form a globally consistent set. This is achieved by adding suitable information to the existing clock synchronization messages; using this information, the processes synchronize their checkpoints to form a globally consistent checkpoint. Communication in this system is synchronous, so a process may be blocked in a communication at a checkpoint instant. A blocked process saves the state it was in just before it was blocked. The paper shows that the set of i-th checkpoints obtained this way is consistent, so after a failure the system needs to roll back only to the last saved state.
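The consistency claim above can be made concrete with a small checker. This is a minimal sketch of the standard "no orphan message" condition for a global checkpoint, not the paper's protocol. It assumes that each process's history is a list of local events in order, that a checkpoint is the prefix of events completed before the local checkpoint instant, and that a process blocked at that instant has not yet completed the blocking communication. Under those assumptions, its state just before blocking equals its state at the instant. The names `Message`, `cut_at` and `is_consistent` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message exchange between two processes."""
    sender: int
    send_index: int   # position of the send event in the sender's local history
    receiver: int
    recv_index: int   # position of the receive event in the receiver's local history


def cut_at(local_event_times: list[float], instant: float) -> int:
    """Number of local events recorded by a checkpoint taken at `instant`.

    Assumes `local_event_times` are completion times on the process's own
    clock, in increasing order. A communication the process is blocked on at
    `instant` has not completed, so it is excluded. This matches saving the
    state from just before the process blocked.
    """
    return sum(1 for t in local_event_times if t < instant)


def is_consistent(cut: dict[int, int], messages: list[Message]) -> bool:
    """Check that no message is recorded as received but not as sent.

    cut[p] is the number of p's events included in p's checkpoint, so the
    event at index k is recorded iff k < cut[p].
    """
    for m in messages:
        received = m.recv_index < cut[m.receiver]
        sent = m.send_index < cut[m.sender]
        if received and not sent:
            return False  # orphan message: the global state is inconsistent
    return True


if __name__ == "__main__":
    # P0's first event sends a message that is P1's first event.
    msgs = [Message(sender=0, send_index=0, receiver=1, recv_index=0)]
    print(is_consistent({0: 1, 1: 1}, msgs))  # both recorded
    print(is_consistent({0: 0, 1: 1}, msgs))  # receive recorded, send not
```

In a synchronous (rendezvous) system the send and receive complete together, so a consistent set of i-th checkpoints records each exchange on both sides or on neither. The orphan check is the necessary half of that requirement.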