Journal title: International Journal of Networking and Computing
Print ISSN: 2185-2847
Publication year: 2019
Volume: 9
Issue: 1
Pages: 28-52
Publisher: International Journal of Networking and Computing
Abstract: Input/output (I/O) from various sources often contends for scarce bandwidth. For
example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone
environments. However, CR I/O alongside an application’s normal, requisite I/O can
increase I/O contention and might negatively impact performance. In this work, we consider
different aspects (system-level scheduling policies and hardware) that optimize the overall performance
of concurrently executing CR-based applications that share I/O resources. We provide
a theoretical model and derive a set of necessary constraints to minimize the global waste on a
given platform. Our results demonstrate that Young/Daly’s optimal checkpoint interval, despite
providing a sensible metric for a single, undisturbed application, is not sufficient to optimally
address resource contention at scale. We show that by combining optimal checkpointing periods
with contention-aware system-level I/O scheduling strategies, we can significantly improve overall
application performance and maximize the platform throughput. Finally, we evaluate how
specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem.
Overall, these results provide critical analysis and direct guidance on how to design efficient,
CR-ready, large-scale platforms without a large investment in the I/O subsystem.
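
For readers unfamiliar with the Young/Daly interval mentioned in the abstract, its commonly cited first-order form sets the checkpointing period to sqrt(2 · C · μ), where C is the time to write one checkpoint and μ is the platform's mean time between failures (MTBF). The sketch below is not taken from the paper: the function names, the first-order waste approximation C/T + T/(2μ), and the sample values are illustrative assumptions, and the model covers only a single application with no I/O contention.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period, in seconds.

    Assumes a single, undisturbed application (no I/O contention).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpointing and re-execution.

    First-order model (an assumption of this sketch): C/T for checkpoint
    overhead plus T/(2*mu) for work lost on failure.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


# Hypothetical values: 50 s checkpoint, 10,000 s platform MTBF.
period = young_daly_period(50.0, 10_000.0)
waste = first_order_waste(period, 50.0, 10_000.0)
```

With these hypothetical inputs the period is 1000 s and the first-order waste is about 10%. The abstract's point is that this single-application optimum stops being sufficient once several applications contend for the same I/O bandwidth.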