Article information

  • Title: Resilient Scheduling Heuristics for Rigid Parallel Jobs
  • Authors: Anne Benoit; Valentin Le Fèvre; Padma Raghavan
  • Journal: International Journal of Networking and Computing
  • Print ISSN: 2185-2847
  • Year of publication: 2021
  • Volume: 11
  • Issue: 1
  • Pages: 2-26
  • Publisher: International Journal of Networking and Computing
  • Abstract: This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to failures caused by transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in this framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when schedules are required to use shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations using both synthetic jobs and log traces from the Mira supercomputer.
  • Keywords: Resilience; scheduling; rigid parallel jobs; silent errors; list schedules; shelf schedules; approximation algorithms
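
The list-scheduling idea named in the abstract (longest jobs first, with a failed job re-executed until it succeeds) can be illustrated with a small simulation. This is a minimal sketch, not the authors' heuristics: the names `Job` and `list_schedule` are hypothetical, and the failure model is an assumption made here for illustration (each attempt fails independently with probability `fail_prob`, and the failure is only detected when the attempt ends).

```python
import heapq
import random
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int      # processors required (rigid: fixed for every attempt)
    length: float   # duration of one execution attempt


def list_schedule(jobs, total_procs, fail_prob=0.0, seed=0):
    """Greedy list scheduling, longest job first, re-executing failed jobs.

    Assumed failure model (illustrative only): each attempt fails
    independently with probability fail_prob, detected at its end.
    Returns (makespan, log), log entries being (start, end, name, ok).
    """
    if not 0.0 <= fail_prob < 1.0:
        raise ValueError("fail_prob must be in [0, 1)")
    if any(j.procs > total_procs for j in jobs):
        raise ValueError("a job needs more processors than the platform has")

    rng = random.Random(seed)
    pending = sorted(jobs, key=lambda j: -j.length)  # priority list
    running = []   # min-heap of (end_time, tie_breaker, job, start_time)
    free = total_procs
    now = 0.0
    tie = 0
    log = []

    while pending or running:
        # Start every pending job that fits, scanning in priority order.
        waiting = []
        for job in pending:
            if job.procs <= free:
                free -= job.procs
                heapq.heappush(running, (now + job.length, tie, job, now))
                tie += 1
            else:
                waiting.append(job)
        pending = waiting

        # Advance time to the next attempt that finishes.
        end, _, job, start = heapq.heappop(running)
        now = end
        free += job.procs
        ok = rng.random() >= fail_prob
        log.append((start, end, job.name, ok))
        if not ok:
            # Failed attempt: put the job back in the priority list.
            pending.append(job)
            pending.sort(key=lambda j: -j.length)

    return now, log


if __name__ == "__main__":
    demo = [Job("A", 2, 4.0), Job("B", 2, 3.0), Job("C", 1, 2.0), Job("D", 3, 1.0)]
    makespan, _ = list_schedule(demo, total_procs=4)
    print(makespan)  # failure-free run on 4 processors: 5.0
```

In the failure-free demo, A and B start at time 0, C starts when B ends at time 3, and D starts when A ends at time 4, so the last job finishes at time 5. Passing a nonzero `fail_prob` makes some attempts repeat; the resulting makespan then depends on the random seed.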