Article Information

  • Title: Exploiting Application-Level Correctness for Low-Cost Fault Tolerance
  • Authors: Xuanhua Li; Donald Yeung
  • Journal: The Journal of Instruction-Level Parallelism
  • Electronic ISSN: 1942-9525
  • Publication year: 2008
  • Volume: 10
  • Pages: 1-28
  • Publisher: International Symposium on Microarchitecture
  • Abstract: Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user's perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which correctness is evaluated, with more faults being benign at higher levels of abstraction, i.e., at the user or application level, compared to lower levels of abstraction, i.e., at the architecture level. The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations, e.g., SPECInt CPU2000. This paper investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when defining correctness at the application level compared to the architecture level.
Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level. We also present two lightweight fault recovery mechanisms, stack recovery and hard state recovery, that exploit the relaxed requirements of application-level correctness to reduce checkpoint cost. Stack recovery recovers 66.3% of crashes in soft computations with near-zero runtime overhead, and hard state recovery recovers 89.7% of crashes in soft computations with half the runtime overhead of conventional incremental checkpointing under application-level correctness.