Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2018
Volume: 41
Issue: 1
Page: 74
Publisher: IEEE Computer Society
Abstract: Diagnosing problems in data centers has always been challenging due to their complexity and heterogeneity. Among recent proposals for addressing this challenge, one promising approach leverages provenance, which provides the fundamental functionality needed for fault diagnosis and debugging: a way to track direct and indirect causal relationships between system states and their changes. This information is valuable, since it permits system operators to tie observed symptoms of faults to their potential root causes. However, capturing provenance in a data center is challenging because, at high data rates, it would impose a substantial cost. In this paper, we introduce techniques that can help with this: we show how to reduce the cost of maintaining provenance by leveraging structural similarities for compression, and by offloading expensive but highly parallel operations to hardware. We also discuss our progress towards transforming provenance into compact, actionable diagnostic decisions to repair problems caused by misconfigurations and program bugs.