Basic Article Information

Title: Provenance for Scientific Workflows Towards Reproducible Research
Authors: Roger Barga; Yogesh Simmhan; Eran Chinthaka et al.
Journal: Bulletin of the Technical Committee on Data Engineering
Publication year: 2010
Volume: 33
Issue: 03
Publisher: IEEE Computer Society
Abstract: Panda (for Provenance and Data) is a new project whose goal is to develop a general-purpose system that unifies concepts from existing provenance systems and overcomes some limitations in them. Panda is designed for “data-oriented workflows,” fully integrating data-based and process-based provenance. Panda’s provenance model will support a full range from fine-grained to coarse-grained provenance. Panda will provide a set of built-in operators for exploiting provenance after it has been captured, and an ad-hoc query language over provenance together with data. The processing nodes in Panda’s workflows can vary from well-understood relational transformations, to “semi-opaque” transformations with a few known properties, to fully-opaque “black boxes.” A theme in Panda is to take advantage of transformation knowledge when present, but to degrade gracefully when less information is available. Panda yields interesting optimization problems, including data caching decisions and eager vs. lazy provenance capture. This paper is largely an overview of motivation and plans for the project, with some material on current progress and results.