Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2013
Volume: 36
Issue: 1
Publisher: IEEE Computer Society
Abstract: Companies providing cloud-scale data services have increasing needs to store and analyze massive datasets. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Developers use high-level scripting languages that simplify understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is missing accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent usage of user-defined functions. In this paper we describe a technique to optimize a class of jobs that are recurring over time in a cloud-scale computation environment. By leveraging information gathered during previous executions we are able to obtain accurate statistics for new instances of recurring jobs, resulting in better execution plans. Experiments on a large-scale production system show that our techniques significantly improve cluster utilization.