Article Information

Title: Managing Skew in Hadoop
Authors: YongChul Kwon; Kai Ren; Magdalena Balazinska; et al.
Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2013
Volume: 36
Issue: 1
Publisher: IEEE Computer Society
Abstract: Challenges in Big Data analytics stem not only from volume, but also variety: extreme diversity in both data types (e.g., text, images, and graphs) and in operations beyond relational algebra (e.g., machine learning, natural language processing, image processing, and graph analysis). As a result, any competitive Big Data system must support some form of parallel user-defined operations (UDOs) that can capture complex data processing tasks over complex data types without changing the core of the parallel data processing engine. Hadoop and other popular systems have been shown to provide a convenient programming model for implementing parallel UDOs, but the "black-box" nature of UDOs complicates the automatic load balancing required to achieve parallel scalability. In this paper, we present an overview of some of our recent work that tackles the problem of load imbalance (a.k.a. skew) in parallel UDO evaluation. We first discuss the prevalence of skew in today's applications and clusters. We then discuss our experience with static and dynamic methods for mitigating it.