Journal: International Journal of Grid and Distributed Computing
Print ISSN: 2005-4262
Year: 2013
Volume: 6
Issue: 3
Publisher: SERSC
Abstract: The MapReduce framework has been widely used to process and analyze large-scale datasets over large clusters. As an essential problem, the join operation among large clusters has attracted more and more attention in recent years due to the utilization of MapReduce. Many strategies have been proposed to improve the efficiency of distributed joins, among which the Bloom filter is a successful one. However, the Bloom filter's potential has not yet been fully exploited, especially in the MapReduce environment. In this paper, three strategies are presented to build the Bloom filter for large datasets using MapReduce. Based on these strategies, we design two algorithms for two-way join and one algorithm for multi-way join. The experimental results show that our algorithms can significantly improve the efficiency of current join algorithms. Moreover, cost models of these algorithms are characterized in order to find ways of improving the performance of two-way and multi-way joins.
Keywords: Bloom filter; MapReduce; Query optimization; Cost model
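
Illustrative note: the abstract does not describe the paper's three construction strategies or its join algorithms, so the following is only a minimal single-process sketch of the general idea behind a Bloom-filter-assisted two-way join, not the authors' method. A filter is built over the join keys of the smaller relation and used to discard non-matching records of the larger relation before the actual join (in MapReduce, this filtering would typically happen before the shuffle). The names `BloomFilter` and `bloom_filtered_join`, the bit-array size, and the SHA-256-based hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_filtered_join(small, large, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs on key.

    Returns (joined_rows, survivor_count), where survivor_count is the
    number of records of `large` that passed the Bloom filter.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    index = {}
    for key, value in small:
        bloom.add(key)
        index.setdefault(key, []).append(value)

    # Filtering step: drop records whose key is definitely not in `small`.
    survivors = [(key, value) for key, value in large if key in bloom]

    # Exact join on the survivors; false positives are eliminated here.
    joined = [
        (key, small_value, large_value)
        for key, large_value in survivors
        for small_value in index.get(key, [])
    ]
    return joined, len(survivors)
```

Because a Bloom filter can report false positives but never false negatives, the filtering step may let a few non-matching records through, yet it never drops a record that belongs in the result; the exact join afterwards removes any false positives.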