Article Information

  • Title: An Enhanced and Efficient Clustering Algorithm for Large Data Using MapReduce
  • Authors: Hongbiao Li; Ruiying Liu; Jingdong Wang
  • Journal: IAENG International Journal of Computer Science
  • Print ISSN: 1819-656X
  • Electronic ISSN: 1819-9224
  • Publication year: 2019
  • Volume: 46
  • Issue: 1
  • Pages: 61-67
  • Publisher: IAENG - International Association of Engineers
  • Abstract: As data volumes grow rapidly, clustering algorithms face major challenges in efficiency, reliability, and scalability. Recently, many parallel algorithms built on the MapReduce framework have been proposed to address the scalability problem caused by increasing data size. When massive data is clustered in parallel with the KMeans algorithm, the data is read repeatedly in every iteration, which significantly increases both I/O and network costs. In this paper, we propose a new sampling-based KMeans clustering algorithm, named SKMeans, which effectively reduces the data size while improving clustering accuracy through representative verification. We then implement a parallelized SKMeans using MapReduce, named MR-SKMeans, on a Hadoop cluster and further explore its effect. The empirical performance of MR-SKMeans is compared with that of parallel KMeans and of other algorithms that apply statistical sampling techniques. Our experimental results indicate that MR-SKMeans performs better in terms of efficiency, scalability, and accuracy.
  • Keywords: sampling; representative verification; KMeans; distributed computing
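
The abstract's two central ideas, that each parallel KMeans iteration rescans the full data set and that clustering a sample first reduces this cost, can be illustrated with a small sketch. This is not the authors' SKMeans or MR-SKMeans: the abstract does not specify the sampling scheme or the representative verification step, so the code below assumes plain uniform random sampling and omits verification entirely. All function names (`map_phase`, `reduce_phase`, `sampled_kmeans`, and so on) are hypothetical, and the map and reduce steps are ordinary in-memory Python functions rather than Hadoop jobs.

```python
import random
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points assigned to each cluster."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style KMeans; every iteration scans all of `points`."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break  # assignments are stable, so further passes change nothing
        centroids = new_centroids
    return centroids


def sampled_kmeans(points, k, sample_size, seed=0):
    """Assumed simplification: cluster a uniform random sample, then label
    the full data in a single pass (no representative verification)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

In this sketch `points` is a list of equal-length numeric tuples. The iterative loop touches only the sample, so the full data set is read once for the final labelling rather than once per iteration, which is the kind of I/O saving the abstract attributes to sampling.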