Journal: IAENG International Journal of Computer Science
Print ISSN: 1819-656X
Electronic ISSN: 1819-9224
Publication year: 2019
Volume: 46
Issue: 1
Pages: 61-67
Publisher: IAENG - International Association of Engineers
Abstract: With the rapid growth of data size, clustering algorithms face great challenges in terms of efficiency, reliability, and scalability. Recently, many parallel algorithms using the MapReduce framework have been proposed to address the scalability problem caused by increasing data size. When massive data is clustered by the KMeans algorithm in parallel, it is read repeatedly in each iteration, which significantly increases both I/O and network costs. In this paper, we propose a new sampling-based KMeans clustering algorithm, named SKMeans, which effectively reduces the data size while improving clustering accuracy through representative verification. We then implement a parallelized SKMeans using MapReduce, named MR-SKMeans, on a Hadoop cluster and further explore its effect. The empirical performance of MR-SKMeans is compared with that of parallel KMeans and other algorithms that apply statistical sampling techniques. Our experimental results indicate that MR-SKMeans performs better in terms of efficiency, scalability, and accuracy.
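The general idea of sampling-based clustering mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' SKMeans or MR-SKMeans: the abstract does not describe the sampling scheme or the representative verification step, so this sketch assumes plain uniform random sampling and omits verification entirely. All function names and parameters (`sampled_kmeans`, `sample_fraction`, and so on) are hypothetical. Lloyd's iterations run only on the sample, and the full dataset is then read once to assign labels, which is the kind of saving in repeated reads that the abstract motivates.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm with randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def sampled_kmeans(points, k, sample_fraction=0.1, seed=0):
    """Cluster a uniform random sample, then label every point in one pass.

    Assumption: uniform sampling stands in for the paper's sampling scheme,
    and no representative verification is performed.
    """
    rng = random.Random(seed)
    sample_size = max(k, int(len(points) * sample_fraction))
    sample = rng.sample(points, sample_size)
    centroids = kmeans(sample, k, seed=seed)  # iterations touch only the sample
    labels = [nearest(p, centroids) for p in points]  # single pass over all data
    return centroids, labels
```

In a MapReduce setting, the final labeling pass would correspond to a single map over the full dataset, while the iterative part operates on the much smaller sample. How the paper actually distributes these steps is not stated in the abstract.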