Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Year of publication: 2012
Volume: 3
Issue: 3
Pages: 4558-4561
Publisher: TechScience Publications
Abstract: Modern technology has made it necessary to collect scientific data in large quantities, and these data are accumulating in many different databases. Systematic analysis is essential for extracting useful information from these rapidly growing repositories. Cluster analysis is one of the major data mining methods, and the k-means clustering algorithm is widely used in practical applications. However, the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends substantially on the choice of initial centroids. Fast, high-quality clustering is one of the most important tasks in the modern era of information processing, in which people rely heavily on search engines. Given the huge amount of available data and the aim of creating better-quality clusters, scores of algorithms with quality-complexity trade-offs have been proposed. Nevertheless, the k-means algorithm, proposed during the late 1970s, still holds a respectable position among clustering algorithms and is considered one of the most fundamental algorithms of data mining. It is an iterative algorithm: each iteration requires computing the distance between every data object and the centroid of every cluster. Given the size of modern databases, this step alone becomes expensive. This paper proposes an improvement to the classic k-means algorithm that produces more accurate clusters. The proposed algorithm comprises an O(n log n) heuristic, based on sorting and partitioning the input data, for finding initial centroids in accordance with the data distribution. Experimental results show that the proposed algorithm produces better clusters in less computation time.
Keywords: Clustering; k-means; time complexity; centroid; data sets.
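
The abstract describes the initialization only as an O(n log n) heuristic based on sorting and partitioning the input data. The following is a minimal sketch of that general idea, not the paper's actual procedure. The sort key (squared Euclidean norm), the equal-sized contiguous partitions, and the names `initial_centroids`, `nearest`, and `kmeans` are all assumptions made for illustration.

```python
def initial_centroids(points, k):
    """Pick k initial centroids by sorting and partitioning (illustrative only).

    Assumption: points are sorted by squared Euclidean norm. The abstract
    does not say which sort key the paper uses.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points, key=lambda p: sum(x * x for x in p))  # O(n log n)
    n = len(ordered)
    centroids = []
    for i in range(k):
        # Split the sorted list into k contiguous, nearly equal partitions.
        chunk = ordered[i * n // k:(i + 1) * n // k]
        # The mean of each partition becomes one initial centroid.
        centroids.append(tuple(sum(c) / len(chunk) for c in zip(*chunk)))
    return centroids


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def kmeans(points, k, max_iter=100):
    """Standard k-means iterations, seeded by initial_centroids."""
    centroids = initial_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: n * k distance computations per iteration.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, k)` with `points` as a list of equal-length numeric tuples returns the final centroids and one cluster label per point. Because the seeding is deterministic, repeated runs on the same data give the same result, unlike random initialization.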