Journal: International Journal of Computer Technology and Applications
Electronic ISSN: 2229-6093
Publication year: 2011
Volume: 2
Issue: 5
Pages: 1197-1200
Publisher: Technopark Publications
Abstract: Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and the minimization of I/O costs. Clustering of categorical attributes is a difficult problem that has not received as much attention as its numerical counterpart. In this paper we explore the connection between clustering and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design a heuristic algorithm that is capable of efficiently clustering large data sets of records with categorical attributes. In contrast with other categorical clustering algorithms published in the past, its clustering results are very stable for different sample sizes and parameter settings. The clustering criterion is also very intuitive, since it is deeply rooted in the well-known notion of entropy.
Keywords: Data mining; categorical clustering; data labeling.
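
The abstract's central claim, that clusters of similar points have lower entropy than clusters of dissimilar ones, can be illustrated with a small sketch. This is not the paper's heuristic algorithm, which the record does not describe. It only scores a given clustering, and it assumes that a cluster's entropy is the sum of the Shannon entropies of its attributes (treating attributes as independent) and that clusters are weighted by their size. The function names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    # sum of p * log2(1/p) over the observed categories
    return sum((c / n) * log2(n / c) for c in counts.values())


def cluster_entropy(records):
    """Entropy of a cluster: sum of per-attribute entropies.

    Assumption: attributes are treated as independent.
    """
    if not records:
        return 0.0
    columns = zip(*records)  # transpose records into attribute columns
    return sum(attribute_entropy(list(col)) for col in columns)


def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Toy records with two categorical attributes: (color, size).
similar = [
    [("red", "small"), ("red", "small")],
    [("blue", "large"), ("blue", "large")],
]
mixed = [
    [("red", "small"), ("blue", "large")],
    [("red", "small"), ("blue", "large")],
]

print(expected_entropy(similar))  # grouping identical records together
print(expected_entropy(mixed))    # grouping dissimilar records together
```

Under these assumptions, the grouping that places identical records together scores lower than the grouping that mixes them, which is the property an entropy-based clustering criterion seeks to minimize.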