Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Year of publication: 2011
Volume: 2
Issue: 4
Pages: 1820-1824
Publisher: TechScience Publications
Abstract: Clustering analysis is the task of partitioning a set of objects O = {O1, …, On} into C self-similar subsets based on the available data. In general, clustering unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek; 2) partitioning the data into C meaningful groups; and 3) validating the C clusters that are discovered. All clustering algorithms ultimately rely on one or more human inputs, and the most important input is the number of clusters (C) to seek. Many pre- and post-clustering methods relieve the user of this choice, but they ultimately make it by thresholding some value in the code. The choice of C is thus transferred to the equivalent choice of a hidden threshold that determines C "automatically". In contrast, tendency assessment attempts to estimate C before clustering occurs. Here, we represent the structure of an unlabeled data set as a Reordered Dissimilarity Image (RDI), in which the pairwise dissimilarity information about a data set of n objects is displayed as an n × n image. The RDI is generated using VAT (Visual Assessment of Cluster Tendency), which highlights potential clusters as a set of "dark blocks" along the diagonal of the image, so that the number of clusters can be estimated from the number of dark blocks along the diagonal. We develop a new method called "Extended Cluster Count Extraction" (ECCE) for counting the number of clusters formed along the diagonal of the RDI.
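To make the RDI idea concrete, here is a minimal Python sketch of a VAT-style reordering. It assumes the commonly described Prim-like ordering: start from an object involved in the largest dissimilarity, then repeatedly append the unselected object closest to any already selected object. The function names `vat_order` and `reordered_dissimilarity_image` and the toy data are illustrative assumptions, not taken from the paper. The sketch does not implement ECCE, whose details the abstract does not give.

```python
import numpy as np


def vat_order(D):
    """Return a VAT-style ordering of objects for a square dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from an object that takes part in the largest dissimilarity.
    first = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [first]
    selected = np.zeros(n, dtype=bool)
    selected[first] = True
    # nearest[j] = smallest dissimilarity between j and any selected object.
    nearest = D[first].copy()
    for _ in range(n - 1):
        candidates = np.flatnonzero(~selected)
        # Append the unselected object closest to the selected set.
        j = int(candidates[np.argmin(nearest[candidates])])
        order.append(j)
        selected[j] = True
        nearest = np.minimum(nearest, D[j])
    return np.array(order)


def reordered_dissimilarity_image(D):
    """Permute rows and columns of D by the VAT-style ordering."""
    D = np.asarray(D, dtype=float)
    order = vat_order(D)
    return D[np.ix_(order, order)], order


# Hypothetical toy data: two interleaved 1-D groups, near 0 and near 100.
points = np.array([0.0, 100.0, 1.0, 101.0, 2.0, 102.0])
D = np.abs(points[:, None] - points[None, :])
rdi, order = reordered_dissimilarity_image(D)
print(order)
print(rdi)
```

On this toy input the ordering places the three objects near 0 before the three near 100, so the reordered matrix shows two low-dissimilarity blocks on the diagonal. Rendering `rdi` as a grayscale image, with small values dark, would display these as the two "dark blocks" the abstract refers to.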