Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2017
Volume: 95
Issue: 10
Publisher: Journal of Theoretical and Applied
Abstract: Text documents are a major source of data, so it is important to keep them organized. Clustering is one way to organize data: it groups similar documents together. Although numerous clustering algorithms exist, there is still a need for more accurate ones. Moreover, most clustering algorithms rely on distance-based measures, which this work identifies as a cause of their limited accuracy. To overcome these issues, this work presents a double-layered text document clustering algorithm. The system is divided into four phases: document pre-processing, representation, clustering, and cluster labelling. The pre-processing phase prepares each document for the subsequent phases. The representation phase standardizes the structure of the document using the Document Index Graph (DIG) model. The documents are then clustered by cosine similarity, which forms a rough set of clusters. The second layer refines these clusters using ConceptNet, which works on the basis of common-sense reasoning. Finally, each cluster is labelled with its top-ranked key-phrase. The approach is tested on the BBCSport and 20 Newsgroups datasets and is reported to achieve better results in terms of F-measure, purity, and entropy.
Keywords: Document clustering; DIG model; Sense-based clustering; Distance-based clustering
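
The first clustering layer and the purity measure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a plain bag-of-words vector in place of the DIG model, a small hypothetical stop-word list in place of the pre-processing phase, and a greedy single-pass grouping with a hypothetical similarity threshold of 0.3. The ConceptNet refinement layer is not covered. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical stop-word list standing in for the pre-processing phase.
STOPWORDS = {"the", "a", "in", "by", "after", "of"}


def term_vector(text):
    """Bag-of-words term counts (an assumed stand-in for the DIG model)."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rough_clusters(docs, threshold=0.3):
    """Greedy single pass: a document joins the first cluster whose seed
    document is similar enough, otherwise it starts a new cluster.
    The threshold value is an assumption, not taken from the paper."""
    vectors = [term_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def purity(clusters, labels):
    """Fraction of documents that carry their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[i] for i in c).values()) for c in clusters
    )
    return majority / total


# Toy data, invented for illustration only.
docs = [
    "the striker scored a late goal in the football match",
    "the football match ended after a late goal",
    "the bowler took five wickets in the cricket test",
    "cricket test saved by the last wickets",
]
labels = ["football", "football", "cricket", "cricket"]

clusters = rough_clusters(docs)
print(clusters)                  # [[0, 1], [2, 3]]
print(purity(clusters, labels))  # 1.0
```

Without the stop-word filter, the shared words "the" and "in" alone push the similarity between the first and third toy documents above the threshold, which shows why a pre-processing phase comes before the similarity step.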