Article Information

Title: Hybrid Distance Based Document Clustering with Keyword and Phrase Indexing
Authors: Subhadra Kompella; M. Shashi
Journal: International Journal of Computer Science Issues
Print ISSN: 1694-0784
Online ISSN: 1694-0814
Year of publication: 2012
Volume: 9
Issue: 2
Publisher: IJCSI Press
Abstract: Document clustering algorithms group a set of documents into subsets or clusters. Several applications of clustering exist in information retrieval. Our proposed method uses the Scatter-Gather approach to cluster groups of documents from an entire collection. The selected groups are merged and the resulting set is clustered again. This process is repeated until a cluster of interest is found. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The clustering approach uses the popular cosine similarity measure combined with the Euclidean distance measure. To the best of our knowledge, much work has been carried out on keyword-based clustering and phrase-index-based clustering. Our method attempts to combine the two. The method has been applied to the standard NewsGroup-20 dataset, whose documents are distributed over 20 different topics. Results have been verified with a fixed number of clusters and different corpora, and with a variable number of clusters for a fixed corpus. Both sets of results indicate a steady increase in the overall purity of clustering compared to the keyword-based clustering method. With keyword-based clustering, purity was seen to increase with an increasing number of clusters for a fixed corpus, but to decrease with a fixed number of clusters and an increase in the number of corpora. In our method, the increase in purity was more pronounced with an increase in the number of clusters.
Keywords: Document clustering; Phrase index; Purity
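
Illustrative sketch: the abstract names two ingredients, a distance that combines cosine similarity with Euclidean distance, and purity as the evaluation measure. The abstract does not state how the two distances are combined, so the weighted sum below, the `alpha` parameter, and all function names are assumptions made for illustration, not the authors' formulation. Purity follows its standard definition: the fraction of documents that fall in the majority class of their assigned cluster.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector is treated
        # as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_distance(a, b, alpha=0.5):
    """Combine the two distances as a weighted sum.

    ASSUMPTION: the weighted sum and the hypothetical weight `alpha` are
    illustrative only; the abstract does not specify the combination.
    """
    return alpha * cosine_distance(a, b) + (1.0 - alpha) * euclidean_distance(a, b)


def purity(true_labels, cluster_ids):
    """Return the fraction of documents in the majority class of their cluster."""
    if not true_labels:
        raise ValueError("purity is undefined for an empty collection")
    per_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # Count, for each cluster, the documents carrying its most frequent label.
    majority_total = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return majority_total / len(true_labels)
```

For example, `purity(["sci", "sci", "rec", "rec"], [0, 0, 1, 1])` yields 1.0 because every cluster holds a single topic, whereas assigning one document of each topic to each cluster yields 0.5.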