
Article Information

  • Title: Hybrid Distance Based Document Clustering with Keyword and Phrase Indexing
  • Authors: Subhadra Kompella; M. Shashi
  • Journal: International Journal of Computer Science Issues
  • Print ISSN: 1694-0784
  • Electronic ISSN: 1694-0814
  • Publication year: 2012
  • Volume: 9
  • Issue: 2
  • Publisher: IJCSI Press
  • Abstract: Document clustering algorithms group a set of documents into subsets, or clusters. Clustering has several applications in information retrieval. Our proposed method uses the Scatter-Gather approach to cluster groups of documents from an entire collection. The selected groups are merged and the resulting set is clustered again. This process is repeated until a cluster of interest is found. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The clustering approach uses the popular cosine similarity measure combined with the Euclidean distance measure. To the best of our knowledge, much work has been carried out on keyword-based clustering and on phrase-index-based clustering. Our method attempts to combine the two. The method has been applied to the standard NewsGroup-20 dataset, whose documents are distributed over 20 different topics. Results were verified in two settings: a fixed number of clusters with different corpora, and a variable number of clusters with a fixed corpus. Both sets of results indicate a steady increase in the overall purity of clustering compared to the keyword-based clustering method. With keyword-based clustering, purity increased as the number of clusters increased for a fixed corpus, but decreased as the number of corpora increased for a fixed number of clusters. In our method, the increase in purity was more pronounced as the number of clusters increased.
  • Keywords: Document clustering; Phrase index; Purity
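
The abstract names two ingredients that are easy to illustrate: a distance that combines cosine similarity with Euclidean distance, and purity as the evaluation measure. The sketch below is only an illustration under stated assumptions. The abstract does not say how the two measures are combined, so the weighted blend in `hybrid_distance` and its `alpha` parameter are hypothetical and are not the authors' formula. Purity follows the standard definition: each cluster is credited with its most frequent true label, and the credited documents are divided by the total number of documents.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here: a zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_distance(a, b, alpha=0.5):
    """Hypothetical blend of the two measures (assumption, not from the paper).

    alpha=1.0 uses only the cosine distance; alpha=0.0 uses only Euclidean.
    """
    return alpha * cosine_distance(a, b) + (1 - alpha) * euclidean_distance(a, b)


def purity(cluster_ids, true_labels):
    """Fraction of documents that carry the majority label of their cluster."""
    counts_per_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(true_labels)
```

For example, `purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])` credits two documents in each cluster, giving 4/6. Term vectors built from both keywords and phrases could be passed to `hybrid_distance` in the same way as plain keyword vectors.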