Article Information

  • Title: Clustering Large Sparse Text Data: A Comparative Advantage Approach
  • Authors: Jie Ji; Tony Y. T. Chan; Qiangfu Zhao
  • Journal: Information and Media Technologies
  • Electronic ISSN: 1881-0896
  • Publication year: 2010
  • Volume: 5
  • Issue: 4
  • Pages: 1208-1217
  • DOI: 10.11185/imt.5.1208
  • Publisher: Information and Media Technologies Editorial Board
  • Abstract: Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents within each cluster share common concepts. To make the clusters easy to analyze, it is convenient to represent these concepts by key terms. However, when terms are used as features, text data is represented in a very high-dimensional vector space, and the computational cost is high. Text data is highly sparse, and not all weights in the cluster centers are important for classification. Based on this observation, this study proposes a comparative advantage-based clustering algorithm that identifies the relative strengths of the clusters and then preserves and enlarges those strengths. Because the vectors are represented by term frequency, the clustering results are more comprehensible than those of dimensionality reduction methods. Experimental results show that the proposed algorithm retains the characteristics of the k-means algorithm at a much lower computational cost. Moreover, the proposed method was also found to have a higher chance of producing better results.
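
The abstract observes that term-frequency vectors are sparse and that not every weight in a cluster center matters. The sketch below illustrates that general idea only; it is not the paper's comparative advantage algorithm, whose details are not given in this record. It is a plain k-means-style loop over sparse term-frequency dictionaries in which each center keeps only its largest weights. The pruning rule, the dot-product assignment, and the names `sparse_kmeans`, `truncate`, and `keep` are all assumptions made for illustration.

```python
from collections import defaultdict


def dot(a, b):
    """Sparse dot product; iterates over the smaller dictionary."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def truncate(center, keep):
    """Keep only the `keep` largest weights of a center.

    Hypothetical pruning rule (not taken from the paper); ties are
    broken alphabetically by term so the result is deterministic.
    """
    top = sorted(center.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
    return dict(top)


def mean(docs):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for d in docs:
        for t, w in d.items():
            total[t] += w
    return {t: w / len(docs) for t, w in total.items()}


def sparse_kmeans(docs, init_centers, keep, iters=10):
    """k-means-style clustering with truncated sparse centers.

    Documents are assigned to the center with the highest dot product
    (a similarity-based stand-in for the usual distance criterion).
    """
    centers = [truncate(c, keep) for c in init_centers]
    labels = [None] * len(docs)
    for _ in range(iters):
        new_labels = [
            max(range(len(centers)), key=lambda j: dot(d, centers[j]))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [d for d, l in zip(docs, labels) if l == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = truncate(mean(members), keep)
    return labels, centers


# Toy term-frequency vectors (made-up data for illustration).
docs = [
    {"cat": 2, "pet": 1},
    {"cat": 1, "pet": 2},
    {"stock": 2, "market": 1},
    {"stock": 1, "market": 2},
]
labels, centers = sparse_kmeans(docs, [docs[0], docs[2]], keep=1)
print(labels)
print(centers)
```

Because each center holds at most `keep` terms, the cost of one assignment step in this sketch scales with the number of retained terms rather than with the vocabulary size, and the retained terms can be read directly as cluster keywords.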