Basic Article Information

Title: Clustering Large Sparse Text Data: A Comparative Advantage Approach
Authors: Jie Ji; Tony Y. T. Chan; Qiangfu Zhao et al.
Journal: Information and Media Technologies
Electronic ISSN: 1881-0896
Publication year: 2010
Volume: 5
Issue: 4
Pages: 1208-1217
DOI: 10.11185/imt.5.1208
Publisher: Information and Media Technologies Editorial Board
Abstract: Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents within each cluster share some common concepts. To make the clusters easy to analyze, it is convenient to represent these concepts with a few key terms. However, when terms are used as features, the text data is represented in a very high-dimensional vector space, and the computational cost is high. Text data is also highly sparse, and not all weights in the cluster centers are important for classification. Based on this observation, we propose a comparative advantage-based clustering algorithm that identifies the relative strengths of the clusters and then keeps and enlarges those strengths. Since the vectors are represented by term frequency, the clustering results are more comprehensible than those of dimensionality reduction methods. Experimental results show that the proposed algorithm keeps the characteristics of the k-means algorithm at a much lower computational cost. Moreover, we found that the proposed method has a higher chance of obtaining better results.
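The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes documents are sparse term-frequency dictionaries and uses a hypothetical pruning rule: each term is kept only in the cluster center where its share of the weight is largest, so each center retains just the terms in which it is relatively strong. The function names (`mean_center`, `keep_advantage`, `assign`, `cluster`), the pruning rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_center(docs):
    """Average a list of sparse term-frequency vectors (dicts of term -> count)."""
    center = defaultdict(float)
    for doc in docs:
        for term, freq in doc.items():
            center[term] += freq / len(docs)
    return dict(center)


def keep_advantage(centers):
    """Hypothetical pruning rule (an assumption, not taken from the paper):
    keep each term only in the center where its normalized weight is largest."""
    shares = []
    for center in centers:
        total = sum(center.values()) or 1.0
        shares.append({t: w / total for t, w in center.items()})
    pruned = [dict() for _ in centers]
    for term in set().union(*shares):
        # Ties go to the lowest-numbered cluster.
        best = max(range(len(shares)), key=lambda i: shares[i].get(term, 0.0))
        pruned[best][term] = shares[best][term]
    return pruned


def assign(doc, centers):
    """Assign a document to the center with the largest sparse dot product."""
    scores = [sum(f * c.get(t, 0.0) for t, f in doc.items()) for c in centers]
    return max(range(len(centers)), key=lambda i: scores[i])


def cluster(docs, seeds, iterations=5):
    """k-means-style loop whose centers are pruned after every update."""
    centers = keep_advantage([dict(docs[i]) for i in seeds])
    labels = [0] * len(docs)
    for _ in range(iterations):
        labels = [assign(d, centers) for d in docs]
        groups = [[d for d, l in zip(docs, labels) if l == k]
                  for k in range(len(centers))]
        # An empty cluster keeps its previous center.
        updated = [mean_center(g) if g else centers[k]
                   for k, g in enumerate(groups)]
        centers = keep_advantage(updated)
    return labels, centers


# Toy corpus (invented for illustration): two topics sharing the term "data".
docs = [
    {"cluster": 2, "kmeans": 1, "data": 1},
    {"cluster": 1, "kmeans": 2},
    {"trade": 2, "economy": 1, "data": 1},
    {"trade": 1, "economy": 2},
]
labels, centers = cluster(docs, seeds=[0, 2])
print(labels)
print(centers)
```

Because each pruned center stores only a small subset of the vocabulary, the dot product in `assign` touches few entries per document, which is one plausible reading of where the lower computational cost reported in the abstract could come from. The surviving keys of each center are ordinary terms, so they can be read directly as a description of the cluster.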