Article Information

  • Title: Text Document Clustering by Using Semi-supervised Learning and Outlier Detection
  • Authors: Bahman Askari; Sattar Hashemi
  • Journal: International Journal of Mechatronics, Electrical and Computer Technology
  • Print ISSN: 2305-0543
  • Publication year: 2017
  • Volume: 7
  • Issue: 23
  • Pages: 3255-3262
  • Publisher: Austrian E-Journals of Universal Scientific Organization
  • Abstract: Text document clustering is the process of grouping similar documents into clusters. Clustering is an unsupervised categorization technique that divides the objects of a dataset into a specified number of clusters according to a similarity or dissimilarity criterion, so that the resulting clusters are as distinct from one another as possible while similarity within each cluster is maximized. The K-means algorithm is one of the best-known and most popular clustering techniques because it is easy to understand and implement, and its time complexity is approximately linear. However, K-means is sensitive to outliers in the data, highly sensitive to the choice of initial centers, and dependent on knowing the correct number of clusters. To overcome these drawbacks, we propose a novel three-stage method. In the first stage, the ODBD algorithm is applied to detect outliers and divide the data into two groups: normal objects and outlier objects. In the second stage, the FICBC algorithm is run on the normal objects to calculate the cluster centers, using additional prior knowledge such as K (the number of clusters) and the must-link and cannot-link sets. In the final stage, we take the centers obtained in the previous stage, iterate over the outliers, and calculate the Euclidean distance between each outlier and each center; each outlier is then assigned to the closest cluster. The proposed method is run on UCI data and the results are compared with those of other clustering methods. Experiments show that the proposed method achieves significantly better results than previous clustering approaches.
  • Keywords: Text Document Clustering; Outlier Detection; Semi-supervised Learning; K-means
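
The final stage described in the abstract, assigning each outlier to the cluster whose center is closest in Euclidean distance, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name assign_outliers and the sample points are hypothetical, and the centers are assumed to have already been produced by an earlier stage (the ODBD and FICBC algorithms are not specified in this record, so they are not reproduced here).

```python
import math


def assign_outliers(outliers, centers):
    """Return, for each outlier, the index of its nearest center.

    Hypothetical helper: distances are Euclidean, as stated in the
    abstract. Ties go to the lowest-indexed center.
    """
    labels = []
    for point in outliers:
        # Euclidean distance from this outlier to every cluster center.
        distances = [math.dist(point, center) for center in centers]
        # Index of the smallest distance is the assigned cluster.
        labels.append(distances.index(min(distances)))
    return labels


# Made-up 2-D data for illustration; real document vectors would be
# high-dimensional (for example, term-weight vectors).
centers = [(0.0, 0.0), (10.0, 10.0)]
outliers = [(1.0, 2.0), (9.0, 7.0), (4.0, 4.0)]
labels = assign_outliers(outliers, centers)
```

Because the outliers are excluded while the centers are computed and only attached afterwards, they cannot pull the centers away from the bulk of the data, which is the sensitivity of K-means that the abstract sets out to address.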