Article Information

Title: Text Document Clustering by Using Semi-supervised Learning and Outlier Detection
Authors: Bahman Askari; Sattar Hashemi
Journal: International Journal of Mechatronics, Electrical and Computer Technology
Print ISSN: 2305-0543
Year of publication: 2017
Volume: 7
Issue: 23
Pages: 3255-3262
Publisher: Austrian E-Journals of Universal Scientific Organization
Abstract: Text document clustering is the process of grouping similar documents into clusters. Clustering is an unsupervised categorization technique that divides the objects of a dataset into a specified number of clusters according to a similarity or dissimilarity criterion. The aim is for the resulting clusters to be as distinct from one another as possible while similarity within each cluster is maximized. The K-means algorithm is one of the best-known and most widely used clustering techniques because it is easy to understand and implement, and its time complexity is approximately linear. However, K-means is affected by outliers in the dataset, is highly sensitive to the choice of initial centers, and requires the correct number of clusters to be known. To overcome these drawbacks, we propose a novel three-stage method. In the first stage, the ODBD algorithm is applied to detect outliers and divide the dataset into two groups: normal objects and outlier objects. In the second stage, the FICBC algorithm is run on the normal objects to compute the cluster centers intelligently, using additional information and prior knowledge such as K (the number of clusters) and the cannot-link and must-link sets. In the final stage, we take the centers obtained in the previous stage and, in a single pass over the outliers, compute the distance between each outlier and each center; every outlier is then assigned to the closest cluster. Euclidean distance is used for this calculation. The proposed method is run on the UCI dataset and the results are compared with those of other clustering methods. Experiments show that the proposed method achieves significantly better results than previous clustering approaches.
Keywords: Text Document Clustering; Outlier Detection; Semi-supervised Learning; K-means
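
The final stage described in the abstract, assigning each outlier to the cluster whose center is closest in Euclidean distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_outliers`, the sample centers, and the sample outliers are hypothetical, and the centers are assumed to have already been produced by the earlier stages (ODBD and FICBC are not reproduced here).

```python
import math


def assign_outliers(outliers, centers):
    """Return, for each outlier, the index of its nearest center.

    Hypothetical helper: `centers` stands in for the cluster centers
    that the second stage would supply; distance is Euclidean.
    """
    labels = []
    for point in outliers:
        # Euclidean distance from this outlier to every cluster center.
        distances = [math.dist(point, center) for center in centers]
        # The outlier joins the cluster with the smallest distance.
        labels.append(distances.index(min(distances)))
    return labels


# Illustrative data only (not taken from the paper).
centers = [(0.0, 0.0), (10.0, 10.0)]
outliers = [(1.0, 2.0), (9.0, 7.0), (4.0, 4.0)]
labels = assign_outliers(outliers, centers)
print(labels)
```

Because the centers are held fixed during this step, the outliers cannot pull them away from the positions computed on the normal objects, which appears to be the motivation for handling outliers separately.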