Journal title: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2014
Volume: 60
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: Text clustering groups documents with high levels of similarity and has found applications in different areas of text mining and information retrieval. The volume of digital data available today has grown enormously, and retrieving useful information from it is a major challenge. Text clustering is an important means of organizing such data and extracting useful information from the available corpus. In this paper, we propose a novel method for clustering text documents. In the first phase, features are selected using a genetic-based method. In the next phase, the extracted keywords are clustered using a hybrid algorithm, and the clusters are classed under meaningful topics. The Must Link and Cannot Link (MLCL) algorithm works in three phases. First, it identifies the linked keywords produced by the genetic-based extraction method. Second, it forms the initial clusters. Finally, the clusters are optimized using Gaussian parameters. The proposed method is tested on datasets such as Reuters-21578 and the Brown Corpus. The experimental results show that the proposed method outperforms the fuzzy self-constructing feature clustering algorithm.
Keywords: Genetic Algorithm; Keyword Extraction; Text Clustering; MLCL Algorithm.
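
Illustrative sketch (not taken from the paper): the abstract does not describe how MLCL forms its initial clusters, so the following is only a generic example of must-link and cannot-link constraints, assuming that must-link pairs are merged transitively and that a cannot-link pair ending up in the same group is treated as an error. The function name `initial_clusters`, the sample keywords, and the constraint pairs are all hypothetical, and the genetic-based extraction and Gaussian optimization phases are not modeled.

```python
from collections import defaultdict


def initial_clusters(keywords, must_link, cannot_link):
    """Group keywords by must-link pairs, then verify cannot-link pairs.

    Generic constrained-grouping sketch; NOT the paper's MLCL algorithm.
    """
    # Union-find structure: every keyword starts as its own root.
    parent = {k: k for k in keywords}

    def find(k):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Must-link: merge the groups of both keywords (transitively).
    for a, b in must_link:
        parent[find(a)] = find(b)

    # Cannot-link: the two keywords must not share a group.
    for a, b in cannot_link:
        if find(a) == find(b):
            raise ValueError(f"conflicting constraints for {a!r} and {b!r}")

    groups = defaultdict(set)
    for k in keywords:
        groups[find(k)].add(k)
    return list(groups.values())


# Hypothetical keywords and constraints, for illustration only.
keywords = ["stock", "share", "market", "wheat", "grain"]
must_link = [("stock", "share"), ("share", "market"), ("wheat", "grain")]
cannot_link = [("stock", "wheat")]
clusters = initial_clusters(keywords, must_link, cannot_link)
```

With these sample inputs the keywords fall into two groups, one for the finance terms and one for the grain terms. A fuller treatment would use the cannot-link pairs to guide later merging rather than only to validate the result.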