Journal: International Journal of Engineering and Computer Science
Print ISSN: 2319-7242
Year: 2014
Volume: 3
Issue: 10
Pages: 8821-8822
Publisher: IJECS
Abstract: Clustering techniques are used to automatically organize or summarize large collections of text, and many approaches to clustering have been proposed. For the purposes of this work, we are particularly interested in two of them: coclustering and constrained clustering. This thesis proposes a novel constrained coclustering method to achieve two goals. First, it combines information-theoretic coclustering and constrained clustering to improve clustering performance. Second, it adopts both supervised and unsupervised constraints to demonstrate the effectiveness of the algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve the first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints, and then use an alternating expectation maximization (EM) algorithm to optimize the model. The thesis also proposes two novel methods to automatically construct and incorporate constraints in support of unsupervised constrained clustering: 1) automatic construction of document constraints and 2) automatic construction of word constraints. The evaluation results demonstrate the superiority of our approaches over a number of existing approaches. Unlike existing approaches, this thesis applies stop word removal, stemming, and synonym word replacement to capture semantic similarity between words in the documents. In addition, content can be retrieved from text files, HTML pages, and XML pages. Tags are eliminated from HTML files. In XML files, attribute names and values are treated as normal paragraph words, and the same preprocessing (stop word removal, stemming, and synonym word replacement) is then applied.
Keywords: Constrained clustering; coclustering; unsupervised constraints; text clustering
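
The preprocessing steps named in the abstract (tag removal for HTML, treating XML attribute names and values as ordinary words, then stop word removal, stemming, and synonym replacement) could be sketched roughly as follows. This is only an illustrative sketch using the Python standard library, not the authors' implementation. The stop word list, the synonym table, the function names, and the crude suffix-stripping stemmer are all assumptions made for the example; the abstract does not say which stemmer or synonym source is used.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical toy resources: the abstract does not list its stop words or synonyms.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}
SYNONYMS = {"automobile": "car"}  # keyed on stemmed forms, since stemming runs first


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Returns the text content of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def xml_to_text(xml):
    """Treats attribute names, attribute values, and element text as plain words."""
    words = []
    for elem in ET.fromstring(xml).iter():
        for name, value in elem.attrib.items():
            words.extend([name, value])
        if elem.text and elem.text.strip():
            words.append(elem.text.strip())
        if elem.tail and elem.tail.strip():
            words.append(elem.tail.strip())
    return " ".join(words)


def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Stop word removal, then stemming, then synonym replacement."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [naive_stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]
```

For example, `preprocess(html_to_text("<p>The <b>automobile</b> is parked</p>"))` drops the tags and stop words, stems "parked", and maps "automobile" to its canonical synonym. The resulting token lists would then feed whatever document-word matrix the coclustering step expects.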