Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2014
Volume: 68
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. Here, clustering is performed with the Lingo algorithm on content extracted from the documents. The data is stored in XML, which can manage large volumes of data. Lingo combines several existing methods and, beyond identifying document similarities, places special emphasis on meaningful cluster descriptions. The process involves building the term-document matrix and then extracting frequent phrases using suffix arrays. Readable and unambiguous descriptions of the thematic groups are an important factor in the overall quality of clustering. The Lingo algorithm consists of five phases: pre-processing, frequent phrase extraction, cluster label induction, cluster content discovery, and final cluster formation.
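The two steps named in the abstract, building the term-document matrix and extracting frequent phrases with a suffix array, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`tokenize`, `term_document_matrix`, `frequent_phrases`), the raw-count weighting, the naive sort-based suffix array, and the `min_freq` and `max_len` parameters are all assumptions made for illustration. The sketch also omits Lingo's later phases (label induction, content discovery, final cluster formation) and XML parsing.

```python
import re
from itertools import groupby

SENTINEL = "\x00"  # hypothetical marker prefix; never matched by tokenize()


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (a simplistic pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the raw count of vocab[i] in docs[j]."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    row = {t: i for i, t in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(tokenized):
        for t in doc:
            matrix[row[t]][j] += 1
    return vocab, matrix


def frequent_phrases(docs, min_freq=2, max_len=3):
    """Count word phrases of up to max_len words that occur at least min_freq times."""
    tokens = []
    for k, doc in enumerate(docs):
        tokens.extend(tokenize(doc))
        tokens.append(f"{SENTINEL}{k}")  # unique separator: phrases never span documents

    # Naive word-level suffix array: start positions sorted by the suffix they begin.
    # Quadratic in the worst case; acceptable for a sketch, not for large corpora.
    suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    phrases = {}
    for n in range(1, max_len + 1):
        prefixes = (tuple(tokens[i:i + n]) for i in suffix_array)
        valid = (
            p for p in prefixes
            if len(p) == n and not any(t.startswith(SENTINEL) for t in p)
        )
        # Suffixes sharing the same n-word prefix are adjacent in sorted order,
        # so each run of equal prefixes is one phrase and its length is the count.
        for phrase, run in groupby(valid):
            count = sum(1 for _ in run)
            if count >= min_freq:
                phrases[phrase] = count
    return phrases
```

For example, given the three short documents "data mining tools", "web data mining", and "data mining for web search", `frequent_phrases` with the default parameters would report the phrase ("data", "mining") with a count of 3. A production version would typically weight the matrix (for instance with tf-idf) and restrict the output to complete phrases, which this sketch does not attempt.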