Journal: International Journal on Electrical Engineering and Informatics
Print ISSN: 2085-6830
Year of publication: 2019
Volume: 11
Issue: 4
Pages: 822-832
DOI: 10.15676/ijeei.2019.11.4.13
Publisher: School of Electrical Engineering and Informatics
Abstract: Word sequences have been considered an appropriate text representation, since text is inherently sequential. Such representations include Frequent Word Sequence (FWS), Set of Frequent Word Sequence (SFWS), and Frequent Word Itemsets (FWI). Moreover, Maximal Frequent Sequence (MFS) is a text feature that exploits the sequential property of textual data. In this paper, we propose SFWS as the best text representation for document clustering. SFWS treats a document as a set of sentences, where the sentence is the highest level of the grammatical hierarchy and conveys a complete thought. Consequently, document clustering is expected to yield more accurate results. The main contributions of this work are the data pre-processing, feature extraction, and feature selection based on SFWS. Since SFWS operates on sentences, the sentences of all documents must first be assembled into a sequence database of sentences. Sequential pattern mining is then applied to extract the set of frequent sentence sequences. Finally, features are selected using the maximal set of frequent sequences (MSFS). We conducted experiments on the Twenty News Group Text Data (TNTD). To do so, we developed a Feature Based Clustering (FBC) algorithm with MSFS as the text feature, based on the SFWS representation. The experimental results showed that document clustering based on SFWS achieved the highest accuracy compared with FWS and FWI.
Keywords: Frequent Word Sequence (FWS); Set of Frequent Word Sequence (SFWS); Frequent Word Itemset (FWI); Maximal Frequent Sequence (MFS); document clustering; Feature Based Clustering
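
The abstract describes building a sequence database from sentences, mining frequent sequences, and keeping only the maximal ones. The following is a minimal sketch of that general idea on toy data, not the authors' implementation: it assumes each sentence is a tuple of words, that support is the number of sentences containing a pattern as a (not necessarily contiguous) subsequence, and that a pattern is maximal when no other frequent pattern contains it. The function names and the example sentences are hypothetical, and the paper's own MSFS and FBC definitions may differ.

```python
from typing import List, Tuple

WordSeq = Tuple[str, ...]


def is_subsequence(pattern: WordSeq, sentence: WordSeq) -> bool:
    """True if the words of `pattern` occur in `sentence` in the same order."""
    remaining = iter(sentence)
    # `word in remaining` advances the iterator, so order is enforced.
    return all(word in remaining for word in pattern)


def frequent_sequences(database: List[WordSeq], min_support: int) -> List[WordSeq]:
    """Level-wise mining: grow each frequent pattern by one word at a time."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    vocabulary = sorted({word for sentence in database for word in sentence})
    frequent: List[WordSeq] = []
    level: List[WordSeq] = [()]
    while level:
        next_level: List[WordSeq] = []
        for prefix in level:
            for word in vocabulary:
                candidate = prefix + (word,)
                support = sum(is_subsequence(candidate, s) for s in database)
                if support >= min_support:
                    next_level.append(candidate)
        frequent.extend(next_level)
        level = next_level
    return frequent


def maximal_sequences(frequent: List[WordSeq]) -> List[WordSeq]:
    """Keep only patterns that are not contained in another frequent pattern."""
    return [
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    ]


# Toy sequence database of sentences (illustrative only, not TNTD data).
sentence_db: List[WordSeq] = [
    ("text", "mining", "finds", "patterns"),
    ("text", "mining", "needs", "data"),
    ("document", "clustering", "needs", "data"),
]
maximal = maximal_sequences(frequent_sequences(sentence_db, min_support=2))
print(maximal)
```

With a minimum support of 2, the single words shared by two sentences are frequent but are absorbed by the longer patterns that contain them, which is the reduction in feature count that selecting maximal sequences is meant to provide.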