Article information

Title: Text Clustering Using a Suffix Tree Similarity Measure
Authors: HUANG, Chenghui; YIN, Jian; HOU, Fang et al.
Journal: Journal of Computers
Print ISSN: 1796-203X
Publication year: 2011
Volume: 6
Issue: 10
Pages: 2180-2186
DOI: 10.4304/jcp.6.10.2180-2186
Language: English
Publisher: Academy Publisher
Abstract: In text mining, popular methods use bag-of-words models, which represent a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to certain domains. This paper proposes a new similarity measure based on a suffix tree model of text documents. It analyzes word sequence information and then computes the similarity between the documents of a corpus by applying a suffix tree similarity combined with the TF-IDF weighting method. Experimental results on the standard document benchmark corpora Reuters and BBC indicate that the new text similarity measure is effective. Compared with the results of two other methods based on frequent word sequences, the proposed method improves the average F-measure score by about 15%.
Keywords: clustering algorithm; suffix tree; document model; similarity measure
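
Illustrative sketch (not taken from the paper): the abstract describes weighting suffix tree features with TF-IDF and comparing documents by similarity. The Python below is a minimal approximation under stated assumptions, not the authors' implementation. It assumes that each node of a word-level suffix tree can be stood in for by a distinct word sequence of bounded length, that weights are a log-scaled TF times a smoothed IDF, and that documents are compared by cosine similarity. The function names, the `max_len` depth cap, and the exact weighting formula are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def phrase_counts(text, max_len=3):
    """Count every contiguous word sequence of up to max_len words.

    Assumption: each distinct sequence stands in for one node of a
    word-level suffix tree truncated at depth max_len.
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        stop = min(start + max_len, len(words))
        for end in range(start + 1, stop + 1):
            counts[tuple(words[start:end])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Build one sparse TF-IDF vector (dict) per document.

    Assumed weighting: (1 + ln tf) * ln(1 + N / df), which stays positive
    even for phrases that occur in every document.
    """
    counts = [phrase_counts(doc, max_len) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document contributes 1 per phrase
    n_docs = len(docs)
    return [
        {
            phrase: (1 + math.log(tf)) * math.log(1 + n_docs / doc_freq[phrase])
            for phrase, tf in c.items()
        }
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[phrase] for phrase, weight in u.items() if phrase in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Calling `tfidf_vectors` on a list of document strings and then `cosine` on any pair of the returned vectors yields a pairwise similarity in the range 0 to 1 that a clustering algorithm could consume. Unlike a plain bag-of-words vector, shared multi-word sequences add to the score, which is the word-order information the abstract says bag-of-words models ignore.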