Article Information

Title: Analysis of Stemming Algorithm for Text Clustering
Authors: N. Sandhya; Y. Srilalitha; V. Sowmya et al.
Journal: International Journal of Computer Science Issues
Print ISSN: 1694-0784
Online ISSN: 1694-0814
Year of publication: 2011
Volume: 8
Issue: 5
Publisher: IJCSI Press
Abstract: Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In the bag-of-words representation of documents, the words that appear often have many morphological variants; in most cases these variants have similar semantic interpretations and can be treated as equivalent for clustering purposes. For this reason, a number of stemming algorithms, or stemmers, have been developed that attempt to reduce a word to its stem or root form, so that the key terms of a document are represented by stems rather than by the original words. In this work we study the impact of a stemming algorithm, together with four popular similarity measures (Euclidean, cosine, Pearson correlation, and extended Jaccard) and three types of vector representation (Boolean, term frequency, and term frequency-inverse document frequency), on cluster quality. To cluster documents we use the partitional clustering technique K-means. Performance is measured against a human-imposed classification of the Classic data set. We conducted a number of experiments and used the entropy measure to assure statistical significance of the results. Cosine, Pearson correlation, and extended Jaccard similarities emerge as the best measures for capturing human categorization behavior, while the Euclidean measure performs poorly. After the stemming algorithm is applied, the Euclidean measure shows little improvement.
Keywords: Text clustering; Stemming Algorithm; Similarity Measures; Cluster Accuracy.
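
The pipeline named in the abstract (stemming, three vector representations, four similarity measures) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record does not say which stemmer the authors used, so `toy_stem` is a deliberately crude, hypothetical suffix stripper; the IDF weight is taken as log(N/df), one common variant; and the three sample documents are invented. The extended Jaccard measure is written in its usual Tanimoto form.

```python
import math
from collections import Counter


def toy_stem(word):
    # Hypothetical, deliberately crude suffix stripper for illustration only;
    # it is NOT the stemming algorithm evaluated in the paper.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_vectors(docs, scheme):
    """Return (vocab, vectors) under 'boolean', 'tf', or 'tfidf' weighting."""
    tokenized = [[toy_stem(w) for w in d.lower().split()] for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each stem.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = [1.0 if tf[t] else 0.0 for t in vocab]
        elif scheme == "tf":
            vec = [float(tf[t]) for t in vocab]
        elif scheme == "tfidf":
            # Assumed IDF variant: log(N / df).
            vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vocab, vectors


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def pearson_correlation(a, b):
    # Pearson correlation equals the cosine of the mean-centered vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def extended_jaccard(a, b):
    # Tanimoto form: a.b / (|a|^2 + |b|^2 - a.b).
    d = dot(a, b)
    denom = dot(a, a) + dot(b, b) - d
    return d / denom if denom else 0.0


# Invented sample documents, used only to exercise the functions.
docs = [
    "clustering documents clusters",
    "clustered document",
    "stemming reduces words",
]
vocab, vectors = build_vectors(docs, "tf")
for name, fn in [
    ("euclidean distance", euclidean_distance),
    ("cosine", cosine_similarity),
    ("pearson", pearson_correlation),
    ("extended jaccard", extended_jaccard),
]:
    print(name, round(fn(vectors[0], vectors[1]), 4))
```

After stemming, the first two sample documents share the stems `cluster` and `document`, which is the effect the abstract describes: morphological variants collapse into one term before similarity is computed. Note that Euclidean distance is a dissimilarity (smaller means closer), whereas the other three are similarities.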