Article Information

Title: Effectiveness of Different Similarity Measures for Text Classification and Clustering
Authors: Komal Maher; Madhuri S. Joshi
Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Publication year: 2016
Volume: 7
Issue: 4
Pages: 1715-1720
Publisher: TechScience Publications
Abstract: Nowadays, people deal with large amounts of data on a regular basis. The data are generated solely to meet immediate needs, with no attempt made to organize them for efficient later retrieval. Data mining is the process of extracting knowledge from such enormous amounts of data. In the text processing field, there are many techniques for classifying and clustering data that exist in a structured format, based on the similarity between documents. Clustering algorithms require a metric to quantify how different two given documents are. This difference is often measured by a measure such as Euclidean distance, cosine similarity, Jaccard correlation, or the Similarity measure for text processing, to name a few. In this research work, we experiment with three measures: Euclidean distance, cosine similarity, and the Similarity measure for text processing. The effectiveness of these three measures is evaluated on a real-world data set for text classification and clustering problems. The results show that the performance obtained with the Similarity measure for text processing is better than that achieved with the other measures.
Keywords: Document classification; document clustering; entropy; accuracy; classifiers; clustering algorithms