
Article information

  • Title: A Comparison of Algorithms used to measure the Similarity between two documents
  • Authors: Khuat Thanh Tung; Nguyen Duc Hung; Le Thi My Hanh
  • Journal: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)
  • Print ISSN: 2278-1323
  • Year of publication: 2015
  • Volume: 4
  • Issue: 4
  • Pages: 1117-1121
  • Publisher: Shri Pannalal Research Institute of Technology
  • Abstract: Measuring the similarity of documents plays an important role in text-related research and in applications such as document clustering, plagiarism detection, information retrieval, machine translation, and automatic essay scoring. Many methods have been proposed to solve this problem. They can be grouped into three main approaches: string-based, corpus-based, and knowledge-based similarity. In this paper, the similarity of two documents is gauged using two kinds of string-based measures: character-based and term-based algorithms. In the character-based method, n-grams are used to build the fingerprints for the fingerprint and winnowing algorithms, and the Dice coefficient is then used to match the two fingerprints. In the term-based measurement, the cosine similarity algorithm is used. This work compares the effectiveness of these algorithms for measuring the similarity between two documents. The results obtained show that the fingerprint and winnowing algorithms perform better than cosine similarity, and that the winnowing algorithm is more stable than the others.
  • Keywords: Cosine Similarity; Similarity Measure; Dice Coefficient; Fingerprint; Winnowing Algorithm
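
The measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the CRC32 hash, the n-gram length `k=5`, the window size `window=4`, and the whitespace tokenizer are all assumptions chosen for illustration, and the record does not state which settings the paper used. The sketch treats the plain "fingerprint" as the set of all n-gram hashes and the winnowed fingerprint as the set of per-window minimum hashes, then compares fingerprints with the Dice coefficient and term-frequency vectors with cosine similarity.

```python
import math
import zlib
from collections import Counter


def kgram_hashes(text, k=5):
    """Hash every character k-gram of the text (assumed normalization:
    lowercase, keep only alphanumeric characters)."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]


def winnow(hashes, window=4):
    """Winnowed fingerprint: the minimum hash of each sliding window."""
    if len(hashes) < window:
        # Too short for a full window: fall back to the single minimum.
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def dice(a, b):
    """Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints match
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(text_a, text_b):
    """Cosine similarity of term-frequency vectors (whitespace tokens)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def compare(doc_a, doc_b, k=5, window=4):
    """Return the three similarity scores for a pair of documents."""
    ha, hb = kgram_hashes(doc_a, k), kgram_hashes(doc_b, k)
    return {
        "fingerprint": dice(set(ha), set(hb)),
        "winnowing": dice(winnow(ha, window), winnow(hb, window)),
        "cosine": cosine(doc_a, doc_b),
    }
```

Calling `compare(doc_a, doc_b)` on two strings returns one score per measure, each in the range 0 to 1. Winnowing keeps far fewer hashes than the full fingerprint, which is the usual motivation for it; how the three scores rank on real documents depends on `k`, `window`, and the texts, so this sketch should not be read as reproducing the paper's results.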