Article Information

  • Title: Probabilistic, Statistical and Algorithmic Aspects of the Similarity of Texts and Application to Gospels Comparison
  • Authors: Soumaila Dembele; Gane Samb Lo
  • Journal: Journal of Data Analysis and Information Processing
  • Print ISSN: 2327-7211
  • Online ISSN: 2327-7203
  • Publication year: 2015
  • Volume: 03
  • Issue: 04
  • Pages: 112-127
  • DOI: 10.4236/jdaip.2015.34012
  • Language: English
  • Publisher: Scientific Research Publishing
  • Abstract: A fundamental problem in similarity studies, within the framework of data mining, is to detect similar items across very large articles, papers, and books. In this paper, we are interested in the probabilistic, statistical, and algorithmic aspects of the study of text similarity. We use the k-shingling approach, where a k-shingle is a sequence of k consecutive characters extracted from a text (k ≥ 1). The main challenge in this field is to find accurate and fast algorithms for computing similarity. We address it with approximation methods. The first approximation method is statistical and is based on the Glivenko-Cantelli theorem. The second is the banding technique. The third is a modification of the algorithm proposed by Rajaraman et al. ([1]), denoted here as (RUM). The Jaccard index is the similarity measure used throughout the paper. Finally, we illustrate the results on the four Gospels. The results are very conclusive.
  • Keywords: Similarity; Web Mining; Jaccard Similarity; RU Algorithm; Minhashing; Data Mining; Shingling; Bible’s Gospels; Glivenko-Cantelli; Expected Similarity; Statistical Estimation
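
The two basic notions named in the abstract, k-shingles and the Jaccard index, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `shingles` and `jaccard`, the toy strings, and the convention of returning 1.0 for two empty sets are assumptions made for illustration only. It computes the exact similarity and does not cover the approximation methods the paper studies.

```python
from __future__ import annotations


def shingles(text: str, k: int) -> set[str]:
    """Return the set of k-shingles: all substrings of k consecutive characters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A text shorter than k characters yields an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two shingle sets."""
    union = a | b
    if not union:
        # Assumption for this sketch: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Toy example with k = 2 (hypothetical strings, not taken from the paper).
s1 = shingles("abcd", 2)   # {"ab", "bc", "cd"}
s2 = shingles("bcde", 2)   # {"bc", "cd", "de"}
print(jaccard(s1, s2))     # 2 shared shingles out of 4 distinct ones: 0.5
```

Computing the exact index this way requires building and comparing the full shingle sets, which is the cost that approximation methods for very large texts aim to reduce.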