Article information

  • Title: Detection of near duplicates in tables based on the locality-sensitive hashing method and the nearest neighbor method
  • Authors: Petro Lizunov; Andrii Biloshchytskyi; Alexander Kuchansky
  • Journal: Eastern-European Journal of Enterprise Technologies
  • Print ISSN: 1729-3774
  • Electronic ISSN: 1729-4061
  • Year: 2016
  • Volume: 6
  • Issue: 4
  • Pages: 4-10
  • DOI: 10.15587/1729-4061.2016.86243
  • Language: English
  • Publisher: PC Technology Center
  • Abstract: A hybrid method for detecting near duplicates in tables is proposed. The method identifies similarities in the text data and the numeric data of tables separately and then generalizes the results. For text data, sequences of words are brought to a canonized form, and bit sequences are constructed from them using locality-sensitive hashing. Similarity between text data is then determined by the Hamming distance at an assigned threshold value. Similarities between numeric data in tables are identified using the nearest neighbor method with assigned distance metrics. The method makes it possible to identify near duplicates in the data of an input table relative to a set of tables selected from scientific publications, dissertations, and theses. The method is designed for tables that contain only text and numeric data. If the examined tables contain pictures or formulas, these objects are examined separately using specialized methods. The proposed method might be implemented in systems intended for intelligent analysis of information represented by text and tables, in order to identify similarities and detect near duplicates, in particular in anti-plagiarism systems.
  • Keywords: near duplicate; similarity; locality-sensitive hashing method; nearest neighbor method
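
The two stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which locality-sensitive hashing scheme or which distance metric is used, so SimHash and Euclidean distance are assumed here, and the function names, the 64-bit fingerprint size, and the threshold of 3 are all hypothetical choices made for illustration.

```python
import hashlib
import math
import re


def canonize(text):
    """Lowercase and split into word tokens (a simple stand-in for canonization)."""
    return re.findall(r"\w+", text.lower())


def simhash(tokens, bits=64):
    """Build a SimHash fingerprint (assumed LSH scheme) as an integer of `bits` bits."""
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its positive votes dominate.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def text_similar(a, b, threshold=3, bits=64):
    """Treat two text cells as near duplicates if their fingerprints are close."""
    fa = simhash(canonize(a), bits)
    fb = simhash(canonize(b), bits)
    return hamming(fa, fb) <= threshold


def nearest_neighbor(query, candidates):
    """Return (index, distance) of the numeric row closest to `query` (Euclidean)."""
    best = min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))
    return best, math.dist(query, candidates[best])
```

In this sketch a text cell is compared by fingerprint rather than by exact string, so cells that differ only in case or punctuation produce identical fingerprints. A numeric row is matched to its closest row in the reference set, and the caller would decide whether the returned distance is small enough to count as a near duplicate.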