
Article Information

  • Title: MVJoin: An Efficient Approach for Record Linkage and Duplication Finding
  • Authors: Ms. Laxmi R. Adhav; Prof. Santosh D. Kumar
  • Journal: International Journal of Electronics, Communication and Soft Computing Science and Engineering
  • Print ISSN: 2277-9477
  • Year of publication: 2015
  • Volume: 4
  • Issue: Special 2
  • Publisher: IJECSCSE
  • Abstract: Duplicate detection is a major task in data processing and cleaning. This paper discusses various methods of duplicate detection for a dataset. Calculating edit distance is the most preferred approach for duplicate detection. A similarity join compares entities from two data sets and returns the pairs whose similarity value is not less than a given threshold. The most commonly used approach extracts overlapping character sequences from the strings and considers as candidates only those strings that share a chosen number (q) of characters. When calculating edit distance, strings are divided into a number of small substrings called chunks. The proposed system, the Modified VChunk Join (MVJoin) algorithm, uses a greedy approach to automatically select a suitable chunking scheme for a given dataset so that duplicates are found efficiently. The system implements an efficient approach for finding duplicate records within data sets as well as for linking records from several datasets into a single one. Experiments demonstrate that the MVJoin algorithm is faster than alternative methods while occupying less space.
  • Keywords: Chunks; CDB; Edit Distance; Gram; MVJoin; Similarity Join; Virtual CDB.
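
For orientation, the edit-distance similarity join that the abstract builds on can be sketched as a naive baseline. This is not the MVJoin algorithm, whose chunking scheme is not described in this record. The function names `edit_distance` and `naive_similarity_join` and the parameter `tau` are hypothetical, and the threshold is assumed to be a maximum edit distance (the distance-based counterpart of a minimum similarity value).

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_similarity_join(left, right, tau):
    """Return (r, s, d) for every pair whose edit distance d is at most tau.

    Hypothetical baseline: compares all pairs, with only a length filter.
    """
    result = []
    for r, s in product(left, right):
        # Edit distance is never smaller than the length difference,
        # so such pairs can be skipped without running the DP.
        if abs(len(r) - len(s)) > tau:
            continue
        d = edit_distance(r, s)
        if d <= tau:
            result.append((r, s, d))
    return result


if __name__ == "__main__":
    print(naive_similarity_join(["jon smith"], ["john smith", "jane doe"], tau=1))
```

The baseline verifies every pair that passes the length filter, which is quadratic in the number of records. Gram-based and chunk-based methods such as the one the abstract proposes aim to shrink that candidate set before the edit distance is computed.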