Journal: International Journal of Electronics, Communication and Soft Computing Science and Engineering
Print ISSN: 2277-9477
Publication year: 2015
Volume: 4
Issue: Special 2
Publisher: IJECSCSE
Abstract: Duplicate detection is a major task in data processing and cleaning. In this paper we discuss various methods of duplicate detection for a dataset. Calculating edit distance is the most preferred approach for duplicate detection. A similarity join finds pairs of entities from two data sets whose similarity value is not less than a given threshold. The most commonly used approach is based on extracting overlapping characters from strings and considering as candidates only those strings that share a chosen number (q) of characters. While calculating edit distance, strings are divided into a number of small strings called chunks. The proposed system, the Modified VChunk Join (MVJoin) algorithm, uses a greedy approach to automatically select a suitable chunking scheme for a given dataset in order to find duplicates efficiently. This system implements an efficient approach for finding duplicate records in data sets as well as for linking records from various datasets into a single one. It is experimentally demonstrated that the MVJoin algorithm is faster than alternative methods while occupying less space.
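
The general idea the abstract describes, an edit-distance similarity join that uses shared overlapping substrings to pick candidates before computing the full distance, can be illustrated with a minimal sketch. This is not the MVJoin algorithm: it uses plain fixed-length q-grams rather than variable-length chunks and has no greedy selection of a chunking scheme. The function names (`edit_distance`, `qgrams`, `similarity_join`), the parameter `tau` (maximum allowed edit distance), and the sample records are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int) -> Counter:
    """Multiset of overlapping substrings of length q (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def similarity_join(left, right, tau, q=2):
    """Return pairs (s, t) with edit distance <= tau (hypothetical helper).

    Cheap filters discard most pairs before the exact distance is computed.
    """
    results = []
    for s, t in product(left, right):
        # Length filter: each edit changes the length by at most one.
        if abs(len(s) - len(t)) > tau:
            continue
        # Count filter: one edit destroys at most q q-grams, so strings
        # within distance tau must share at least this many q-grams.
        need = max(len(s), len(t)) - q + 1 - q * tau
        if need > 0:
            shared = sum((qgrams(s, q) & qgrams(t, q)).values())
            if shared < need:
                continue
        # Verification: exact edit distance on the surviving candidates.
        if edit_distance(s, t) <= tau:
            results.append((s, t))
    return results


if __name__ == "__main__":
    # Made-up records, used only to exercise the sketch.
    left = ["jonathan smith", "maria garcia"]
    right = ["jonathon smith", "mario garcia", "peter jones"]
    for pair in similarity_join(left, right, tau=1):
        print(pair)
```

The sketch compares every pair of records, which is only workable for small inputs; practical joins index the substrings so that candidate pairs are found without enumerating all combinations.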