Journal: International Journal of Innovative Research in Computer and Communication Engineering
Print ISSN: 2320-9798
Electronic ISSN: 2320-9801
Publication year: 2015
Volume: 3
Issue: 9
DOI: 10.15680/IJIRCCE.2015.0309080
Publisher: S&S Publications
Abstract: With the ever-increasing volume of data, data quality problems abound. Duplicates, that is, multiple yet different representations of the same real-world object in data, are among the most intriguing of these problems, and their effects are detrimental. For instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Duplicate detection is the process of identifying multiple representations of the same real-world entity, and automating it is difficult. Duplicate detection methods must now process ever larger datasets in ever shorter time, so maintaining the quality of a dataset becomes increasingly difficult. A genetic algorithm is proposed that significantly increases the efficiency of finding duplicates when execution time is limited. It efficiently detects duplicated text documents: those with the same content under distinct file names, or different content under the same file name.