Journal: International Journal of Computer Science and Information Technologies
Electronic ISSN: 0975-9646
Publication year: 2017
Volume: 8
Issue: 3
Pages: 352-356
Publisher: TechScience Publications
Abstract: Record deduplication is one of the challenging research areas in data mining. In most organizations, storage systems hold duplicate copies of many pieces of data. Data deduplication is a specialized data compression technique used to remove duplicate copies of repeating data. Previous research used genetic programming based record deduplication, which combined various pieces of evidence extracted from the data content. However, the true positive level of that system is low, which degrades the performance of the record deduplication system. To solve this problem, a Hidden Markov Model (HMM) based record deduplication method is proposed. In the HMM, records with different attributes are called states, and the similarity functions between pairs of records are called transitions. The attribute information of the data records is cleaned, standardised, and processed with Hidden Markov Models (HMMs). The performance of the system is evaluated using the Restaurants data set and the Cora Bibliographic data set. The HMM-based output identifies the duplicate and non-duplicate records in the data. The system improves the true positive level.