Journal:International Journal of Computer Science and Information Technologies
Electronic ISSN:0975-9646
Publication year:2014
Volume:5
Issue:3
Pages:3514-3520
Publisher:TechScience Publications
Abstract:Unsupervised learning explores unlabeled data to find intrinsic or hidden structure. Duplicate detection identifies records that represent the same real-world entity. In data mining, the amount of available data is growing exponentially, so linking or matching records from multiple web databases is a major challenge: it requires comparing each record in one database with every record in the other databases. Supervised learning methods fail in the web database scenario because the records to be matched are query dependent. Previous work addressed this setting with UDD, an online, untrained record linkage method. UDD effectively identifies matching record pairs that represent the same entity across multiple web databases, but it is time consuming. This paper focuses on improving the performance of UDD by adding a blocking step that uses canopy clustering, a computationally cheap clustering approach. Before records are classified as duplicates or non-duplicates, clustering is performed and blocks of candidate record pairs are generated. Experimental results show that blocking improves the performance of UDD in the web database scenario.
Keywords:Unsupervised learning; Record linkage; Duplicate Detection; Record Matching; Blocking; Canopy; Clustering; Web database; Query result
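
The blocking step described in the abstract can be illustrated with a minimal sketch of canopy clustering. This is not the paper's implementation: the token-based Jaccard distance, the threshold values `loose` and `tight`, and the function names `canopy_blocks` and `candidate_pairs` are all assumptions chosen for illustration. The idea is that a cheap distance groups records into overlapping canopies, and only pairs that share a canopy are passed on to the more expensive duplicate classifier.

```python
from itertools import combinations


def tokens(record):
    """Split a record string into a set of lowercase tokens."""
    return set(record.lower().split())


def jaccard_distance(a, b):
    """Cheap distance between two token sets (assumed here, not from the paper)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def canopy_blocks(records, loose=0.8, tight=0.4):
    """Group record indices into possibly overlapping canopies.

    Records within `loose` of a center join its canopy; records within
    `tight` of the center are also removed from further consideration.
    Both thresholds are hypothetical values for this sketch.
    """
    if not 0.0 <= tight <= loose:
        raise ValueError("thresholds must satisfy 0 <= tight <= loose")
    token_sets = [tokens(r) for r in records]
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]  # take the next unassigned record as center
        canopy = []
        still_remaining = []
        for i in remaining:
            d = jaccard_distance(token_sets[center], token_sets[i])
            if d <= loose:
                canopy.append(i)
            if d > tight:
                still_remaining.append(i)  # may still seed or join other canopies
        canopies.append(canopy)
        remaining = still_remaining
    return canopies


def candidate_pairs(canopies):
    """Return the record pairs that share at least one canopy."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


if __name__ == "__main__":
    records = [
        "canon eos 5d camera",
        "canon eos 5d digital camera",
        "nikon d90 body",
        "nikon d90 camera body",
    ]
    blocks = canopy_blocks(records)
    print(blocks)
    print(sorted(candidate_pairs(blocks)))
```

With these four example records, only pairs inside the same canopy would be compared, instead of all six possible pairs. In practice the thresholds trade recall against the number of comparisons, so they would need tuning for each data set.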