期刊名称:International Journal of Computer Science & Technology
印刷版ISSN:2229-4333
电子版ISSN:0976-8491
出版年度:2014
卷号:5
期号:4
页码:303-305
语种:English
出版社:Ayushmaan Technologies
摘要:Duplicate detection consists in detecting multiple representations of a same real-world object, and that for every object represented in a data source. Duplicate detection is relevant in data cleaning and data integration applications and has been studied extensively for relational data describing a single type of object in a single table. This paper focuses on iterative duplicate detection in XML data. We consider detecting duplicates in multiple types of objects related to each other and devise methods adapted to semi-structured XML data. Relationships between different types of objects either form a hierarchical structure or a graph structure. Iterative duplicate detection require a similarity measure to compare pairs of object representations, called candidates, based on descriptive information of a candidate. The distinction between candidates and their description is not straightforward in XML, but we show that we can semi-automatically determine these descriptions using heuristics and conditions.