Journal: International Journal of Computer Science & Technology
Print ISSN: 2229-4333
Electronic ISSN: 0976-8491
Publication year: 2013
Volume: 4
Issue: 3
Pages: 286-289
Language: English
Publisher: Ayushmaan Technologies
Abstract: Duplicate detection is the problem of detecting different entries in a data source that represent the same real-world entity. While research abounds on duplicate detection in relational data, there is as yet little work on duplicates in other, more complex data models, such as XML. Our research in XML duplicate detection addresses four major challenges. First, we investigate how object descriptions can be selected automatically, a difficult task in XML, where objects and object descriptions are both represented by XML elements. Second, we define new domain-independent duplicate classifiers that take into account not only the data but also the structural diversity of XML objects. Third, we define comparison strategies that make use of element dependencies to improve efficiency without jeopardizing effectiveness. Finally, we consider scalability by investigating how relational and XML databases can support the duplicate detection process. By considering the problem of XML duplicate detection under the aspects of effectiveness, efficiency, and scalability, we believe that our insights and solutions will contribute significantly to solving XML duplicate detection for a wide range of applications.