Article Information

  • Title: DRTX: A Duplicate Resolution Tool for XML Repositories
  • Authors: Randa Mohamed Abd El-ghfar; Ali EL-Bastawissy; Moustafa Abd Elazeem
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year: 2012
  • Volume: 12
  • Issue: 7
  • Pages: 42-49
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: Detecting duplicates in XML is not trivial because of structural diversity and object dependency. This paper proposes DRTX, a duplicate detection and resolution tool for XML that applies two well-known duplicate detection techniques, normal edit distance (NED) and a token-based Damerau-Levenshtein distance algorithm (TBED), then compares their results and suggests which similarity measure is better in each case. Beyond duplicate detection and resolution, DRTX provides two extra services. The first is an XML file merger, which merges XML documents and thereby addresses the structural heterogeneity problem. The second is a dirty XML generator, which inserts known duplicate problems into a clean XML file so that the two algorithms can be run on it to measure how accurately the system detects those problems. To minimize the number of pairwise element comparisons, a set of filters is used to increase the efficiency of DRTX, while its effectiveness remains adjustable. Experimental results show that neither algorithm is better overall; each has its own use: NED performs better at lower similarity thresholds, while TBED performs better at higher ones.
  • Keywords: Duplicates detection; XML; similarity; Data cleaning; efficiency and effectiveness of detection Algorithms
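
The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. This is not the DRTX implementation, which the record does not describe. It assumes that NED means character-level Levenshtein distance scaled by the longer string's length, and that TBED means Damerau-Levenshtein distance (the optimal string alignment variant) computed over whitespace-separated tokens rather than characters. All function names here are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def osa_distance(a, b):
    """Damerau-Levenshtein distance, optimal string alignment variant:
    Levenshtein plus transposition of two adjacent items."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def to_similarity(distance, a, b):
    """Map a distance to [0, 1] by dividing by the longer length (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - distance / longest


def ned_similarity(s, t):
    # Assumed reading of NED: character-level edit distance, normalized.
    return to_similarity(levenshtein(s, t), s, t)


def tbed_similarity(s, t):
    # Assumed reading of TBED: the same idea applied to lowercase word tokens.
    a, b = s.lower().split(), t.lower().split()
    return to_similarity(osa_distance(a, b), a, b)
```

Under these assumptions, two element values would be flagged as duplicates when the chosen similarity meets a user-set threshold, for example `tbed_similarity("John Smith", "Smith John") >= 0.5`. The token-level measure treats swapped words as a single transposition, whereas the character-level measure counts many character edits for the same pair.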