Article Information

  • Title: Hide the Duplicate WebPages
  • Authors: Bolla Anil Kumar; Satya P Kumar Somayajula
  • Journal: International Journal of Computer Science & Technology
  • Print ISSN: 2229-4333
  • Electronic ISSN: 0976-8491
  • Publication year: 2011
  • Volume: 2
  • Issue: 3
  • Publisher: Ayushmaan Technologies
  • Abstract: Duplicate records often do not share a common key, or they contain errors, which makes duplicate matching a difficult task. These errors arise from transcription mistakes, incomplete information, a lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and a brief discussion of the major open problems in the area.
  • Keywords: Meta data; Info Quilt architecture; Semantic web; Encapsulation; agents; Meta base; correlation agents