Basic Article Information

  • Title: The Study of Detecting Replicate Documents Using MD5 Hash Function
  • Authors: Mr. Pushpendra Singh Tomar; Dr. Maneesh Shreevastava
  • Journal: International Journal of Advanced Computer Research
  • Print ISSN: 2249-7277
  • Electronic ISSN: 2277-7970
  • Publication year: 2012
  • Volume: 2012
  • Publisher: Association of Computer Communication Education for National Triumph (ACCENT)
  • Abstract: A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats (HTML, PDF, and text) for different audiences. Documents may be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. The removal of replicate documents is necessary, not only to reduce runtime, but also to improve search accuracy. Today, search engine crawlers are retrieving billions of unique URLs, of which hundreds of millions are replicates of some form. Thus, quickly identifying replicates expedites indexing and searching. One vendor's analysis of 1.2 billion URLs resulted in 400 million exact replicates found with an MD5 hash. Reducing collection sizes by tens of percentage points results in great savings in indexing time and a reduction in the amount of hardware required to support the system. Last, and probably more significant, users benefit from the elimination of replicate results. By efficiently presenting only unique documents, a system is likely to increase user satisfaction.
  • Keywords: Unique documents; Detecting Replicate; Replication; Search engine
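
The abstract mentions exact replicates being found with an MD5 hash. As a rough illustration only, and not necessarily the authors' procedure, the idea can be sketched in Python with the standard `hashlib` module. The function names `md5_digest` and `group_exact_replicates` and the sample documents are hypothetical, chosen for this sketch.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def md5_digest(text: str) -> str:
    """Return the hex MD5 digest of a document's UTF-8 bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def group_exact_replicates(docs: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose contents share the same MD5 digest.

    Assumption for this sketch: equal digests are treated as exact
    replicates, and hash collisions are ignored.
    """
    by_digest = defaultdict(list)
    for doc_id, text in docs.items():
        by_digest[md5_digest(text)].append(doc_id)
    # Keep only digests shared by more than one document.
    return [ids for ids in by_digest.values() if len(ids) > 1]


if __name__ == "__main__":
    # Hypothetical sample collection keyed by URL.
    sample = {
        "http://example.com/a": "The same page body.",
        "http://mirror.example.org/a": "The same page body.",
        "http://example.com/b": "A different page body.",
    }
    for group in group_exact_replicates(sample):
        print("exact replicates:", group)
```

A byte-level hash such as this catches only exact copies. The same content served as HTML, PDF, and text produces different digests, so near-replicates would need normalization or a different technique.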