Journal: International Journal of Advanced Computer Research
Print ISSN: 2249-7277
Electronic ISSN: 2277-7970
Publication Year: 2011
Volume: 2011
Publisher: Association of Computer Communication Education for National Triumph (ACCENT)
Abstract: A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats, such as HTML, PDF, and plain text, for different audiences. Documents may be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. The removal of replicate documents is necessary not only to reduce runtime but also to improve search accuracy. Today, search engine crawlers retrieve billions of unique URLs, of which hundreds of millions are replicates of some form; quickly identifying replicates therefore expedites indexing and searching. One vendor's analysis of 1.2 billion URLs found 400 million exact replicates with an MD5 hash. Reducing collection sizes by tens of percentage points results in great savings in indexing time and a reduction in the amount of hardware required to support the system. Last, and probably most significant, users benefit from the elimination of replicate results: by efficiently presenting only unique documents, user satisfaction is likely to increase.
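As a rough illustration of the MD5-based exact-replicate detection the abstract refers to, the sketch below (not taken from the paper; the URLs, document contents, and function names are hypothetical) hashes each document's bytes and groups documents that share an identical digest. Near-replicates, such as reformatted or lightly edited mirrors, would not be caught by this exact-match approach.

```python
import hashlib


def md5_fingerprint(content: bytes) -> str:
    """Return the MD5 hex digest used as an exact-replicate fingerprint."""
    return hashlib.md5(content).hexdigest()


def find_exact_replicates(documents: dict[str, bytes]) -> dict[str, list[str]]:
    """Group document identifiers (e.g. URLs) by identical MD5 digests.

    Only documents whose bytes match exactly collapse into one group.
    """
    groups: dict[str, list[str]] = {}
    for url, content in documents.items():
        groups.setdefault(md5_fingerprint(content), []).append(url)
    # Keep only digests shared by more than one document: these are exact replicates.
    return {digest: urls for digest, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    # Hypothetical crawl sample: two mirrored copies and one distinct page.
    docs = {
        "http://example.com/a.html": b"<html>same body</html>",
        "http://mirror.example.org/a.html": b"<html>same body</html>",
        "http://example.com/b.html": b"<html>different body</html>",
    }
    print(find_exact_replicates(docs))
```

At crawl scale, the same idea is typically applied by storing only the fixed-size digests rather than full document contents, which is what makes deduplicating hundreds of millions of URLs tractable.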