Journal: International Journal of Computer Technology and Applications
Electronic ISSN: 2229-6093
Publication year: 2012
Volume: 3
Issue: 6
Pages: 2066-2072
Publisher: Technopark Publications
Abstract: We propose a highly efficient and scalable duplicate-search technique based on a hash algorithm, which requires very low computational and memory cost. Cloud-based computing is an emerging practice that offers significantly more infrastructure and financial flexibility than traditional computing models. Larger enterprises may have implemented very strong security approaches that may or may not be matched by cloud providers, but security should not simply be assumed to be a problem; look for the same type of security functionality you would expect from an in-house solution. Documents may be mirrored to avoid delays or to provide fault tolerance. Detecting replicate documents, the task of our algorithm RDDA, is critical in applications where data is obtained from multiple sources. Removing replicate documents is necessary not only to reduce run time but also to improve search accuracy. Today, search engine crawlers retrieve billions of unique URLs, of which hundreds of millions are replicates of some form. The algorithm rapidly compares large numbers of files for identical content by computing the SHA-256 hash of each file and treating files with matching hashes as replicates. The probability of two non-identical files having the same hash, even in a hypothetical directory containing millions of files, is exceedingly small. Efficiently presenting only unique documents is likely to increase user satisfaction.
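
The hash-based comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the paper's RDDA implementation, whose details the abstract does not give. The function names `sha256_of_file` and `find_replicates`, the chunk size, and the choice to walk a directory tree recursively are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks.

    Chunked reading keeps memory use constant regardless of file size.
    The chunk size is an arbitrary choice for this sketch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_replicates(root):
    """Group regular files under `root` by digest (hypothetical helper).

    Returns a dict mapping each digest shared by two or more files
    to the list of paths that have that digest.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[sha256_of_file(path)].append(path)
    # Only digests seen more than once indicate replicate content.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Each file is read once, so the cost grows linearly with the total data size, and only one digest per file is kept in memory. On the collision claim: if SHA-256 is assumed to behave like an ideal 256-bit hash, the birthday bound puts the chance of any collision among n files at roughly n^2 / 2^257, which for a million files is on the order of 10^-66.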