Journal: International Journal of Computer Technology and Applications
Electronic ISSN: 2229-6093
Year of publication: 2012
Volume: 3
Issue: 1
Pages: 231-234
Publisher: Technopark Publications
Abstract: This paper refines the provenance matrix by adding two more factors, ‘How’ and ‘Why’, to detect near-duplicates more accurately and efficiently. Web search performance depends on results that are free of duplicate or redundant information. Redundancy costs more time and more storage, which is why search engines try to avoid indexing duplicate documents. The provenance model combines content-based and trust-based factors to classify documents as near-duplicates or originals, since many near-duplicates nowadays come from untrusted websites.
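
The abstract does not specify the provenance matrix or its scoring rules, so the following is only a minimal sketch of the general idea of combining a content-based signal with a trust-based signal. Everything in it is an assumption made for illustration: word-shingle Jaccard similarity stands in for the content factor, a single `source_trust` number in [0, 1] stands in for the trust factor, and the function names, thresholds, and output labels are hypothetical rather than taken from the paper.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(candidate_text, reference_text, source_trust,
             sim_threshold=0.5, trust_threshold=0.5):
    """Hypothetical decision rule (not the paper's): content first, then trust.

    source_trust is an assumed trust score in [0, 1] for the candidate's site.
    """
    sim = jaccard(shingles(candidate_text), shingles(reference_text))
    if sim < sim_threshold:
        return "distinct"            # too little shared content
    if source_trust < trust_threshold:
        return "near-duplicate"      # similar content from a low-trust source
    return "possible original"       # similar content from a trusted source


text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown fox jumps over the lazy cat"
print(classify(text_b, text_a, source_trust=0.2))
print(classify(text_b, text_a, source_trust=0.9))
```

In this sketch the trust score only breaks ties between documents that already share most of their content, which is one simple way to reflect the abstract's point that many near-duplicates come from untrusted sites.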