首页    期刊浏览 2024年11月26日 星期二
登录注册

文章基本信息

  • 标题:An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method
  • 其他标题:An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method
  • 作者:Shine N Das ; Midhun Mathew ; Pramod K. Vijayaraghavan
  • 期刊名称:International Journal of Electrical and Computer Engineering
  • 电子版ISSN:2088-8708
  • 出版年度:2011
  • 卷号:1
  • 期号:2
  • 页码:187-194
  • 语种:English
  • 出版社:Institute of Advanced Engineering and Science (IAES)
  • 摘要:The existence of billions of web data has severely affected the performance and reliability of web search. The presence of near duplicate web pages plays an important role in this performance degradation while integrating data from heterogeneous sources. Web mining faces huge problems due to the existence of such documents. These pages increase the index storage space and thereby increase the serving cost. By introducing efficient methods to detect and remove such documents from the Web not only decreases the computation time but also increases the relevancy of search results. We aim a novel idea for finding near duplicate web pages which can be incorporated in the field of plagiarism detection, spam detection and focused web crawling scenarios. Here we propose an efficient method for finding near duplicates of an input web page, from a huge repository. A TDW matrix based algorithm is proposed with three phases, rendering, filtering and verification, which receives an input web page and a threshold in its first phase, prefix filtering and positional filtering to reduce the size of record set in the second phase and returns an optimal set of near duplicate web pages in the verification phase by using Minimum Weight Overlapping (MWO) method. The experimental results show that our algorithm outperforms in terms of two benchmark measures, precision and recall, and a reduction in the size of competing record set.DOI:http://dx.doi.org/10.11591/ijece.v1i2.77
  • 其他摘要:The existence of billions of web data has severely affected the performance and reliability of web search. The presence of near duplicate web pages plays an important role in this performance degradation while integrating data from heterogeneous sources. Web mining faces huge problems due to the existence of such documents. These pages increase the index storage space and thereby increase the serving cost. By introducing efficient methods to detect and remove such documents from the Web not only decreases the computation time but also increases the relevancy of search results. We aim a novel idea for finding near duplicate web pages which can be incorporated in the field of plagiarism detection, spam detection and focused web crawling scenarios. Here we propose an efficient method for finding near duplicates of an input web page, from a huge repository. A TDW matrix based algorithm is proposed with three phases, rendering, filtering and verification, which receives an input web page and a threshold in its first phase, prefix filtering and positional filtering to reduce the size of record set in the second phase and returns an optimal set of near duplicate web pages in the verification phase by using Minimum Weight Overlapping (MWO) method. The experimental results show that our algorithm outperforms in terms of two benchmark measures, precision and recall, and a reduction in the size of competing record set. DOI: http://dx.doi.org/10.11591/ijece.v1i2.77
Loading...
联系我们|关于我们|网站声明
国家哲学社会科学文献中心版权所有