Article Information

  • Title: Review on Extracting the Web Data through Deep Web Interfaces, Mechanism
  • Authors: Anand Kumar; Rahul Kumar; Sachin Nigle
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Publication year: 2016
  • Volume: 4
  • Issue: 1
  • Page: 473
  • DOI: 10.15680/IJIRCCE.2016.0401101
  • Publisher: S&S Publications
  • Abstract: As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage, SmartCrawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve more accurate results for a focused crawl, SmartCrawler ranks websites to prioritize highly relevant ones for a given topic. In the second stage, SmartCrawler achieves fast in-site searching by excavating the most relevant links with adaptive link ranking. To eliminate bias toward visiting some highly relevant links in hidden web directories, we design a link tree data structure to achieve wider coverage for a website. Our experimental results on a set of representative domains show the agility and accuracy of our proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.
  • Keywords: Web Crawler; Robot Elimination protocol; Design Issues; Policies and Algorithm; Crawl Techniques
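
The two stages described in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' SmartCrawler implementation: the function names, the `TOPIC_TERMS` vocabulary, and the keyword-overlap score are all hypothetical stand-ins, and grouping links by the first path segment is only a simplified approximation of the link tree idea. No pages are fetched; sites and links are passed in as plain data.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical topic vocabulary; a real crawler would learn or configure this.
TOPIC_TERMS = ("book", "search", "catalog")


def relevance(text, terms=TOPIC_TERMS):
    """Toy score (an assumption): fraction of topic terms that occur in the text."""
    lowered = text.lower()
    return sum(term in lowered for term in terms) / len(terms)


def rank_sites(sites):
    """Stage 1 (sketch): order candidate sites, most relevant first."""
    return sorted(sites, key=lambda s: relevance(s["homepage_text"]), reverse=True)


def build_link_tree(links):
    """Group (url, anchor_text) pairs by the first directory in the URL path."""
    tree = defaultdict(list)
    for url, anchor in links:
        segments = urlparse(url).path.strip("/").split("/")
        tree[segments[0]].append((url, anchor))
    return tree


def pick_links(links, budget):
    """Stage 2 (sketch): take the best links round-robin across directories,
    so that one highly relevant directory does not absorb the whole budget."""
    tree = build_link_tree(links)
    for bucket in tree.values():
        bucket.sort(key=lambda link: relevance(link[1]), reverse=True)
    # Visit directories in order of their best-scoring link.
    directories = sorted(tree, key=lambda d: relevance(tree[d][0][1]), reverse=True)
    picked = []
    while len(picked) < budget and any(tree.values()):
        for directory in directories:
            if tree[directory] and len(picked) < budget:
                picked.append(tree[directory].pop(0)[0])
    return picked
```

With a budget of two links, this sketch takes the best link from each of two directories rather than two links from the single most relevant directory, which is the coverage trade-off the abstract attributes to the link tree.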