Basic Article Information

  • Title: Smart Web Crawler to Harvest the Invisible Web World
  • Authors: Nimisha Jain; Pragya Sharma; Saloni Poddar
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Publication year: 2016
  • Volume: 4
  • Issue: 4
  • Page: 7490
  • DOI: 10.15680/IJIRCCE.2016.0404230
  • Publisher: S&S Publications
  • Abstract: In the growing world of the internet, thousands of web pages are added daily, but only about 0.03 percent of web pages are retrieved by all the search engines. The remaining pages are deep websites. The deep web is the part of the web that is hidden from, and unrecognizable by, existing search engines. Harvesting this ample volume of data has been a longstanding challenge for existing crawlers and web search engines. This paper surveys different methods of crawling the deep web. The density and rapidly changing nature of the deep web pose a big hurdle for researchers. To overcome this issue, we propose a dual-layer framework, namely Smart Web Crawler, for efficiently harvesting deep web interfaces. Smart Web Crawler consists of two layers: Site-discovery and In-depth Crawling. The Site-discovery layer finds the sparsely located deep websites from given known parent sites using Reverse Searching and focused crawling. The In-depth Crawling layer makes use of Smart Learning and Prioritizing to crawl hyperlinks within a site to ensure wider coverage of web directories.
  • Keywords: Deep Website; HTML Form; Reverse Searching; Prioritizing; Smart Learning; Categorizing; Pre-querying; Post-querying