Article Information

Title: Smart Web Crawler to Harvest the Invisible Web World
Authors: Nimisha Jain; Pragya Sharma; Saloni Poddar et al.
Journal: International Journal of Innovative Research in Computer and Communication Engineering
Print ISSN: 2320-9798
Online ISSN: 2320-9801
Publication year: 2016
Volume: 4
Issue: 4
Page: 7490
DOI: 10.15680/IJIRCCE.2016.0404230
Publisher: S&S Publications
Abstract: In the growing world of the internet, thousands of web pages are added daily, but only approx. 0.03 percent of web pages are retrieved by all the search engines. The remaining pages are deep websites. The deep web is the part of the web that is hidden from and unrecognizable by existing search engines. It has been a longstanding challenge for existing crawlers and web search engines to harvest this ample volume of data. This paper surveys different methods of crawling the deep web. The density and rapidly changing nature of the deep web pose a big hurdle for researchers. To overcome this issue, we propose a dual-layer framework, namely Smart Web Crawler, for efficiently harvesting deep web interfaces. Smart Web Crawler consists of two layers: Site-discovery and In-depth Crawling. The Site-discovery layer finds sparsely located deep websites from given known parent sites using Reverse Searching and focused crawling. The In-depth Crawling layer makes use of Smart Learning and Prioritizing to crawl hyperlinks within a site to ensure wider coverage of web directories.
Keywords: Deep Website; HTML Form; Reverse Searching; Prioritizing; Smart Learning; Categorizing; Pre-querying; Post-querying
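
Illustration: the dual-layer idea described in the abstract can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the authors' implementation: the in-memory `TOY_WEB` graph, the function names, and the keyword-count `link_score` heuristic are all hypothetical, and Reverse Searching and Smart Learning are not modeled. The first layer collects candidate sites linked from known parent pages; the second crawls inside one site with a priority queue so that promising links are visited first.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical toy web: URL -> (outgoing links, page contains an HTML search form)
ToyWeb = Dict[str, Tuple[List[str], bool]]

TOY_WEB: ToyWeb = {
    "seed.com": (["books.com", "cars.com"], False),
    "books.com": (["books.com/about", "books.com/search"], False),
    "books.com/about": ([], False),
    "books.com/search": ([], True),
    "cars.com": (["cars.com/advanced-search", "cars.com/news"], False),
    "cars.com/advanced-search": ([], True),
    "cars.com/news": ([], False),
}


def site_of(url: str) -> str:
    """Return the host part of a toy URL such as 'a.com/page'."""
    return url.split("/", 1)[0]


def link_score(url: str, keywords: List[str]) -> int:
    """Assumed heuristic: number of keywords that occur in the URL."""
    return sum(1 for k in keywords if k in url.lower())


def discover_sites(web: ToyWeb, seeds: List[str]) -> List[str]:
    """Layer 1 (simplified): sites linked from the known parent pages."""
    sites: List[str] = []
    for seed in seeds:
        for link in web.get(seed, ([], False))[0]:
            site = site_of(link)
            if site != site_of(seed) and site not in sites:
                sites.append(site)
    return sites


def in_depth_crawl(web: ToyWeb, start: str, keywords: List[str], budget: int) -> List[str]:
    """Layer 2 (simplified): crawl one site, highest-scoring links first."""
    site = site_of(start)
    frontier = [(-link_score(start, keywords), start)]  # max-heap via negated score
    seen = {start}
    forms: List[str] = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        budget -= 1
        links, has_form = web.get(url, ([], False))
        if has_form:
            forms.append(url)
        for link in links:
            if site_of(link) == site and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-link_score(link, keywords), link))
    return forms


for site in discover_sites(TOY_WEB, ["seed.com"]):
    print(site, in_depth_crawl(TOY_WEB, site, ["search"], budget=2))
```

With a budget of two page visits per site, the sketch reaches each site's search form before lower-scoring pages such as `books.com/about`, which is the point of prioritizing in-site links.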