Journal: International Journal of Innovative Research in Science, Engineering and Technology
Print ISSN: 2347-6710
Online ISSN: 2319-8753
Year: 2017
Volume: 6
Issue: 9
Pages: 18711
DOI: 10.15680/IJIRSET.2017.0609176
Publisher: S&S Publications
Abstract: The World Wide Web is a massive collection of billions of web pages containing terabytes of data, arranged on various servers using HTML. General-purpose crawlers face severe scaling challenges because of the rapid growth of the internet. A web crawler is an automated tool that traverses the web and extracts webpages to gather information. In an intelligent focused web crawler, the crawler starts with a specific defined topic and crawls the relevant webpages based on the defined search criteria. In this project, a new intelligent focused crawler is proposed. The goal of the focused crawler is to identify and report the most relevant pages, limiting the search scope to pages that meet the predefined relevance factors. This reduces the network and hardware resources used, which in turn leads to cost savings and improves the efficiency and accuracy of the stored crawl data. For this purpose, it uses a "Reverse Searching Strategy". With this aim in mind, a two-level framework is used for efficient searching and gathering of deep and hidden web interfaces. In the first stage, it uses search engines to identify main pages, which avoids visiting irrelevant pages. After identifying the pages, the intelligent focused web crawler prioritizes them, ranking the pages by their relevance to the search topic. In the second stage, the crawler searches inside the websites for relevant information based on the defined search criteria. An HTML and JavaScript parser is developed to deal with dynamic pages. Moreover, a report on crawled URLs is published after crawling, listing all crawled URLs and the errors found.
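
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the relevance measure or the details of the Reverse Searching Strategy, so the sketch assumes a simple keyword-frequency score, a hypothetical in-memory set of pages (`PAGES`) in place of real search-engine results and HTTP fetches, and hypothetical helper names (`fetch`, `relevance`, `focused_crawl`). Seed pages stand in for the main pages found in the first stage; a priority queue then orders in-site links by the relevance of the page that linked to them, and a report of crawled, skipped, and failed URLs is returned.

```python
import heapq
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would obtain seeds from a search engine and fetch over HTTP.
PAGES = {
    "http://a.example/": (
        "web crawler design and crawler ranking",
        ["http://a.example/deep", "http://a.example/missing"],
    ),
    "http://a.example/deep": ("focused crawler search forms", []),
    "http://b.example/": (
        "cooking recipes and gardening tips",
        ["http://b.example/more"],
    ),
    "http://b.example/more": ("more recipes", []),
}


def fetch(url):
    """Stand-in for an HTTP fetch; raises KeyError for unknown URLs."""
    return PAGES[url]


def relevance(text, topic_terms):
    """Assumed scoring rule: fraction of words that are topic terms."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def focused_crawl(seeds, topic_terms, threshold=0.2):
    """Crawl from seed pages, expanding only pages that score >= threshold."""
    report = {"crawled": [], "skipped": [], "errors": []}
    # heapq is a min-heap, so priorities are stored negated.
    # Seeds get top priority; links inherit the score of their parent page.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    while frontier:
        _, url = heapq.heappop(frontier)
        try:
            text, links = fetch(url)
        except KeyError:
            report["errors"].append(url)
            continue
        score = relevance(text, topic_terms)
        if score < threshold:
            # Irrelevant page: record it and do not follow its links.
            report["skipped"].append(url)
            continue
        report["crawled"].append((url, round(score, 2)))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return report


if __name__ == "__main__":
    topic = {"crawler", "web", "search", "focused"}
    result = focused_crawl(["http://a.example/", "http://b.example/"], topic)
    for key, entries in result.items():
        print(key, entries)
```

Pruning below the threshold is what limits the search scope here: links on an irrelevant page are never queued, which is one simple way to save the network and hardware resources the abstract mentions.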