Article Information

Title: Improved Focused Crawler Using Inverted WAH Bitmap Index
Authors: Sanjay Kumar Singh; Sonu Agrawal
Journal: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)
Print ISSN: 2278-1323
Publication year: 2012
Volume: 1
Issue: 4
Pages: 407-409
Publisher: Shri Pannalal Research Institute of Technology
Abstract: Focused crawlers are programs that traverse the Internet and retrieve web pages by following hyperlinks related to a specific topic. Traditional web crawlers do not retrieve relevant pages effectively. A focused crawler is a special-purpose search engine that selectively seeks out relevant pages: it does not need to collect all web pages, but selects and retrieves only the relevant ones. The major problem is therefore how to retrieve the maximal set of relevant, high-quality pages. To address this problem, we have designed an Interactive Focused Crawler that calculates the relevancy of a web page. It computes a URL score to identify whether a URL is relevant to a specific topic. The Interactive Focused Crawler proceeds by gathering pages related to the seed set using techniques such as keyword extraction, search engine queries, and link neighbourhood expansion. The collected pages are then presented to the user in a ranked order that facilitates quick elimination of negatives. The user then provides feedback, which helps the baseline classifier to be progressively induced using active learning techniques. Once the classifier is in place, the crawler can begin its task of resource discovery.
Keywords: classifier; focused crawler; keyword extraction; URL