Article Information

Title: A Novel Architecture for Domain Specific Parallel Crawler
Authors: Nidhi Tyagi; Deepti Gupta
Journal: Indian Journal of Computer Science and Engineering
Print ISSN: 2231-3850
Online ISSN: 0976-5166
Publication year: 2010
Volume: 1
Issue: 1
Pages: 44-53
Publisher: Engg Journals Publications
Abstract: The World Wide Web is an interlinked collection of billions of documents formatted using HTML. Because the web is growing and dynamic, traversing and handling all the URLs in web documents has become a challenge, which makes it imperative to parallelize the crawling process. The crawling process is parallelized in the form of an ecology of crawler workers that download information from the web in parallel. This paper proposes a novel parallel crawler architecture based on domain-specific crawling, which makes the crawling task more effective and scalable and shares the load among crawlers that download, in parallel, web pages related to URLs of different specific domains.
Keywords: WWW; URLs; crawling process; parallel crawlers.