
Article Information

  • Title: A Novel Architecture for Domain Specific Parallel Crawler
  • Authors: Nidhi Tyagi; Deepti Gupta
  • Journal: Indian Journal of Computer Science and Engineering
  • Print ISSN: 2231-3850
  • Electronic ISSN: 0976-5166
  • Publication year: 2010
  • Volume: 1
  • Issue: 1
  • Pages: 44-53
  • Publisher: Engg Journals Publications
  • Abstract: The World Wide Web is an interlinked collection of billions of documents formatted using HTML. Because the web is growing and dynamic, traversing and handling all the URLs in web documents has become a challenge, which makes it imperative to parallelize the crawling process. The crawling process is parallelized in the form of an ecology of crawler workers that download information from the web in parallel. This paper proposes a novel architecture for a parallel crawler based on domain-specific crawling. The architecture makes the crawling task more effective and scalable, and it shares the load among the crawlers, which download in parallel the web pages related to URLs of different specific domains.
  • Keywords: WWW; URLs; crawling process; parallel crawlers.
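
The load-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture, which this record does not describe in detail. It assumes that "domain" means the last label of a URL's host name (for example `edu` or `com`), and every name in it (`domain_of`, `partition_by_domain`, `crawl_domain`, `parallel_crawl`, the `fetch` callback) is hypothetical. URLs are grouped into one queue per domain, and each queue is handed to its own worker, so the workers download in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def domain_of(url):
    """Return the last label of the host name (e.g. 'edu'), or '' if there is no host.

    Assumption: this is what "domain" means; the abstract does not define it.
    """
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def partition_by_domain(urls):
    """Group URLs so that each domain gets its own work queue."""
    queues = defaultdict(list)
    for url in urls:
        queues[domain_of(url)].append(url)
    return dict(queues)


def crawl_domain(domain, urls, fetch):
    """One crawler worker: process every URL assigned to its domain, in order."""
    return domain, [fetch(url) for url in urls]


def parallel_crawl(urls, fetch, max_workers=4):
    """Submit one worker task per domain queue and collect results by domain."""
    queues = partition_by_domain(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(crawl_domain, domain, queue, fetch)
            for domain, queue in queues.items()
        ]
        return dict(future.result() for future in futures)
```

Here `fetch` is supplied by the caller, so the sketch can be tried without network access by passing a stub function. A real crawler would pass a function that downloads the page and would also need link extraction, politeness delays, and duplicate detection, none of which are shown.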