Article information

  • Title: MCMTCrawler: a Multi-Computer and Multi-Thread Vertical Crawler
  • Authors: Ziyun Deng; Lei Chen; Tingqin He
  • Journal: Engineering Letters
  • Print ISSN: 1816-093X
  • Electronic ISSN: 1816-0948
  • Publication year: 2018
  • Volume: 26
  • Issue: 3
  • Pages: 313-319
  • Publisher: Newswood Ltd
  • Abstract: To optimize the structure of open-source crawlers and improve the performance of standalone crawlers, we design a new multi-computer, multi-thread vertical crawler called MCMTCrawler. MCMTCrawler can complete a specialized crawling task on a large business website within a few hours. It uses Berkeley DB to persist the queue of waiting Uniform Resource Locators (URLs) and the queue of downloaded URLs. The MD5 algorithm maps each URL to a 32-character string. MCMTCrawler employs the Producer-Consumer model to assign and process URLs. Following the design ideas of Aspect-Oriented Programming (AOP) and Dependency Injection (DI) in Spring, the scheduler and the downloader of MCMTCrawler are designed as separate components to speed up the crawler. According to the experimental results, with three download servers MCMTCrawler is five times as fast as a single-computer, single-process crawler and three times as fast as Crawler4j, a single-computer, multi-thread crawler. Furthermore, MCMTCrawler takes only 6.83 hours to crawl 600,000 web pages.
  • Keywords: MCMTCrawler; Multi-Computer and Multi-Thread; Vertical Crawler; Design idea
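
The abstract mentions two mechanisms that a short sketch may make easier to picture: mapping each URL to a 32-character MD5 string, and using a Producer-Consumer model to assign and process URLs. The Python below is only an illustration under stated assumptions, not the authors' implementation. An in-memory dictionary and `queue.Queue` stand in for the Berkeley DB-backed queues, everything runs on one machine, and the names `url_key`, `crawl`, and `fetch_links` are hypothetical.

```python
import hashlib
import queue
import threading


def url_key(url: str) -> str:
    """Map a URL to its 32-character hexadecimal MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def crawl(seed_urls, fetch_links, num_workers=4):
    """Producer-consumer crawl over an abstract link source.

    fetch_links(url) is a hypothetical callable that returns the URLs
    found on a page; no network access happens in this sketch.
    Returns a dict mapping each MD5 key to the URL it was derived from.
    """
    waiting = queue.Queue()  # in-memory stand-in for the waiting URL queue
    seen = {}                # in-memory stand-in for the persisted URL store
    lock = threading.Lock()

    def enqueue(url):
        # Producer side: skip URLs whose MD5 key is already recorded.
        key = url_key(url)
        with lock:
            if key in seen:
                return
            seen[key] = url
        waiting.put(url)

    def worker():
        # Consumer side: take a URL, process it, enqueue discovered links.
        while True:
            url = waiting.get()
            if url is None:  # sentinel: no more work
                waiting.task_done()
                break
            try:
                for link in fetch_links(url):
                    enqueue(link)
            finally:
                waiting.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in seed_urls:
        enqueue(url)
    waiting.join()           # wait until every queued URL is processed
    for _ in threads:
        waiting.put(None)    # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Calling `crawl(["http://example.com/"], fetch_links)` with a `fetch_links` function that looks pages up in a small dictionary is enough to exercise the sketch. Keying the deduplication store by the fixed-length digest rather than the raw URL keeps every key the same size, which is one plausible reason to hash URLs before storing them in a key-value store.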